spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

Full Document May be incorrect tokenization in document_text #18

Closed daltonj closed 5 years ago

daltonj commented 5 years ago

I am inspecting the contents of fulldocuments.json and notice possible issues with the "document_text" field. In particular, the text seems to be somewhat incorrect (possibly because breaks are not being inserted?).

Examples:

https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR
Example text: show moreFollow 3 answersAnswersRelevanceRatingNewestOldestBest Answer

http://childparenting.about.com/od/physicalemotionalgrowth/tp/Child-Development-Your-Eight-Year-Old-Child.htm
Example text: .4 Cognitive DevelopmentTom Merton/Getty ImagesEight-year-old children are at a stage of intellectual development where they will be able to pay attention for longer periods of time.
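One rough way to spot this pattern across the file is to count places where a lowercase letter is immediately followed by an uppercase one, which is what fused fragments such as "moreFollow" or "DevelopmentTom" look like. This is only a heuristic sketch: the function names `count_fused` and `split_fused` are made up for illustration, and legitimate camelCase words (product names, for example) will show up as false positives.

```python
import re

# Heuristic: a lowercase letter directly followed by an uppercase letter
# often marks a spot where two text fragments were joined without a separator.
FUSED_BOUNDARY = re.compile(r"([a-z])([A-Z])")


def count_fused(text):
    """Count lowercase-to-uppercase transitions that have no space between them."""
    return len(FUSED_BOUNDARY.findall(text))


def split_fused(text):
    """Insert a space at each suspected boundary (for inspection, not as a real fix)."""
    return FUSED_BOUNDARY.sub(r"\1 \2", text)


if __name__ == "__main__":
    sample = "show moreFollow 3 answersAnswersRelevanceRatingNewestOldestBest Answer"
    print(count_fused(sample))
    print(split_fused(sample))
```

Running `count_fused` over every "document_text" value and sorting by the result would give a quick list of the documents most affected.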

spacemanidol commented 5 years ago

It's worth noting that the documents produced are not necessarily perfectly clean. The technology used to produce them is primarily focused on generating documents that contain only the truly relevant content. In other words, the system that produces them tries its best to remove any menus, sidebars, and other artifacts that may have term matches but aren't relevant. Since it's a learned model, its behavior is not perfect.

daltonj commented 5 years ago

I agree, that makes sense. Still, I would expect the model to insert spaces or some other separator between structural parts of the document. Are these perhaps just getting removed as part of the conversion to JSON with one per line? My guess is that the fused text comes from line breaks being removed and not replaced with a delimiter.
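To illustrate the guess, here is a minimal sketch of how dropping line breaks without a replacement reproduces the fused text, next to a variant that substitutes a delimiter. I do not know how the actual conversion works; the input blocks and both function names are assumptions made up for this example.

```python
import json

# Hypothetical extracted text: one structural block per line.
raw = "Cognitive Development\nTom Merton/Getty Images\nEight-year-old children are at a stage"


def flatten_dropping_breaks(text):
    """Remove line breaks with no replacement (the suspected behavior)."""
    return "".join(text.splitlines())


def flatten_with_delimiter(text, delimiter=" "):
    """Replace line breaks with a delimiter so block boundaries survive."""
    parts = (line.strip() for line in text.splitlines())
    return delimiter.join(part for part in parts if part)


if __name__ == "__main__":
    print(flatten_dropping_breaks(raw))
    print(flatten_with_delimiter(raw))
    # json.dumps escapes newlines as \n, so the record still fits on a single
    # line even when the breaks are kept in the field value.
    print(json.dumps({"document_text": raw}))
```

Since `json.dumps` escapes newlines inside strings, a one-record-per-line file does not require stripping them from "document_text" in the first place; keeping them, or replacing them with a space, would both preserve the boundaries.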