spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

Collection paragraph metadata #19

Closed daltonj closed 5 years ago

daltonj commented 5 years ago

The collection.tsv contains the paragraph contents. How do we get the metadata? It's in the full documents, but I don't see how these can easily be linked.

spacemanidol commented 5 years ago

Ah. You have stumbled onto one of the bad parts of the dataset.

Short answer: We are unable to release any metadata for this passage ranking task due to legal and temporal constraints. For subsequent releases, what kind of metadata would you be interested in?

Long answer: At best, the full web documents and the paragraphs in collection.tsv are loosely joined solely by URL. The web documents don't really serve a purpose per se; they are there in case you want to use them to generate some embeddings or an IDF lookup table. I should probably write more about this in the README or datasheet.

Basically, you can think of the paragraph generation as a batch task. Every N weeks we would run our query sampling script, generate the passages, and send them off to judges. Then, once the dataset was complete, I took all the unique URLs in the dataset and ran them through a pipeline we have in Bing to produce the documents. This document extraction happened in early April 2018. Initially we wanted to use these documents for a more difficult Q&A task, but upon inspection there was a temporal disconnect between the passages and the documents.

daltonj commented 5 years ago

Let me be more specific. The collection.tsv doesn't have the basics (the URL, the title), only text. These metadata fields are included in fulldocuments, but how are these linked back to paragraphs?

Maybe I'm missing something obvious. A README on the data and how to use them / join them would be helpful. I believe some students have an earlier version where the paragraphs were grouped by query with titles / URLs. But I don't see this for the ranking dataset. A description of what's in each would be helpful.

spacemanidol commented 5 years ago

Answered earlier on Slack, but the simple answer is that there is no metadata in the collection file; you can do a lookup in the regular Q&A file and do a join on passages. I'll add this to my backlog to create.
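
For anyone landing here later, a minimal sketch of that join might look like the following. It is not an official script. It assumes that collection.tsv holds one `pid<TAB>passage text` pair per line and that the parsed Q&A JSON has a top-level `passages` key mapping each query id to a list of records with `passage_text` and `url` fields. The function names are made up for illustration. Matching is on whitespace-normalized passage text, so passages that differ in any other way will simply come back with no URL.

```python
from collections import defaultdict


def normalize(text):
    """Collapse whitespace so minor formatting differences do not break the join."""
    return " ".join(text.split())


def build_text_to_urls(qna):
    """Map normalized passage text to the set of URLs it appears with.

    ASSUMPTION: `qna` is the parsed Q&A JSON with a top-level "passages" key
    mapping each query id to a list of {"passage_text", "url", ...} dicts.
    """
    text_to_urls = defaultdict(set)
    for passages in qna["passages"].values():
        for passage in passages:
            text_to_urls[normalize(passage["passage_text"])].add(passage["url"])
    return text_to_urls


def join_collection(collection_lines, text_to_urls):
    """Yield (pid, passage_text, urls) for each "pid<TAB>text" line.

    `urls` is an empty list when the passage text is not found in the Q&A data.
    """
    for line in collection_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        pid, text = line.split("\t", 1)
        yield pid, text, sorted(text_to_urls.get(normalize(text), ()))


if __name__ == "__main__":
    # Toy data standing in for the real files (hypothetical contents).
    toy_qna = {
        "passages": {
            "q1": [{"passage_text": "Some passage text.", "url": "http://example.com/a"}]
        }
    }
    toy_collection = ["0\tSome passage text.\n", "1\tA passage with no match.\n"]
    lookup = build_text_to_urls(toy_qna)
    for pid, text, urls in join_collection(toy_collection, lookup):
        print(pid, urls)
```

With the real data you would pass the result of `json.load` on the Q&A file as `qna` and an open file handle on collection.tsv as `collection_lines`. The same passage can show up under several queries, which is why each passage maps to a list of URLs rather than a single one.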