spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

encoding, top1000.train, qrels.train #25

Closed linxihui closed 5 years ago

linxihui commented 5 years ago

Hey @dfcf93, I have some questions:

  1. What is the encoding of the text? I read it as utf-8, but I see incorrectly encoded characters in collections.tsv. Also, for some passages in triples.train.small, I was not able to find the passageId in collections.tsv.
  2. In top1000.train, there are NOT 1000 passages for each query, only 1-10 passages. In top1000.dev, by contrast, most queries (but not all) do have 1000 passages.
  3. In qrels.train, there are 4 columns. According to the documentation, column 0 is queryID and column 2 is passageID. What are columns 1 and 3? Which one is is_selected, and what is the other? Here are examples from the documentation.
    1185868 0       16      1
    597651  0       49      1
    403613  0       60      1
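
For anyone who wants to reproduce the counts in question 2, here is a minimal sketch. It assumes the top1000 files are tab-separated with the query ID in the first column; that layout is an assumption, not something confirmed in this thread. The function name `passages_per_query` and the sample rows are made up for illustration.

```python
import io
from collections import Counter

def passages_per_query(lines, qid_column=0):
    """Count how many candidate rows each query ID has in a tab-separated file."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        counts[fields[qid_column]] += 1
    return counts

# Made-up sample rows; the column layout (qid, pid, query, passage) is assumed.
sample = io.StringIO(
    "q1\tp1\tsome query\tsome passage\n"
    "q1\tp2\tsome query\tanother passage\n"
    "q2\tp3\tother query\tthird passage\n"
)
counts = passages_per_query(sample)

# Queries with fewer than 1000 candidates.
short = {qid: n for qid, n in counts.items() if n < 1000}
print(short)
```

On a real file you would pass an open file handle, e.g. `open(path, encoding="utf-8", errors="replace")`, so that any badly encoded bytes do not abort the count.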
spacemanidol commented 5 years ago

Hey, sorry for the delay.

  1. The encoding of the text should be utf-8, but it seems that there may have been some corruption at some point. If there are differences, they should be minor. If you give me some specifics, I can look into it.
  2. Yes, this is true. There are a few queries where BM25 did not return 1000 results, and thus some queries do not have 1000 passages. This should not be a big issue.
  3. Columns 1 and 3 are residuals from joins that we did not remove. Please go ahead and ignore them. Any pair listed there represents is_selected, i.e. a row with QID, 0, PID, 1 means that the PID had is_selected set to 1 for the QID.
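
As an illustration of point 3, here is a minimal sketch of reading qrels rows while ignoring columns 1 and 3. The helper name `load_qrels` is hypothetical, and the sample rows are the ones quoted earlier in this thread. It assumes the columns are whitespace-separated.

```python
from collections import defaultdict

def load_qrels(lines):
    """Map each query ID to the set of passage IDs that appear with it."""
    relevant = defaultdict(set)
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _unused1, pid, _unused3 = fields  # columns 1 and 3 are ignored
        relevant[qid].add(pid)
    return relevant

# Sample rows copied from the question above.
sample_rows = [
    "1185868 0       16      1",
    "597651  0       49      1",
    "403613  0       60      1",
]
qrels = load_qrels(sample_rows)
print(dict(qrels))
```

Every pair kept in the mapping is treated as is_selected = 1, so a passage that is absent from a query's set is simply not marked as selected for that query.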