encoding, top1000.train, qrels.train

spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET

MIT License

Hey @dfcf93, I have some questions:

What is the encoding of the text? I read it as UTF-8, but I see incorrectly encoded characters in collections.tsv. Also, for some passages in triples.train.small, I was not able to find the passageId in collections.tsv.
In top1000.train, there are NOT 1000 passages for each query, only 1-10. In top1000.dev, by contrast, most queries (but not all) do have 1000 passages.
In qrels.train, there are 4 columns. According to the documentation, column 0 is queryID and column 2 is passageID. What are columns 1 and 3? Which one is is_selected, and what is the other? Here are examples from the documentation:
```
1185868 0       16      1
597651  0       49      1
403613  0       60      1
```
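One way to narrow down what columns 1 and 3 hold is to count the distinct values that appear in each column. The following is only a minimal sketch, assuming the file is whitespace-separated as in the sample above; `summarize_qrels` is a hypothetical helper name, not something from the repository.

```python
from collections import Counter


def summarize_qrels(lines):
    """Count how often each value occurs in each whitespace-separated column.

    Hypothetical helper: assumes one judgment per line, columns split on
    any run of spaces or tabs (as in the sample rows above).
    """
    column_values = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for index, value in enumerate(fields):
            column_values.setdefault(index, Counter())[value] += 1
    return column_values


# The three sample rows quoted from the documentation.
sample = [
    "1185868 0       16      1",
    "597651  0       49      1",
    "403613  0       60      1",
]

summary = summarize_qrels(sample)
for index in sorted(summary):
    print(f"column {index}: {len(summary[index])} distinct value(s)")
```

On the three sample rows, column 1 is always 0 and column 3 is always 1. If the same holds over the whole file (passing an open file handle instead of `sample` should work, since the function only iterates over lines), that would be consistent with the usual TREC qrels layout of query ID, iteration, document ID, relevance, with column 3 playing the role of is_selected. That reading is an assumption that the maintainers would need to confirm.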
