What is the encoding of the text? I read it as UTF-8, but I see incorrectly encoded text in collections.tsv, and for some passages in triples.train.small I was not able to find the passage ID in collections.tsv.
In top1000.train, there are NOT 1000 passages for each query, but only 1-10. In top1000.dev, however, most queries (but not all) do have 1000 passages.
In qrels.train, there are 4 columns. According to the documentation, column 0 is the query ID and column 2 is the passage ID. What are columns 1 and 3? Which one is is_selected, and what is the other? Here are examples from the documentation.
The encoding of the text should be UTF-8, but it seems there may have been some corruption at some point. If there are differences, they should be minor. If you give me some specifics, I can look into it.
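If it helps to collect specifics, here is a minimal sketch for locating lines that are not valid UTF-8. It assumes the file is read in binary mode, one record per line; the function name `find_undecodable` is made up for illustration. Note that it only catches invalid byte sequences. Text that was corrupted earlier and then re-saved as valid UTF-8 will pass this check and has to be spotted by eye.

```python
def find_undecodable(raw_lines):
    """Return (line_number, byte_offset) for every line that is not valid UTF-8.

    raw_lines is any iterable of bytes objects, e.g. a file opened with "rb".
    """
    bad = []
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode("utf-8")  # strict decoding raises on invalid bytes
        except UnicodeDecodeError as exc:
            bad.append((lineno, exc.start))
    return bad


# Hypothetical sample rows: the second one contains a lone Latin-1 byte.
sample = [b"0\tplain ascii", b"1\tcaf\xe9", "2\tcaf\u00e9".encode("utf-8")]
bad_lines = find_undecodable(sample)
print(bad_lines)

# On the real file (path assumed to be in the working directory):
# with open("collections.tsv", "rb") as f:
#     print(find_undecodable(f)[:20])
```

Posting the reported line numbers together with the affected passage IDs would make the problem easy to reproduce.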
Yes, this is true. There are a few queries where BM25 did not return 1000 results, and thus some queries do not have 1000 passages. This should not be a big issue.
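To see how many queries are affected, a rough sketch like the following counts the candidate passages per query. It assumes the top1000 files are tab-separated with the query ID in the first column (check this against your copy); `passages_per_query` and `short_queries` are illustrative names, not part of any official tooling.

```python
from collections import Counter


def passages_per_query(lines):
    """Count how many rows (candidate passages) each query ID has."""
    counts = Counter()
    for line in lines:
        qid = line.split("\t", 1)[0].strip()
        if qid:  # skip blank lines
            counts[qid] += 1
    return counts


def short_queries(counts, expected=1000):
    """Return the queries that have fewer than `expected` candidates."""
    return {qid: n for qid, n in counts.items() if n < expected}


# Hypothetical rows in the assumed layout: qid, pid, query text, passage text.
rows = ["1\t10\tq one\tp a", "1\t11\tq one\tp b", "2\t12\tq two\tp c"]
counts = passages_per_query(rows)
print(short_queries(counts, expected=2))

# On the real file (path assumed):
# with open("top1000.dev", encoding="utf-8") as f:
#     print(len(short_queries(passages_per_query(f))))
```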
Columns 1 and 3 are residuals from joins that we did not remove. Please go ahead and ignore them. Every pair listed there has is_selected set; that is, a row QID 0 PID 1 means that passage PID had is_selected set to 1 for query QID.
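In practice, reading the file can look like the sketch below: keep columns 0 and 2 and drop the rest. It assumes whitespace-separated rows with integer IDs; the function name `load_qrels` and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def load_qrels(lines):
    """Map each query ID to the set of passage IDs with is_selected = 1."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed rows
        qid, _, pid, _ = parts  # columns 1 and 3 are join residue, ignored
        relevant[int(qid)].add(int(pid))
    return dict(relevant)


# Made-up rows in the four-column layout described above.
sample = ["10\t0\t200\t1", "10\t0\t201\t1", "11\t0\t300\t1"]
qrels = load_qrels(sample)
print(qrels)

# On the real file (path assumed):
# with open("qrels.train", encoding="utf-8") as f:
#     qrels = load_qrels(f)
```

Any (query, passage) pair that does not appear in the resulting mapping can be treated as not selected.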