microsoft / TREC-2019-Deep-Learning

Website for the TREC Deep Learning Track 2019
https://microsoft.github.io/TREC-2019-Deep-Learning/
Creative Commons Attribution 4.0 International

question about passage ranking data #15

Closed lich8990 closed 4 years ago

lich8990 commented 4 years ago
  1. I noticed that "triples.train.full.tsv.gz" has 397,756,691 records, while "qidpidtriples.train.full.tsv.gz" has 269,919,004 records. Since the latter file provides the qids and pids of the full triples dataset, why are the numbers different?

  2. The "qrels" file provides the positive samples for the training data. How are the qrels dataset and the full triples dataset related? Can every positive sample in the full triples dataset be found in the qrels files, or were the two datasets generated in different ways?
bmitra-msft commented 4 years ago
  1. Thanks for reporting this issue. It turns out there was some data loss when generating the qidpidtriples file. We've regenerated it, and it's available at the following location:
     https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz
     Please let us know if you notice any other issues, and thanks again for reporting the problem!

  2. Yes, the triples file was generated by pairing the positive documents in the qrels file with the negative documents in the top100 file (after removing the positive docs from the latter).
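
For readers who want to see the pairing described in point 2 concretely, here is a minimal Python sketch. It is not the script used to build the released files. The function name `build_triples`, the in-memory `(qid, pid)` input shape, and the toy IDs are all assumptions made for illustration, and it assumes every positive is paired with every remaining top100 candidate, which the reply above does not spell out.

```python
from collections import defaultdict
from itertools import product


def build_triples(qrels, top100):
    """Pair each positive passage with each non-positive top100 candidate.

    qrels:  iterable of (qid, pid) pairs marking relevant passages.
    top100: iterable of (qid, pid) pairs of retrieved candidates.
    Yields (qid, positive_pid, negative_pid) tuples.

    Hypothetical helper: the input shapes are assumed, not taken from the
    actual file layouts.
    """
    positives = defaultdict(set)
    for qid, pid in qrels:
        positives[qid].add(pid)

    candidates = defaultdict(list)
    for qid, pid in top100:
        # Drop positives from the candidate list so they are never negatives.
        if pid not in positives.get(qid, ()):
            candidates[qid].append(pid)

    for qid, pos_pids in positives.items():
        # Assumption: full cross product of positives and remaining candidates.
        for pos, neg in product(sorted(pos_pids), candidates.get(qid, [])):
            yield qid, pos, neg


# Toy data, invented for illustration only.
qrels = [(1, 10), (2, 20)]
top100 = [(1, 10), (1, 11), (1, 12), (2, 21)]
triples = list(build_triples(qrels, top100))
```

Under these assumptions, every positive in the generated triples comes from the qrels input by construction, which matches the "Yes" given to the second question.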