microsoft / TREC-2019-Deep-Learning

Website for the TREC Deep Learning Track 2019
https://microsoft.github.io/TREC-2019-Deep-Learning/
Creative Commons Attribution 4.0 International

questions about Passage ranking dataset #16

Closed yixuan-qiao closed 4 years ago

yixuan-qiao commented 4 years ago

I have some questions about the passage ranking dataset:

a) There are approximately 400,000 distinct queries in triples.train.full.tsv.gz. Are they all in the qrels.train.tsv and qrels.dev.tsv files?
b) Furthermore, is every query-passage pair from qrels.train.tsv and qrels.dev.tsv in triples.train.full.tsv.gz? In other words, I do not know how the triples.train.full.tsv.gz file was generated. Maybe for each query-passage pair in qrels.train.tsv and qrels.dev.tsv, a random negative sample, or a sample from the top k by BM25, was added?
c) Are the positive and negative passages in triples.train.full.tsv.gz all in collection.tar.gz?
d) What is the difference between triples.train.full.tsv.gz and qidpidtriples.train.full.tsv.gz (397,756,691 vs. 269,919,004)? It seems to be more than a format difference.

There does not seem to be any detailed documentation that answers these questions.

spacemanidol commented 4 years ago

a) They are all in qrels.train.tsv.
b) The dev portion is not in triples.train.full.tsv.gz. The triples file was generated with all the negative-positive pairs in the top 1000 (the positive is from the qrels; the negative is something not picked by BM25 as top 1000).
c) Yes.
d) The qid file is a processed form of the triples. It has the qids and pids instead of the text.
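For anyone who wants to check answers (a) and (b) locally, here is a minimal sketch. It assumes qrels.train.tsv has four tab-separated columns (qid, an unused column, pid, relevance) and that the qid triples file has three (qid, positive pid, negative pid). Those layouts and the function name are assumptions for illustration, not something confirmed in this thread.

```python
def check_triples_against_qrels(qrels_lines, triple_lines):
    """Return the qids and (qid, positive pid) pairs that appear in the
    triples but not in the qrels.

    Assumed layouts (not confirmed in this thread):
      qrels:   qid <TAB> unused <TAB> pid <TAB> relevance
      triples: qid <TAB> positive pid <TAB> negative pid
    """
    qrels_qids = set()
    qrels_pairs = set()
    for line in qrels_lines:
        qid, _unused, pid, _relevance = line.rstrip("\n").split("\t")
        qrels_qids.add(qid)
        qrels_pairs.add((qid, pid))

    unknown_qids = set()
    unknown_pairs = set()
    for line in triple_lines:
        qid, positive_pid, _negative_pid = line.rstrip("\n").split("\t")
        if qid not in qrels_qids:
            unknown_qids.add(qid)
        if (qid, positive_pid) not in qrels_pairs:
            unknown_pairs.add((qid, positive_pid))
    return unknown_qids, unknown_pairs
```

Both arguments can be open file handles, for example from gzip.open(path, "rt", encoding="utf-8"). Two empty sets would mean every query and every positive pair in the triples is covered by the qrels.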

yixuan-qiao commented 4 years ago

About question (d): I ran some experiments and found that the triples file has 408,684 distinct positive passages while qidpidtriples has 332,094. As you said, the latter holds indices rather than text, so why do the record counts differ, and even the numbers of distinct queries, positive passages, and negative passages?
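A count like this can be reproduced with something along these lines. It is only a sketch: it assumes both files are tab-separated with the query, positive, and negative in the first three columns, and it stores a 16-byte hash of each value rather than the full passage text to keep memory down. The function names are hypothetical.

```python
import gzip
import hashlib


def count_distinct_columns(lines, n_columns=3):
    """Count distinct values in each of the first n_columns of a TSV stream."""
    seen = [set() for _ in range(n_columns)]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for index, value in enumerate(fields[:n_columns]):
            # Keep a short digest instead of the (possibly long) passage text.
            digest = hashlib.blake2b(value.encode("utf-8"), digest_size=16).digest()
            seen[index].add(digest)
    return [len(values) for values in seen]


def count_distinct_in_gzip(path, n_columns=3):
    """Same count, reading directly from a gzip-compressed TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_distinct_columns(handle, n_columns)
```

Running count_distinct_in_gzip on both the text triples and the qid triples gives per-column counts that can be compared directly.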

spacemanidol commented 4 years ago

There was an issue in creating the qid triples file. Some of the queries in triples.train.full.tsv.gz did not join with the queries.tsv file, so initially we dropped them. We are regenerating the qid triples file to fix this.
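To illustrate the kind of join described here, the sketch below maps query text back to a qid and reports the queries that fail to match. It is not the script that produced the file. It assumes queries.tsv holds qid and query text separated by a tab and that the first column of the triples file is the query text; the function names are hypothetical, and the thread does not say why the original join failed.

```python
def load_query_ids(query_lines):
    """Build a query text -> qid lookup from lines of 'qid<TAB>text' (assumed layout)."""
    text_to_qid = {}
    for line in query_lines:
        qid, text = line.rstrip("\n").split("\t", 1)
        text_to_qid[text] = qid
    return text_to_qid


def find_unjoined_queries(triple_lines, text_to_qid):
    """Return the distinct query texts in the triples that have no qid.

    An exact-match join silently drops these rows unless they are reported.
    """
    missing = set()
    for line in triple_lines:
        query_text = line.rstrip("\n").split("\t", 1)[0]
        if query_text not in text_to_qid:
            missing.add(query_text)
    return missing
```

A non-empty result from find_unjoined_queries points at rows that an exact-match join would lose, for instance because of differences in whitespace or casing.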

yixuan-qiao commented 4 years ago

If I understand correctly, triples.train.full.tsv.gz has more distinct queries than queries.tsv. Maybe both queries.tsv and the qid triples file need to be regenerated? Once they are regenerated, will there be a notification on the website (https://microsoft.github.io/TREC-2020-Deep-Learning/)?

spacemanidol commented 4 years ago

No, triples.train.full.tsv.gz has all the queries in the train portion of queries.tsv. The qid file needs to be regenerated. We will post an update in the 2020 repo.

bmitra-msft commented 4 years ago

Thanks for reporting this issue. As Daniel mentioned, it turns out there was some data loss when generating the qidpidtriples file. We've regenerated it, and it's available at the following location: https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.2.tsv.gz

Please let us know if you notice any other issues and thanks again for reporting the problem!

yixuan-qiao commented 4 years ago

OK, I got it. Thanks. I'll let you know if I find any other problems.