microsoft / MSMARCO-Passage-Ranking

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. A variant of this task will be part of TREC and AFIRM 2019. For updates about TREC 2019, please follow this repository. Passage reranking task: given a query q and the 1000 most relevant passages P = p1, p2, p3, ... p1000, as retrieved by BM25, a successful system is expected to rerank the most relevant passage as high as possible. For this task, not all 1000 relevant items have a human-labeled relevant passage. Evaluation will be done using MRR.
https://microsoft.github.io/MSMARCO-Passage-Ranking/
MIT License
292 stars 38 forks

Third version of Train Triples QID PID Format that mimics triples.train.full.tsv.gz #21

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

Per discussion here: https://github.com/microsoft/MSMARCO-Passage-Ranking/commit/4695a71c6c76ce85c07a51c0f12690cab19abbb0

The current version of qidpidtriples.train.full.2.tsv.gz has the same records as triples.train.full.tsv.gz, but they are in a different order.

It would be nice for these to be consistent so that those using these files as the training data sequence can control for the order of training in experiments.
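For illustration, here is a minimal sketch of how the "same records, different order" observation could be checked. It assumes both files have already been reduced to tab-separated (qid, positive pid, negative pid) rows; mapping the text in triples.train.full.tsv.gz back to IDs is not shown. The helper names `read_triples` and `compare_triples` are hypothetical, and everything is loaded into memory, which may not be practical for the full files.

```python
import gzip
from collections import Counter


def read_triples(path):
    """Read tab-separated (qid, pos_pid, neg_pid) rows from a .tsv or .tsv.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


def compare_triples(a, b):
    """Return (same_records, same_order) for two lists of triples."""
    # Counter equality compares the two lists as multisets, ignoring order.
    same_records = Counter(a) == Counter(b)
    # List equality additionally requires identical ordering.
    same_order = a == b
    return same_records, same_order
```

A result of `(True, False)` would correspond to the situation described above: identical records in a different sequence.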

seanmacavaney commented 3 years ago

fwiw it appears that the version of the qid/pid triples file prior to https://github.com/microsoft/MSMARCO-Passage-Ranking/commit/4695a71c6c76ce85c07a51c0f12690cab19abbb0 did have the triples in the same order as triples.train.full.tsv.gz (but some records were missing, which is what the change was about).

seanmacavaney commented 3 years ago

I think there are compelling reasons to have the qidpidtriples file in the same order as the triples file. But I also understand that this may seem somewhat pedantic and not be seen as a priority.

If I built this file for you, would you host it?
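As a rough sketch of what building such a file could involve: stream the text triples in their existing order and emit the matching ID triple for each one. The function name `reorder_id_triples` and the two lookup dicts are hypothetical, and the approach assumes every text triple resolves to at least one ID triple. It also holds everything in memory, so it is an outline rather than something ready for the full-size files.

```python
from collections import defaultdict, deque


def reorder_id_triples(text_triples, id_triples, query_text, passage_text):
    """Arrange id_triples in the order of the matching text_triples.

    text_triples: iterable of (query, positive passage, negative passage) strings
    id_triples:   iterable of (qid, pos_pid, neg_pid)
    query_text / passage_text: dicts mapping IDs to their text (assumed inputs)
    """
    # Group ID triples by the text they resolve to. If several ID triples
    # share identical text, they are handed out in first-seen order.
    by_text = defaultdict(deque)
    for qid, pos, neg in id_triples:
        key = (query_text[qid], passage_text[pos], passage_text[neg])
        by_text[key].append((qid, pos, neg))

    ordered = []
    for key in text_triples:
        bucket = by_text.get(tuple(key))
        if not bucket:
            raise KeyError(f"no ID triple left for text triple {key!r}")
        ordered.append(bucket.popleft())
    return ordered
```

Keying on the full text triple, rather than inverting text to a single ID, avoids having to assume that each passage string maps to exactly one pid.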