spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

Training data with QID and PID #22

QingyaoAi closed this issue 5 years ago

QingyaoAi commented 5 years ago

Thank you for creating such a great dataset for passage re-ranking!

I'm wondering if it is possible to release the top 1000 passages retrieved for each training/dev/test query with the corresponding QID and PID? The current training data is constructed from the raw text of queries and passages, which makes it too large to use conveniently. Also, since the qrels files are already constructed with QID and PID, it would make life much easier if the train/dev/test data were constructed with QID and PID as well.

amirj commented 5 years ago

Yes, it’s a good idea. Some people may not be interested in the passage contents and just want to train on the current top retrieval results.

spacemanidol commented 5 years ago

Sorry for the delay on this, but I finally got you: https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.tar.gz
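
For anyone picking up the archive, here is a minimal sketch of how ID-only training data can be used. The thread does not describe the file layout, so this assumes each line is tab-separated as `qid`, positive `pid`, negative `pid`. The helper names `read_triples` and `resolve`, and the small in-memory `queries`/`passages` lookups, are hypothetical and only illustrate the idea of keeping IDs on disk and joining to text on demand.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Tuple

# ASSUMPTION: one triple per line, tab-separated: qid, positive pid, negative pid.
Triple = Tuple[int, int, int]


def read_triples(lines: Iterable[str]) -> Iterator[Triple]:
    """Yield (qid, pos_pid, neg_pid) from tab-separated lines, skipping blank lines."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        qid, pos_pid, neg_pid = (int(field) for field in row)
        yield qid, pos_pid, neg_pid


def resolve(
    triple: Triple, queries: Dict[int, str], passages: Dict[int, str]
) -> Tuple[str, str, str]:
    """Look up the raw text for one triple only when it is actually needed."""
    qid, pos_pid, neg_pid = triple
    return queries[qid], passages[pos_pid], passages[neg_pid]


# Tiny made-up sample standing in for the real file and lookup tables.
sample = io.StringIO("1\t10\t20\n1\t10\t30\n\n2\t40\t20\n")
queries = {1: "what is msmarco", 2: "who closed the issue"}
passages = {10: "passage ten", 20: "passage twenty", 30: "passage thirty", 40: "passage forty"}

triples = list(read_triples(sample))
first_text = resolve(triples[0], queries, passages)
```

Storing three integers per example instead of three strings keeps the training file small, and the text lookups can be swapped for whatever query and passage collections you already have loaded.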