spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

How were passage reranking triples generated? #14

Closed: chsasank closed this issue 5 years ago

chsasank commented 5 years ago

Hi,

It's not clear how triples were generated. Your documentation says:

These triples (available in small and large versions, 27 GB and 270 GB respectively) contain a query followed by a positive passage and a negative passage.

But it also says

Then, the existing dataset has an annotation of is_selected:1 if a judge used a passage to generate their answer. We consider these as a ranking signal where all passages that have a value of 1 are a true positive for query relevance for that given passage. Any passage that has a value of 0 is not a true negative.

How are negative passages generated if is_selected:0 is not a true negative? Could you please open source the code used to generate these triples?

I think the documentation for the dataset needs work. Given how useful the dataset is, it would be a shame if people were unable to use it because of the documentation.

spacemanidol commented 5 years ago

@chsasank I am working on open sourcing the scripts that were used to generate the entire dataset. Keep an eye out next week.

spacemanidol commented 5 years ago

Hey, just an update. I have uploaded the script that we used to generate the ranking triples. It's a SCOPE script, but it should be pretty easy to understand. https://github.com/dfcf93/MSMARCOV2/blob/master/Ranking/GenerateData.script
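
For readers who do not work with SCOPE, here is a minimal Python sketch of what a (query, positive passage, negative passage) triple generator could look like. It is not a translation of the linked script. It assumes that positives are the passages marked is_selected:1 and that negatives are drawn at random from the same query's passages marked is_selected:0, which the thread does not confirm. The function name `generate_triples`, its parameters, and the input layout are all hypothetical.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Hypothetical input layout: for each query id, a list of
# (passage_text, is_selected) pairs.
Candidates = Dict[str, List[Tuple[str, int]]]


def generate_triples(
    queries: Dict[str, str],
    candidates: Candidates,
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (query, positive_passage, negative_passage) triples.

    ASSUMPTION: negatives are sampled uniformly from the passages of the
    same query that have is_selected == 0. These passages are unjudged,
    not confirmed irrelevant, so they are only approximate negatives.
    """
    rng = random.Random(seed)
    for qid, query in queries.items():
        passages = candidates.get(qid, [])
        positives = [text for text, selected in passages if selected == 1]
        unselected = [text for text, selected in passages if selected == 0]
        # A triple needs one passage of each kind; skip queries without both.
        if not positives or not unselected:
            continue
        for positive in positives:
            for _ in range(negatives_per_positive):
                yield query, positive, rng.choice(unselected)
```

Because an is_selected:0 passage may still be relevant, triples built this way carry some label noise; a different sampling pool would change how much.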