spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

BM25 relevance values for top 1000 eval/dev? #21

Closed: amirj closed this issue 5 years ago

amirj commented 5 years ago

In the documentation:

We collected all unique passages(without any normalization) to make a pool of ~8.8 million unique passages. Then, for each query from the existing MSMARCO splits(train,dev, and eval) we ran a standard BM25 to produce 1000 relevant passages. These were ordered by random so each query now has 1000 corresponding passages.

Why did you order the top 1000 retrieved docs randomly instead of storing the BM25 relevance values?

spacemanidol commented 5 years ago

To encourage competitors to optimize their systems with their own BM25 values. I've updated the values for dev BM25 performance. I've also added the scope script used to initially retrieve the BM25 documents, so that should help.
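
For anyone who wants to recompute scores for the shuffled candidates, here is a minimal sketch of Okapi BM25 re-ranking in plain Python. It is not the script used to build the dataset. The function names (`tokenize`, `bm25_scores`, `rerank`), the whitespace tokenizer, and the parameters `k1=1.2` and `b=0.75` are all assumptions chosen for illustration. It also computes IDF over the candidate list only, so the scores will differ from ones computed against the full ~8.8 million passage pool.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. The tokenizer used
    # for the original top-1000 retrieval is not described in this thread.
    return text.lower().split()


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Return one BM25 score per passage, in the order given.

    Assumption: document frequencies and average length are computed
    over `passages` only, not over the whole collection.
    """
    if not passages:
        return []

    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of passages containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def rerank(query, passages):
    # Sort (passage, score) pairs by descending score.
    scored = zip(passages, bm25_scores(query, passages))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rerank(query, candidates)` on the 1000 shuffled passages for a query gives back the same passages paired with scores, highest first. For scores comparable to a real first-stage retriever, the document frequencies would need to come from an index over the full collection.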