spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

Include statistics on ranking dataset in documentation #12

Closed by drennings 5 years ago

drennings commented 5 years ago

Hi,

Responding to the request for feedback on the documentation, I have a suggestion.

It would have been helpful to me if the documentation included the size of each split of the dataset, as listed in issue #11. It would also be interesting to include other characteristics of the dataset, such as the average question length, the average passage length, and the number of unique passages in the top 1000 ranking by BM25 (assuming these are a subset of the 8 million passages in the whole dataset).

Thanks in advance!

spacemanidol commented 5 years ago

I'll get on this

spacemanidol commented 5 years ago

There are 7555149 unique words.
There are 8841823 unique passages.
The average question length is 6.373235758460644 words, with a range of 1 to 75 words.
The average passage length is 56.25311069900404 words, with a range of 1 to 362 words.
Top 1000 dev contains 3895239 unique passages.
Top 1000 eval contains 3831719 unique passages.
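
For readers who want to compute comparable figures on their own copy of the data, here is a minimal Python sketch. It is not the script used to produce the numbers above. It assumes that words are split on whitespace (a different tokenizer would give different counts) and that the top 1000 files are tab-separated with the passage id in a known column. The helper names `length_stats` and `unique_ids` are hypothetical, as is the column index in the usage lines.

```python
from typing import Iterable, NamedTuple, Set


class LengthStats(NamedTuple):
    count: int          # number of texts seen
    mean_words: float   # average length in words
    min_words: int      # shortest text, in words
    max_words: int      # longest text, in words
    unique_words: int   # size of the vocabulary


def length_stats(texts: Iterable[str]) -> LengthStats:
    """Word-length statistics over texts, splitting on whitespace (assumption)."""
    vocab: Set[str] = set()
    count = 0
    total = 0
    shortest = None
    longest = 0
    for text in texts:
        tokens = text.split()
        n = len(tokens)
        count += 1
        total += n
        shortest = n if shortest is None else min(shortest, n)
        longest = max(longest, n)
        vocab.update(tokens)
    if count == 0:
        raise ValueError("no texts given")
    return LengthStats(count, total / count, shortest, longest, len(vocab))


def unique_ids(rows: Iterable[str], column: int) -> Set[str]:
    """Distinct values in one tab-separated column, e.g. passage ids."""
    return {
        row.rstrip("\n").split("\t")[column]
        for row in rows
        if row.strip()  # skip blank lines
    }


# Toy usage; real input would be streamed line by line from the dataset files.
toy_passages = ["the quick brown fox", "hello world", "a"]
toy_stats = length_stats(toy_passages)

# Hypothetical layout: query id, passage id, query text, passage text.
toy_top1000 = ["q1\tp1\tsome query\ttext one", "q1\tp2\tsome query\ttext two",
               "q2\tp1\tother query\ttext one"]
toy_unique = unique_ids(toy_top1000, column=1)
```

Both helpers accept any iterable of strings, so an open file handle can be passed directly and the full collection never has to be held in memory, apart from the vocabulary and id sets.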