spacemanidol / MSMARCO

Utilities, Baselines, Statistics and Descriptions Related to the MSMARCO DATASET
MIT License

Uncommon train / dev / test split of ranking dataset #11

Closed drennings closed 5 years ago

drennings commented 6 years ago

Hi,

I have two questions about the train/dev/test split of the ranking dataset. I noted that:

Now, my questions are:

  1. Why was roughly a 40:1:1 split made instead of e.g. a more common 8:1:1 split?
  2. Why do only (roughly) 55% of the queries in dev have an answer whereas 100% of the queries in train have an answer?

Thanks in advance!

spacemanidol commented 6 years ago

Hey.

Good eye. It seems I uploaded the wrong files, but I have fixed it. We had initially subsampled the files when experimenting, since the set is so big.

New query files:

```
101093 queries.dev.tsv
101092 queries.eval.tsv
502939 queries.train.tsv
```

New qrels files:

```
 45684 qrels.dev.tsv
401023 qrels.train.tsv
```

The sizes are going to be a little different since, for the train set, we are removing all queries that do not have an answer (the original train set is ~800,000), but we have not removed these from dev and eval, in order to keep those sets held out and avoid affecting the other MS MARCO tasks.

That being said, the percentage of queries that do not have answers is about the same across splits (~35%), so the sets are now matched.
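For illustration, the filtering described above — keeping only the train queries that have at least one judged answer — could be sketched roughly like this. The file names come from the thread, but the exact TSV layouts (query id in the first column of both files) are assumptions, and this is not the script the maintainers actually used:

```python
# Sketch: keep only the queries whose id appears in the qrels file,
# i.e. the queries that have at least one relevant passage judged.
# File names are from the thread; the TSV column layout is an assumption.
import csv

def filter_answered_queries(queries_path, qrels_path, out_path):
    # Collect the query ids that have at least one qrels entry.
    with open(qrels_path, newline="") as f:
        answered = {row[0] for row in csv.reader(f, delimiter="\t")}
    # Copy through only the queries with a judged answer.
    with open(queries_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for row in csv.reader(src, delimiter="\t"):
            if row[0] in answered:
                writer.writerow(row)
```

Run against the real queries.train.tsv and qrels.train.tsv, this would reproduce the drop from ~800k queries to the ~500k answered ones.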

drennings commented 6 years ago

Great, thanks for the fix and clarification!

Let me double-check that I understand things correctly:

If these statements are correct, I wonder what the use is of the queries in queries.train that have an answer in collection.tsv but are not mentioned in qrels.train. If each split should contain queries for which no answer exists, shouldn't queries.train also be updated accordingly? For instance:

spacemanidol commented 5 years ago

Hey,

Sorry for closing this early.

You are correct, those files were updated. It seems that the original queries.dev and queries.eval were just subsamples of the actual queries, so I have included the full sets. The qrels.train file became smaller because there was some normalization applied to collection.tsv, and the person who did that is on vacation. Once I fix this normalization error I will update the collection.tsv file and the qrels files. The expected sizes are ~550k for train and ~56k for dev (and about the same for eval).

It's worth noting that the queries.* files exist only for ease of joining the sets; they are not used in evaluation. For evaluation, your system will be reranking passages for a query that has an answer. Your system's score is based on how highly it ranks the relevant passages (qrels). Since there are a few cases where the BM25 model did not return the passage marked as relevant (few, but they happen), a system will never be able to achieve a perfect MRR of 1. I will post the theoretical maximum MRR for this dataset shortly.
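The MRR metric described above can be sketched as follows — this is an illustrative implementation, not the official MS MARCO eval script. Each query contributes the reciprocal of the rank of its first relevant passage, and contributes 0 if no relevant passage appears in the ranking at all, which is exactly why a relevant passage missing from the BM25 candidate list caps the achievable MRR below 1:

```python
# Sketch of mean reciprocal rank (MRR). Names and data shapes here
# are illustrative assumptions, not the official evaluation tooling.
def mean_reciprocal_rank(rankings, qrels):
    """rankings: {qid: [pid, ...] in ranked order};
    qrels: {qid: set of relevant pids}."""
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
        # queries with no relevant passage retrieved contribute 0
    return total / len(rankings) if rankings else 0.0
```

For example, a query whose relevant passage sits at rank 2 scores 0.5, and a query whose relevant passage was never retrieved scores 0, so their mean is 0.25 — no amount of reranking can recover the missing passage.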

drennings commented 5 years ago

Hey,

No problem at all, thanks for your reply.

By "normalization error", do you mean that there are now duplicate passages in collection.tsv (passages that have a different id but the same contents)? And that these duplicate documents will be removed from collection.tsv, so that the file will only contain unique passages, and that all qrels files will be updated accordingly?
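The duplicate check being asked about here could be sketched like this — a rough illustration, assuming collection.tsv has an `id<TAB>passage` layout (the file name comes from the thread):

```python
# Sketch: group passage ids in collection.tsv by their text, then
# report any texts that appear under more than one id. The id<TAB>text
# layout is an assumption for illustration.
from collections import defaultdict

def find_duplicate_passages(collection_path):
    groups = defaultdict(list)
    with open(collection_path, encoding="utf-8") as f:
        for line in f:
            pid, _, text = line.rstrip("\n").partition("\t")
            groups[text].append(pid)
    # Keep only the texts shared by two or more passage ids.
    return {text: pids for text, pids in groups.items() if len(pids) > 1}
```

(As the reply below clarifies, this turned out not to be the issue — the ids and sizes were unchanged.)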

Looking forward to the updated dataset!

spacemanidol commented 5 years ago

No, by "normalization" I mean that some characters were removed in collection.tsv that weren't removed elsewhere; the ids are constant and the size is the same.

If you go ahead and check the updated qrels, you will now find the full files:

```
erasmus@spacemanidol:~/MSMARCOV2/Ranking/Baselines/DataDir$ wc -l qrels.*
  59273 qrels.dev.tsv
  59187 qrels.eval.tsv
 532761 qrels.train.tsv
 651221 total
```

There may be an update to the dataset in the future, but for now feel free to have at it!