microsoft / msmarco

website for MS Marco
https://microsoft.github.io/msmarco/.
Creative Commons Attribution 4.0 International

textually-duplicate passages in msmarco v2 #8

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

I noticed that there's a sizeable number of passages in the v2 corpus that have text that exactly matches other passages: ~27.8 million passages, which amounts to around 20% of all passages in the corpus. Sometimes it's extremely prevalent, with one passage even being repeated 23,680 times [1]. [code] [file containing the duplicate passage IDs]
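A minimal sketch of how such exact duplicates can be found: group passage IDs by a hash of their exact text, then keep groups with more than one member. The record layout below (a `pid` and a `passage` field) mirrors the v2 JSONL format, but the mini-corpus itself is made up for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical mini-corpus standing in for the msmarco v2 passage files,
# which are JSONL records carrying (among others) "pid" and "passage" fields.
passages = [
    {"pid": "msmarco_passage_00_1", "passage": "The quick brown fox."},
    {"pid": "msmarco_passage_00_2", "passage": "An unrelated passage."},
    {"pid": "msmarco_passage_01_7", "passage": "The quick brown fox."},
]

# Group passage IDs by a hash of their exact text.
groups = defaultdict(list)
for rec in passages:
    digest = hashlib.md5(rec["passage"].encode("utf-8")).hexdigest()
    groups[digest].append(rec["pid"])

# Groups with more than one passage ID are exact textual duplicates.
dupes = [pids for pids in groups.values() if len(pids) > 1]
```

On the real corpus the same grouping would be streamed over the gzipped passage files rather than held in a list, but the hashing idea is the same.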

This is realistic, of course, since multiple documents often do contain the same passage. This is reflected in the other passage fields. I am wondering how this will affect evaluation, though. If I recall correctly, in the past NIST assessors evaluated the passage retrieval task irrespective of the context from the document. Is that the case again this year, or will the associated document also be considered? If only the passage text is considered, how will duplicates be handled?

[1] FWIW cases like this particular one (msmarco_passage_27_152452064, an advertising disclosure from Yellow Pages) are rather unlikely to be an answer to an actual question. Other exact duplicates are high-quality answers, though.

craswell commented 3 years ago

Yeah, we knew dupes would be an issue with the new datasets. The dataset is realistic as it stands, and in such realistic situations we know the retrieval system would have deduping mechanisms to prevent users from seeing duplicate results.

If people participate in the passage task by first ranking documents and then ranking the passages of the top-k documents, then we definitely don't want to remove any passages just because they also appear in some other doc. So if we did apply some unrealistic form of deduping, it would look more like having a single passage ID that points to many different documents, so that no passage-document connections are lost.
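The "single passage ID pointing to many documents" idea could be sketched as follows: collapse exact-duplicate texts onto one canonical passage ID while recording every document that contains the text, so no passage-document connection is dropped. The field names and records here are assumptions for illustration, not the actual v2 schema.

```python
# Hypothetical records: each passage occurrence with its source document.
passages = [
    {"pid": "p1", "docid": "d1", "passage": "shared text"},
    {"pid": "p2", "docid": "d2", "passage": "shared text"},
    {"pid": "p3", "docid": "d3", "passage": "unique text"},
]

canonical = {}     # passage text -> canonical pid (first pid seen wins)
docs_for_pid = {}  # canonical pid -> all docids containing that text

for rec in passages:
    pid = canonical.setdefault(rec["passage"], rec["pid"])
    docs_for_pid.setdefault(pid, []).append(rec["docid"])
```

After this pass, ranking can operate on canonical passage IDs while the document-first pipeline can still recover every document a passage came from.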

Having said that, as soon as we get rid of exact dupes, there are "near dupes" to consider.
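One cheap heuristic for catching a first layer of near dupes is to normalize text (lowercase, strip punctuation, collapse whitespace) before hashing, so passages that differ only in formatting collapse together. This is only an illustrative sketch, not a method used for the dataset; genuinely fuzzy near dupes would need something like shingling or MinHash.

```python
import re

def normalized(text: str) -> str:
    # Cheap normalization: lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

# Two passages that are near dupes under this definition:
a = "The Quick, Brown Fox!"
b = "the quick   brown fox"
same = normalized(a) == normalized(b)
```

Grouping on `normalized(text)` instead of the raw text in the exact-dup pass would then merge these cases as well.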

Overall, our approach so far has been to do some testing to make sure the collection is usable with the current training+dev sets, which seems to be the case, and we'll figure out further steps later. Thanks for raising the issue.