texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

About BM25 hard negatives #87

Closed · aken12 closed this 1 year ago

aken12 commented 1 year ago

Hi :) Thank you for your great work! I read this issue (https://github.com/texttron/tevatron/issues/66#issuecomment-1473047705), which includes this description:

> This downloads the cleaned corpus hosted by RocketQA team, generate BM25 negatives and tokenize train/inference data using BERT tokenizer. The process could take up to tens of minutes depending on connection and hardware.

Does this mean "Tevatron/msmarco-passage" is built from the cleaned MS MARCO corpus? @MXueguang

I would also like to know whether it uses plain BM25 hard negatives, or BM25 hard negatives with some additional processing.

MXueguang commented 1 year ago

Hi @aken12, "Tevatron/msmarco-passage" uses BM25 hard negatives, but with two additional treatments:

  1. If the positive passage does not appear in the top-200 BM25 hits, the example is dropped, which leaves ~400k examples rather than the original 500k (see the sketch after this list).
  2. We use the augmented corpus from RocketQA, where each passage is augmented with its title.
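
For illustration, here is a minimal sketch of treatment 1, assuming a TREC-format BM25 run file and (qid, positive pid) pairs from the qrels; the function names and file layout are my own assumptions here, not the actual Tevatron preprocessing code:

```python
from collections import defaultdict

def load_bm25_run(run_path, depth=200):
    """Read a TREC-format run file (qid Q0 pid rank score tag) and
    keep the top-`depth` passage ids per query."""
    hits = defaultdict(set)
    with open(run_path) as f:
        for line in f:
            qid, _q0, pid, rank, _score, _tag = line.split()
            if int(rank) <= depth:
                hits[qid].add(pid)
    return hits

def filter_training_examples(qrel_pairs, hits):
    """Drop (qid, positive_pid) pairs whose positive passage is
    missing from the query's top BM25 hits."""
    return [(qid, pid) for qid, pid in qrel_pairs if pid in hits.get(qid, ())]
```
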
aken12 commented 1 year ago

@MXueguang Thank you for your kind response! In examples/coCondenser-marco/get_data.sh, I cannot find the code that performs the positive-passage filtering you described:

> 1. If the positive passage does not appear in the top-200 BM25 hits, the example is dropped, which leaves ~400k examples rather than the original 500k.

Do we need to add that filtering step ourselves, in addition to running get_data.sh? Or is qidpidtriples.train.full.2.tsv.gz already processed? (I suspect the latter is not the case.)

Thanks :)

MXueguang commented 1 year ago

It seems qidpidtriples.train.full.2.tsv.gz only has ~400k queries, so the filtering is already reflected in that file.
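
A quick way to check, assuming the standard MS MARCO qidpidtriples layout (one `qid \t positive pid \t negative pid` per line):

```python
# Count unique query ids in the triples file; expect roughly 400k.
import gzip

qids = set()
with gzip.open("qidpidtriples.train.full.2.tsv.gz", "rt") as f:
    for line in f:
        qids.add(line.split("\t", 1)[0])

print(f"unique queries: {len(qids)}")
```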

aken12 commented 1 year ago

Oh, yes, that's true. I understand now, thank you!!