microsoft / ANCE

A novel embedding training algorithm that leverages ANN search and achieves SOTA retrieval on the TREC DL 2019 and OpenQA benchmarks
MIT License

where is bm25 introduced? #16

Open tangzhy opened 3 years ago

tangzhy commented 3 years ago

Hi,

For the warm-up step, I see a regular dense retrieval model being trained on the triples.small data provided by MS MARCO.

But I can't find any code that builds a BM25 index or does BM25 sampling. I guess you are treating the negatives in the triples.small data as BM25 negatives already?

What does "BM25 warm-up" mean, and where is it introduced?

Thanks

juyongjiang commented 2 years ago

Yeah, I also can't find the BM25 index. Have you found the answer to it?

MewemeW commented 2 years ago

+1

robro612 commented 1 year ago

I believe @tangzhy is correct (at least on MS MARCO). The triples.train.small.tsv file was generated by the MS MARCO dataset authors themselves, and their README refers to generating the triples using BM25. That's why there's no reference to BM25 in this repo.
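
For anyone unsure what "BM25 negatives" means in practice, here is a minimal sketch of the general idea: score candidate passages against the query with BM25 and keep the top-ranked ones that are not labeled relevant. This is not the MS MARCO pipeline and not code from this repo. The function names (`bm25_scores`, `sample_bm25_negatives`), the whitespace tokenization, and the `k1`/`b` defaults are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the tokenized query with Okapi BM25.

    Hypothetical helper for illustration; k1 and b are assumed defaults.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of docs containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores


def sample_bm25_negatives(query, passages, positive_ids, num_neg=1):
    """Return indices of the top BM25-ranked passages not labeled positive."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [p.lower().split() for p in passages]
    scores = bm25_scores(query.lower().split(), tokenized)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    # Keep only passages with some lexical overlap that are not positives.
    return [i for i in ranked if i not in positive_ids and scores[i] > 0][:num_neg]


passages = [
    "bm25 is a ranking function used by search engines",
    "dense retrieval uses embeddings",
    "bm25 ranking weights terms by frequency",
    "cats sleep all day",
]
negatives = sample_bm25_negatives("what is bm25 ranking", passages, positive_ids={0})
```

Negatives picked this way share words with the query without being the labeled answer, which makes them harder than random passages. If the thread's reading is right, the negative column of the triples file already plays this role, so the warm-up step needs no BM25 index of its own.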