texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Reproduction issue of coCondenser NQ #43

Closed SunSiShining closed 2 years ago

SunSiShining commented 2 years ago

I used the hard negatives (hn.bert.json) you provided and can reproduce R@5 = 75.8. But when I train with my own mined hard negatives, R@5 is only 64.3.

How do you generate hard negatives for NQ? Could you provide the reproduction setup?

Here is my setup for mining hard negatives:
Model: co-condenser-wiki trained with bm25 negatives
Negative depth: 200
Negative samples: 30
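The mining step described above (take the top-200 retrieved passages per query and sample 30 negatives) can be sketched roughly as follows. This is a minimal illustration, not Tevatron's actual mining code; `run`, `positives`, and the function name are hypothetical, and it assumes you already have a retrieval run as ranked passage ids per query.

```python
import random

def mine_hard_negatives(run, positives, depth=200, n_negatives=30, seed=42):
    """For each query, sample hard negatives from the top-`depth`
    retrieved passages, excluding any known positive passage ids.

    run:       dict of query id -> ranked list of passage ids
    positives: dict of query id -> set of gold positive passage ids
    """
    rng = random.Random(seed)
    mined = {}
    for qid, ranked_pids in run.items():
        gold = positives.get(qid, set())
        # candidates: top-depth hits that are not known positives
        candidates = [pid for pid in ranked_pids[:depth] if pid not in gold]
        k = min(n_negatives, len(candidates))
        mined[qid] = rng.sample(candidates, k)
    return mined
```

With depth=200 and n_negatives=30 this mirrors the setup above; whether sampling is uniform or rank-biased is one of the knobs that can change downstream accuracy.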

Looking forward to your reply!!! Thank you!

MXueguang commented 2 years ago

Hi @SunSiShining, I think in @luyug's setting the mined hard negatives are concatenated with the original bm25 negatives, i.e. the train set is ~50k examples with bm25 negatives plus ~50k examples with hard negatives. And during hard-negative mining, the mined positive passages are probably also updated.

SunSiShining commented 2 years ago

Thank you for such a quick reply, much appreciated. @MXueguang

The final training data consists of ~58k training queries with ~90 bm25 negatives per query and ~70k training queries with ~30 hard negatives per query.

I have merged the bm25 training data, but I have not updated the mined positive passages. I'll check whether this is the key factor degrading performance.
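One common way to "update the mined positive passages" on NQ (used by DPR-style pipelines; I'm assuming something similar here, the thread doesn't pin down the exact procedure) is to treat highly ranked retrieved passages that contain an answer string as additional positives. A hypothetical sketch:

```python
def update_positives(run, answers, corpus, depth=30):
    """Heuristic positive update: among each query's top-`depth`
    retrieved passages, keep those containing an answer string
    as (additional) positive passages.

    run:     dict of query id -> ranked list of passage ids
    answers: dict of query id -> list of answer strings
    corpus:  dict of passage id -> passage text
    """
    updated = {}
    for qid, ranked in run.items():
        hits = [pid for pid in ranked[:depth]
                if any(ans in corpus[pid] for ans in answers[qid])]
        updated[qid] = hits
    return updated
```

String containment is a crude match (case and tokenization matter), but it captures the idea: positives are refreshed from the retriever's own top results rather than frozen at the bm25 stage.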

thank you again :D