huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0
2.22k stars 220 forks

Hard Negative Mining vs random sampling #349

Open vahuja4 opened 1 year ago

vahuja4 commented 1 year ago

Has anyone tried doing hard negative mining when generating the sentence pairs as opposed to random sampling? @tomaarsen - is random sampling the default?

tomaarsen commented 1 year ago

Random sampling for the negative pairs is the default, yes. My understanding is that this is a relatively hard-to-beat baseline. @danielkorat has done some research on different sampling approaches, and I believe he found that some of the seemingly clever sampling approaches were beaten by simple random sampling. However, I think he also found that there are some improvements to be made over purely random sampling.

I don't recall exactly if he tried finding hard negatives, but perhaps he can elaborate himself a bit, if he finds the time.
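For anyone who wants to experiment, here is a minimal sketch of what hard negative mining could look like on top of pre-computed embeddings from the un-tuned encoder. This is plain NumPy, not SetFit's actual sampling API; the function name and the `(anchor, negative, label)` tuple format are illustrative assumptions:

```python
import numpy as np

def mine_hard_negative_pairs(embeddings, labels):
    """For each example, pair it with the most similar example from a
    *different* class (a hard negative) instead of a random one.

    embeddings: (n, d) array from the un-tuned sentence encoder.
    labels: length-n sequence of class labels.
    Returns a list of (anchor_idx, negative_idx, 0.0) tuples, where
    0.0 marks a negative pair (illustrative convention, not SetFit's).
    """
    labels = np.asarray(labels)
    # Cosine similarity matrix over L2-normalized embeddings
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    pairs = []
    for i in range(len(labels)):
        # Mask out same-class examples (and self), then take the most
        # similar remaining example as the hard negative
        candidate_sims = np.where(labels != labels[i], sims[i], -np.inf)
        j = int(np.argmax(candidate_sims))
        pairs.append((i, j, 0.0))
    return pairs
```

Swapping this in for random negative sampling would be an experiment, not a guaranteed win, given the results Daniel saw.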

adfindlater commented 1 year ago

I was wondering something similar. I have an n-class case where some of the classes will likely already be well separated in the un-tuned embedding space. It would be nice to bias sampling towards the pairs where I know a priori there is likely to be confusion in the downstream classification task.
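One cheap way to get such a bias, as a sketch: score each class pair by the cosine similarity of its class centroids in the un-tuned space, then sample more negative pairs from the high-similarity (i.e. likely-confused) class pairs. Again plain NumPy and not SetFit's API; the function name is made up:

```python
import numpy as np

def confusability_weights(embeddings, labels):
    """Weight each class pair by the cosine similarity of its class
    centroids in the un-tuned embedding space, so negative-pair sampling
    can be biased toward the classes most likely to be confused.

    Returns a dict mapping (class_a, class_b) -> similarity weight.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One centroid per class, L2-normalized for cosine similarity
    cents = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    cents /= np.linalg.norm(cents, axis=1, keepdims=True)
    sims = cents @ cents.T
    weights = {}
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            # Higher centroid similarity -> draw more negatives from (a, b)
            weights[(classes[a], classes[b])] = float(sims[a, b])
    return weights
```

The weights could then feed a weighted choice over class pairs when drawing negatives, instead of the uniform choice random sampling implies.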