Intuitively, by using hard negatives, we are trying to push random negatives that happen to have high logits away from the true positive. Since the negatives are random, doesn't this force the model at step t+1 to be drastically different from the model at step t?
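For reference, the in-batch softmax loss these models optimize looks roughly like this (notation mine, with $s(u, i)$ the similarity score between the two tower outputs):

$$\mathcal{L}(u, i^{+}) = -\log \frac{e^{s(u,\, i^{+})}}{e^{s(u,\, i^{+})} + \sum_{j} e^{s(u,\, i_{j}^{-})}}$$

Its gradient with respect to a negative logit $s(u, i_{j}^{-})$ is that negative's softmax probability, so the negatives with the highest logits are already the ones pushed down hardest; explicit hard mining just concentrates the loss on them.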
Also, neither of the two seminal two-tower retrieval papers [1, 2] mentions any use of hard negatives. Any guidance or insight on when they are useful and when they are not?
[1] Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
[2] Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations
Both implementations I have seen take the hardest negatives from within the batch. Only positive (user, item) pairs are fed into the model, so each example's in-batch negatives are simply the positives of the other examples. In a second stage, the most difficult samples (the in-batch negatives with the highest logits) are forwarded to the softmax, as in the sketch below.
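A minimal sketch of that two-stage selection, assuming dot-product logits and NumPy; the function name and the `num_hard` parameter are my own, not from either paper:

```python
import numpy as np

def in_batch_hard_negative_logits(user_emb, item_emb, num_hard):
    """Two-stage in-batch hard negative mining (sketch).

    user_emb, item_emb: (B, d) tower outputs; item_emb[i] is the
    positive for user_emb[i]. Returns (B, 1 + num_hard) logits:
    column 0 is the positive, the rest are the hardest in-batch
    negatives, ready for softmax cross-entropy with label 0.
    """
    logits = user_emb @ item_emb.T              # (B, B) similarity matrix
    pos = np.diag(logits)                       # positive logit per row
    neg = logits.copy()
    np.fill_diagonal(neg, -np.inf)              # mask out the positives
    # Stage 2: keep only the num_hard highest-logit negatives per row.
    hard = -np.sort(-neg, axis=1)[:, :num_hard]
    return np.concatenate([pos[:, None], hard], axis=1)

# Example: batch of 4, 8-dim embeddings, keep the 2 hardest negatives.
rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = in_batch_hard_negative_logits(u, v, num_hard=2)
print(out.shape)  # (4, 3)
```

Note that only the selection of negatives changes between steps; the towers themselves still only ever see positive pairs.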