Hi! I found something a bit confusing that you might be able to clarify. For the contrastive loss implemented in the "LecbertForPreTraining" class, you apply a sigmoid activation before the BCE loss, but the paper says a standard InfoNCE loss (with softmax activations) was used. Am I misunderstanding something?
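To make the difference concrete, here is a minimal sketch of the two formulations as I understand them. This is not the repository's code: the function names, the `temperature` parameter, and the example scores are my own assumptions, and it is written in plain Python just for illustration.

```python
import math


def bce_with_sigmoid(scores, positive_index):
    """Sigmoid + binary cross-entropy: every candidate is judged independently."""
    total = 0.0
    for i, s in enumerate(scores):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid of the raw score
        target = 1.0 if i == positive_index else 0.0
        total += -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return total / len(scores)  # mean over candidates


def info_nce(scores, positive_index, temperature=1.0):
    """Softmax cross-entropy (InfoNCE): candidates compete for one positive."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[positive_index]


# Hypothetical similarity scores: one positive (index 0) and two negatives.
scores = [2.0, 0.5, -1.0]
print("sigmoid + BCE:", bce_with_sigmoid(scores, 0))
print("InfoNCE:      ", info_nce(scores, 0))
```

One visible difference: adding the same constant to every score leaves the softmax-based loss unchanged, while the sigmoid + BCE loss changes, so the two objectives are not interchangeable in general.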