luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval
Apache License 2.0

The relative weight of the MLM loss compared to the contrastive loss #7

Closed: hyleemindslab closed this issue 2 years ago

hyleemindslab commented 2 years ago

In the paper, Equation 7 indicates that both the MLM and contrastive losses are divided by the effective batch size, which would equal 2 * per_device_train_batch_size * world_size. However, the MLM loss calculation code seems to divide the MLM loss only by per_device_train_batch_size * world_size (line 227), even though CoCondenserDataset's __getitem__ method returns two spans belonging to the same document, which makes the actual batch dimension larger by a factor of 2.

I feel like I am missing something. Could you please help me out?

https://github.com/luyug/Condenser/blob/de9c2577a16f16504a661039e1124c27002f81a8/modeling.py#L219-L230
https://github.com/luyug/Condenser/blob/de9c2577a16f16504a661039e1124c27002f81a8/data.py#L177-L179
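For concreteness, here is a minimal sketch of the two denominators being compared; the numbers and the spans_per_doc name are illustrative, not taken from the repo:

```python
# Illustrative numbers only; spans_per_doc is a made-up name for the fact that
# CoCondenserDataset.__getitem__ returns two spans of one document.
per_device_train_batch_size = 8   # documents per device
world_size = 4                    # number of devices
spans_per_doc = 2

effective_batch_size = spans_per_doc * per_device_train_batch_size * world_size  # Eq. 7 denominator
code_denominator = per_device_train_batch_size * world_size                      # what line 227 seems to use

assert effective_batch_size == spans_per_doc * code_denominator  # off by exactly a factor of 2
```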

luyug commented 2 years ago

Line 227 is for gradient accumulation scaling, not for averaging across batch examples; check out the trainer code: https://github.com/luyug/Condenser/blob/de9c2577a16f16504a661039e1124c27002f81a8/trainer.py#L161-L185
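To illustrate the general idea (a self-contained toy sketch, not the repo's trainer): rescaling each accumulation chunk's mean loss by chunk_size / full_batch_size before backward() makes the accumulated gradients equal the gradient of the mean loss over the whole per-device batch.

```python
import torch

# Toy sketch of gradient-accumulation scaling (not the repo's trainer): each
# chunk's mean loss is weighted by its share of the per-device batch before
# backward(), so the accumulated gradient equals the gradient of the mean loss
# over the full batch.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Reference: gradient of the mean loss over the full per-device batch.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
ref_grad = model.weight.grad.clone()

# Accumulated: per-chunk mean losses, each rescaled by chunk_size / full_size.
model.zero_grad()
for xc, yc in zip(x.split(2), y.split(2)):
    chunk_loss = torch.nn.functional.mse_loss(model(xc), yc)
    (chunk_loss * xc.size(0) / x.size(0)).backward()

assert torch.allclose(model.weight.grad, ref_grad, atol=1e-6)
```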

hyleemindslab commented 2 years ago

Yes, I just expected the MLM loss for a sub-batch to be scaled by (# of spans in the sub-batch / # of spans in the local batch) so that the final gradient is w.r.t. the loss that is averaged across the spans in the batch, which I thought would be written as loss = loss * (float(hiddens.size(0)) / (2 * self.train_args.per_device_train_batch_size)). But I'm starting to think it may not be that important.
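Under that proposed scaling, the per-chunk weights would sum to 1 over the 2 * per_device_train_batch_size spans on a device, so the accumulated gradient would be w.r.t. the span-averaged loss. A quick sketch with made-up numbers:

```python
# Made-up numbers, not repo code: with the scaling proposed above, the chunk
# weights hiddens.size(0) / (2 * per_device_train_batch_size) sum to 1 over the
# per-device batch of 2 * per_device_train_batch_size spans.
per_device_train_batch_size = 8          # documents per device -> 16 spans
sub_batch_span_counts = [4, 4, 4, 4]     # hiddens.size(0) for each accumulation chunk

weights = [n / (2 * per_device_train_batch_size) for n in sub_batch_span_counts]
assert sum(weights) == 1.0
```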

luyug commented 2 years ago

Right, there's a factor of 2. We didn't actually experiment much with how to interpolate the two losses; the current code seems to work fine. As training progresses and momentum in the optimizer stabilizes, I also expect that it won't be super important.