goel-shashank / CyCLIP


Possible bug in inmodal/crossmodal loss calculation #7

Closed HarmanDotpy closed 1 year ago

HarmanDotpy commented 1 year ago

Hi,

I hope I'm not misunderstanding this, but I think there may be an issue with the following line in train.py:

inmodal_cyclic_loss = (logits_image_per_image - logits_text_per_text).square().mean() / (umodel.logit_scale.exp() * umodel.logit_scale.exp()) * batch_size

I think this multiplies by the batch size in the numerator instead of dividing by it in the denominator, so for a large batch size, e.g. 4096, the loss becomes very large, which might cause issues (for me, learning did not progress with a batch size of 4096).

I believe the correction is simply to pull the batch size term inside the brackets: inmodal_cyclic_loss = (logits_image_per_image - logits_text_per_text).square().mean() / (umodel.logit_scale.exp() * umodel.logit_scale.exp() * batch_size), and similarly for the crossmodal cyclic loss.
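For concreteness, here is a rough standalone comparison of the two variants on dummy embeddings (not the actual training code; the logit scale value, embedding dimension, and batch size below are placeholders):

```python
import torch
import torch.nn.functional as F

# Dummy comparison of the two normalizations; values are placeholders,
# not the embeddings / logit scale used in train.py.
batch_size = 8
logit_scale = torch.tensor(100.0)  # stand-in for umodel.logit_scale.exp()

image_embeds = F.normalize(torch.randn(batch_size, 512), dim=-1)
text_embeds = F.normalize(torch.randn(batch_size, 512), dim=-1)
logits_image_per_image = logit_scale * image_embeds @ image_embeds.t()
logits_text_per_text = logit_scale * text_embeds @ text_embeds.t()

# As written in train.py: the mean is multiplied by batch_size
loss_as_written = (logits_image_per_image - logits_text_per_text).square().mean() \
    / (logit_scale * logit_scale) * batch_size

# Change proposed above: divide by batch_size instead
loss_proposed = (logits_image_per_image - logits_text_per_text).square().mean() \
    / (logit_scale * logit_scale * batch_size)

print(loss_as_written.item(), loss_proposed.item())  # differ by a factor of batch_size ** 2
```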

goel-shashank commented 1 year ago

Hi, it might not look obvious, but it is correct. Check the equation in the paper. Since we are taking the mean of the N x N matrix, the sum is divided by N x N. As per the equation, we want to divide it by N, so we multiply the mean by N instead.
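Spelled out, with D denoting the N x N matrix of squared differences (loosely following the paper's notation):

```latex
\operatorname{mean}(D)\cdot N \;=\; \frac{N}{N^{2}}\sum_{j=1}^{N}\sum_{k=1}^{N} D_{jk} \;=\; \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N} D_{jk}
```

which is the 1/N normalization the equation in the paper asks for.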

HarmanDotpy commented 1 year ago

I see, sorry, I missed the .mean(). However, is there a reason why you divide by N and not N^2? For example, in Eq. 3 there are N^2 terms in the numerator, so if we scale N to a large value, as is normally done in CLIP training, this loss would scale linearly in N, which might become quite large for large N.
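Roughly, if each squared difference has some typical magnitude c that does not depend on N (a back-of-the-envelope sketch, with c an assumed constant):

```latex
\frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{N} D_{jk} \;\approx\; \frac{1}{N}\cdot N^{2} c \;=\; N c
```

so the cyclic term grows linearly with the batch size.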

Hritikbansal commented 1 year ago

There is an extra cyclic-loss hyperparameter that you can scale down accordingly if your batch size is large, to control the contribution from the cyclic terms.
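For example, a rough sketch of how such a weight folds into the total objective; the hyperparameter names cylambda_inmodal / cylambda_crossmodal and the default values below are hypothetical, not necessarily the flags used in this repo's train.py:

```python
import torch
import torch.nn.functional as F

# Rough sketch of the total objective with weighted cyclic terms.
# Hyperparameter names and defaults are hypothetical placeholders.
def total_loss(image_embeds, text_embeds, logit_scale,
               cylambda_inmodal=0.25, cylambda_crossmodal=0.25):
    batch_size = image_embeds.shape[0]

    logits_text_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_image_per_text = logits_text_per_image.t()
    logits_image_per_image = logit_scale * image_embeds @ image_embeds.t()
    logits_text_per_text = logit_scale * text_embeds @ text_embeds.t()

    # Standard CLIP-style contrastive term
    target = torch.arange(batch_size)
    contrastive_loss = (F.cross_entropy(logits_text_per_image, target)
                        + F.cross_entropy(logits_image_per_text, target)) / 2

    # Cyclic terms, normalized as discussed above (mean * batch_size)
    inmodal_cyclic_loss = (logits_image_per_image - logits_text_per_text).square().mean() \
        / (logit_scale * logit_scale) * batch_size
    crossmodal_cyclic_loss = (logits_text_per_image - logits_image_per_text).square().mean() \
        / (logit_scale * logit_scale) * batch_size

    # Scaling the lambdas down (e.g. for batch size 4096) controls how much
    # the cyclic terms contribute relative to the contrastive term.
    return contrastive_loss \
        + cylambda_inmodal * inmodal_cyclic_loss \
        + cylambda_crossmodal * crossmodal_cyclic_loss
```

Usage would look something like total_loss(F.normalize(torch.randn(8, 512), dim=-1), F.normalize(torch.randn(8, 512), dim=-1), torch.tensor(100.0)), with the lambdas reduced for very large batch sizes.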