Closed HarmanDotpy closed 1 year ago
Hi, it might not look obvious, but it is correct; check the equation in the paper. Since .mean() over the N x N matrix divides the sum by N x N, and the equation calls for dividing by N, we multiply the mean back by N.
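To make the normalization concrete, here is a minimal numpy sketch; `diff` is a hypothetical stand-in for the `(logits_image_per_image - logits_text_per_text)` tensor in train.py:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # stand-in for the batch size
diff = rng.standard_normal((N, N))

# .mean() over an N x N matrix divides the sum of squares by N*N;
# multiplying the mean back by N leaves a net division by N,
# matching the per-N normalization in the paper's equation.
via_mean = np.square(diff).mean() * N
via_sum = np.square(diff).sum() / N
assert np.isclose(via_mean, via_sum)
```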
I see, sorry, I missed the .mean(). However, is there a reason why you divide by N and not N^2? For example, in Eq. 3 there are N^2 terms in the numerator, so if we scale N to a large value, as is normally done in CLIP training, this loss would scale linearly in N, which might become quite large for large N.
There is an extra cyclic-loss hyperparameter that you can scale down accordingly if your batch size is high, to control the contribution from the cyclic terms.
Hi,
I hope I'm not misunderstanding this, but I think there may be an issue with the following line in the train.py file:
inmodal_cyclic_loss = (logits_image_per_image - logits_text_per_text).square().mean() / (umodel.logit_scale.exp() * umodel.logit_scale.exp()) * batch_size
I think this multiplies by the batch size in the numerator instead of dividing by it in the denominator, so for a large batch size, e.g. 4096, the loss becomes very large, which might cause an issue (for me, training did not progress with a batch size of 4096).
I believe the correction is simply to pull the batch size term inside the brackets:
inmodal_cyclic_loss = (logits_image_per_image - logits_text_per_text).square().mean() / (umodel.logit_scale.exp() * umodel.logit_scale.exp() * batch_size)
Similarly for the crossmodal cyclic loss.
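For comparison, here is a minimal numpy sketch of the two normalizations under discussion. The function name, shapes, and values are hypothetical stand-ins for the real logit tensors in train.py; the point is only the relative scaling with batch size:

```python
import numpy as np

def cyclic_loss(diff, logit_scale, batch_size, divide=False):
    # The repo version multiplies the mean by batch_size (net: sum / N);
    # the proposed fix divides by batch_size instead (net: sum / N^3).
    scale_sq = np.exp(logit_scale) ** 2
    m = np.square(diff).mean() / scale_sq
    return m / batch_size if divide else m * batch_size

rng = np.random.default_rng(0)
N = 8
diff = rng.standard_normal((N, N))

repo_version = cyclic_loss(diff, 0.0, N)
proposed_fix = cyclic_loss(diff, 0.0, N, divide=True)

# The two normalizations differ by a factor of batch_size**2,
# which is why the loss magnitude diverges at large batch sizes.
assert np.isclose(repo_version, proposed_fix * N**2)
```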