raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

[critical bug] The text encoder is also updated. #28

Closed SeongwoongCho closed 1 year ago

SeongwoongCho commented 1 year ago

I found out that the text encoder is also updated. The positional embedding of the provided "denseclip_fpn_res50.pth" is

```
tensor([[-0.0013,  0.0003,  0.0007,  ..., -0.0027, -0.0091, -0.0024],
        [-0.0039, -0.0008, -0.0016,  ..., -0.0006, -0.0049, -0.0044],
        [-0.0044,  0.0011, -0.0007,  ..., -0.0026, -0.0094, -0.0008],
        ...,
        [-0.0002, -0.0002, -0.0012,  ...,  0.0007,  0.0013, -0.0002],
        [-0.0016, -0.0015, -0.0001,  ..., -0.0010, -0.0025, -0.0004],
        [-0.0030, -0.0013, -0.0004,  ..., -0.0028, -0.0052, -0.0016]])
```

And the first 13 rows of the positional embedding of the pretrained CLIP RN50 model are

```
tensor([[-0.0012,  0.0003,  0.0008,  ..., -0.0027, -0.0090, -0.0024],
        [-0.0040, -0.0008, -0.0015,  ..., -0.0006, -0.0049, -0.0045],
        [-0.0044,  0.0011, -0.0006,  ..., -0.0025, -0.0093, -0.0007],
        ...,
        [-0.0002, -0.0002, -0.0011,  ...,  0.0006,  0.0011, -0.0003],
        [-0.0018, -0.0016, -0.0002,  ..., -0.0009, -0.0025, -0.0004],
        [-0.0031, -0.0014, -0.0006,  ..., -0.0026, -0.0053, -0.0015]],
       device='cuda:0', grad_fn=<...>)
```

which is slightly different.
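For reference, here is a minimal sketch of how this comparison can be reproduced. The state-dict key names (`state_dict`, `text_encoder.positional_embedding`) are assumptions about how the released checkpoint is organized and may need adjusting:

```python
# Sketch: compare the checkpoint's text-encoder positional embedding with CLIP RN50.
# Key names are assumptions about the checkpoint layout.
import torch
import clip  # OpenAI CLIP package

ckpt = torch.load("denseclip_fpn_res50.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)
denseclip_pe = state["text_encoder.positional_embedding"]

clip_model, _ = clip.load("RN50", device="cpu")
clip_pe = clip_model.positional_embedding[: denseclip_pe.shape[0]].float()

# If the text encoder were truly frozen, this would print True and ~0.
print(torch.allclose(denseclip_pe, clip_pe, atol=1e-6))
print((denseclip_pe - clip_pe).abs().max())
```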

I guess the reason is that "lr_mult" does not guarantee a zero learning rate: the learning rate of the text encoder may become larger than 0 due to the internal behavior of the LR scheduler. I think this is quite a critical bug, since it may affect the results of the ablation study (Table 2 in the paper).
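To make the freeze explicit rather than relying on lr_mult alone, one option is to combine the paramwise_cfg with `requires_grad=False`. This is only a sketch, assuming the text encoder is exposed as a `text_encoder` attribute/parameter prefix on the segmentor:

```python
# Sketch: belt-and-suspenders freezing of the text encoder.
# The 'text_encoder' name is an assumption about the model definition.
optimizer = dict(
    type='AdamW', lr=1e-4, weight_decay=1e-4,
    paramwise_cfg=dict(custom_keys={
        'text_encoder': dict(lr_mult=0.0),   # scales this group's LR to 0
    }))

# Disabling gradients additionally guarantees no update, regardless of
# what the LR scheduler or optimizer does internally:
for p in model.text_encoder.parameters():
    p.requires_grad = False
```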

Also, I have one more question: why do you set lr_mult to 0 for 'norm'? As far as I know, mmcv applies such a setting to every parameter whose name contains the key "norm". If that is right, every normalization layer in the transformer layers (especially in the context decoder) will also end up with a learning rate of 0.

raoyongming commented 1 year ago

Hi @SeongwoongCho, thanks for your interest in our work and for pointing out this issue.

It seems that the weights changed slightly after fine-tuning due to this lr_mult issue. I think the overall conclusion in Table 2 still holds, since a near-zero learning rate is still better than standard fine-tuning. Besides, in the post-model prompting setting, we can still compute the class embeddings using the learned text encoder after training and completely remove the text encoder during inference.
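As an illustration of the last point, the class embeddings can be computed once with the trained text encoder and cached, so inference never runs the text encoder again. This is only a sketch with hypothetical names (`text_encoder`, `class_token_ids`); in post-model prompting the context decoder still refines these cached embeddings with visual context at inference time:

```python
# Sketch: cache class embeddings after training (names are hypothetical).
import torch

with torch.no_grad():
    class_embeddings = model.text_encoder(class_token_ids)  # [num_classes, C]

model.register_buffer('class_embeddings', class_embeddings)
del model.text_encoder  # the text encoder itself is no longer needed at inference
```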

We set decay_mult (not lr_mult) to 0 for normalization layers, which is consistent with the common implementation for ImageNet pre-training. Setting the weight decay of the scale and shift parameters (\alpha and \beta) in normalization layers to 0 can usually improve the final performance.
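For clarity, a minimal version of that configuration in the mmcv paramwise_cfg convention looks like the following (base LR and weight decay values are illustrative only):

```python
# Sketch: zero weight decay (not zero LR) for normalization-layer parameters.
optimizer = dict(
    type='AdamW', lr=1e-4, weight_decay=1e-4,
    paramwise_cfg=dict(custom_keys={
        'norm': dict(decay_mult=0.0),  # affects weight decay only; the LR is unchanged
    }))
```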

SeongwoongCho commented 1 year ago

@raoyongming Thank you for your quick reply! I understand your points on both issues!