shuuchen / DetCo.pytorch

A PyTorch implementation of DetCo https://arxiv.org/pdf/2102.04803.pdf
MIT License

Some questions about local mlps and G2L learning. #6

Open czhaneva opened 2 years ago

czhaneva commented 2 years ago

Thank you very much for the code. I have some questions.

(1) Local MLPs. Take ResNet-50 as an example: the feature dim of the last stage is 2048, so according to the paper and the code, the in_dim of the local MLPs will be 2048 × 9 = 18432. The learnable parameters of a single square layer are then 18432 × 18432 = 339,738,624 ≈ 340 M, far larger than the ResNet-50 backbone (25.5 M). Is it possible to train such a network, and is it really reasonable to use such huge MLPs? I open this issue just for discussion.

(2) G2L. I used this idea in another task and found that both the global and local streams converge, but the G2L loss does not. Have you encountered this situation?

Thank you again.
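As a back-of-the-envelope check of the numbers above (a sketch only, assuming a square first linear layer over the 9 concatenated local-patch features, as the arithmetic in the question implies):

```python
# Parameter count of the hypothesized local MLP first layer.
# Assumption: input = 9 local patch features of the last ResNet-50
# stage (9 x 2048 dims), hidden dim equal to the input dim.
in_dim = 2048 * 9          # 18432: concatenated local patch features
hidden_dim = in_dim        # assumption: square weight matrix
weight_params = in_dim * hidden_dim

print(in_dim)              # 18432
print(weight_params)       # 339738624, i.e. ~340 M weights,
                           # vs ~25.5 M for the ResNet-50 backbone
```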

shuuchen commented 2 years ago

Hi, thanks.

1) This paper is about dense contrastive learning, which is computationally intensive. The parameters might be redundant; I expect better methods to emerge in the near future.

2) In my experiments, all losses converged. I think sufficient training data is necessary.
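For reference, the cross-level terms being discussed are InfoNCE-style contrastive losses. Below is a minimal sketch of such a term (e.g. global-to-local), assuming MoCo-style queries, positive keys, and a queue of negatives; the function name and shapes are illustrative assumptions, not this repository's actual API:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, queue, t=0.2):
    """Illustrative InfoNCE loss for cross-level contrast.

    q:     (N, D) query embeddings (e.g. global features)
    k:     (N, D) positive key embeddings (e.g. local features
           of the same images)
    queue: (K, D) negative keys from a memory bank
    t:     temperature
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)   # (N, 1) positive logits
    l_neg = q @ queue.t()                      # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / t
    # The positive key sits at index 0 of each row.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

If this term oscillates while the single-level losses converge, common knobs are the temperature, the queue size, and the relative weighting of the cross-level term.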

czhaneva commented 2 years ago

Thanks for your reply! The first point is clear to me now. For the second point, I'd like to discuss a bit further and hope for some help. Below is a visualization of all the losses (loss_g → global loss, loss_tl → local loss, loss_tl2g → local-to-global loss).

[plot of loss_g, loss_tl, and loss_tl2g over training]

We can observe a strange phenomenon: both loss_g and loss_tl stagnate during training (loss_g: epochs 15 to 35, loss_tl: epochs 3 to 80), while loss_tl2g oscillates heavily and only starts to decline towards the end of training. Given the non-convergence of loss_tl2g, I tried removing it to observe the other losses.

[plot of loss_g and loss_tl with loss_tl2g removed]

This doesn't change the situation described above much, but loss_tl converges better. My training data has about 15,000 samples over 60 classes. I've never seen this before, so I'm posting my results here for discussion and for some help. Maybe I need to analyze the model more carefully.

shuuchen commented 2 years ago

Hi,

The results don't look bad.

What is your batch size?

15,000 samples is a good start, but have you tried more or less training data? You could experiment with 10x or 1/10 of the current amount and check how the loss curves change.

According to the paper, more data is recommended. I used 120,000+ images.

You may also try more epochs. The tl2g plot looks like the early stage of the tl curve, so it might decrease in later epochs.

czhaneva commented 2 years ago

Thank you for the suggestions, and I will try them.

The batch size is 128.

I also found that it brings a slight performance improvement on linear evaluation.