microsoft / SimMIM

This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".
https://arxiv.org/abs/2111.09886
MIT License

Confusion about fine-tune #11

Closed · Breeze-Zero closed this issue 2 years ago

Breeze-Zero commented 2 years ago

Thank you very much for your work. I used your method to pre-train on my own dataset and then compared it against both a randomly initialized model and an ImageNet supervised pre-trained model. The SimMIM pre-trained model converged at roughly the same speed as the randomly initialized one, although its accuracy gradually pulled ahead after a number of iterations; it still did not converge as well as the supervised pre-trained model. Does your method behave the way MAE describes for fine-tuning, i.e. accuracy keeps improving over many iterations? Looking forward to your reply.

ancientmooner commented 2 years ago

Thanks for sharing and asking. Our finding is that if you want good results on your own downstream tasks, a second stage of supervised pre-training after SimMIM (or similar approaches such as MAE) is highly recommended. This second-stage supervised pre-training introduces additional semantics that are helpful for other downstream tasks. This is what we did for our 3B Swin V2 training: SimMIM + supervised (classification).
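
For concreteness, here is a minimal sketch of what such a two-stage schedule might look like, assuming a PyTorch backbone and a SimMIM checkpoint that stores encoder weights under an `encoder.` prefix. The builder function, feature dimension, and file names are placeholders, not this repo's actual APIs:

```python
import torch
import torch.nn as nn

def load_simmim_encoder(backbone: nn.Module, ckpt_path: str) -> nn.Module:
    """Copy encoder weights from a SimMIM pre-training checkpoint into a bare backbone."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)
    # Keep only the encoder weights; the key prefix may differ in your checkpoint.
    encoder_state = {k[len("encoder."):]: v
                     for k, v in state.items() if k.startswith("encoder.")}
    missing, unexpected = backbone.load_state_dict(encoder_state, strict=False)
    print(f"loaded encoder: {len(missing)} missing / {len(unexpected)} unexpected keys")
    return backbone

class ClassificationWrapper(nn.Module):
    """Stage 2: attach a linear head and run ordinary supervised training."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)  # assumes the backbone returns pooled features
        return self.head(feats)

# Usage (all names hypothetical):
# backbone = build_swin_backbone()                                  # your constructor
# backbone = load_simmim_encoder(backbone, "simmim_pretrain.pth")   # stage-1 result
# model = ClassificationWrapper(backbone, feat_dim=1024, num_classes=1000)
# ... train `model` with labels, then fine-tune its backbone on the downstream task.
```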

Breeze-Zero commented 2 years ago

> Thanks for sharing and asking. Our finding is that if you want good results on your own downstream tasks, a second stage of supervised pre-training after SimMIM (or similar approaches such as MAE) is highly recommended. This second-stage supervised pre-training introduces additional semantics that are helpful for other downstream tasks. This is what we did for our 3B Swin V2 training: SimMIM + supervised (classification).

Thank you for taking time out of your busy schedule to reply, but it does not quite answer my question. I am curious why, after SimMIM pre-training, the training loss on downstream tasks such as segmentation converges at about the same speed as with random initialization (part of the reason may be that roughly half of the segmentation network's parameters sit in the decoder). I have previously tried self-supervised methods such as DINO, and their downstream training loss converged very quickly, so this confuses me; I am also checking whether there is a problem in my own setup.

Breeze-Zero commented 2 years ago

Supplementary training curves for reference:

[two screenshots of training-loss curves attached]

ancientmooner commented 2 years ago

I did not quite follow your steps. Is it the following comparison:

SimMIM pre-training + segmentation fine-tune (red) vs. supervised pre-training + segmentation fine-tune (blue)

Breeze-Zero commented 2 years ago

SimMIM pre-training backbone + segmentation fine-tune (red) vs. Initialization weight backbone + segmentation fine-tune (blue)
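
For concreteness, the comparison above could be set up roughly as follows; `build_segmentation_model`, the `backbone` attribute, and the checkpoint key layout are hypothetical and should be adapted to whatever segmentation framework is in use:

```python
import torch

def make_comparison_runs(build_segmentation_model, simmim_ckpt_path):
    # Run A (red curve): backbone initialized from SimMIM pre-training.
    model_simmim = build_segmentation_model()
    ckpt = torch.load(simmim_ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)
    encoder_only = {k[len("encoder."):]: v
                    for k, v in state.items() if k.startswith("encoder.")}
    model_simmim.backbone.load_state_dict(encoder_only, strict=False)

    # Run B (blue curve): same architecture, default random initialization.
    model_random = build_segmentation_model()

    # Train both with an identical optimizer, schedule, and data pipeline so
    # the loss curves are directly comparable.
    return model_simmim, model_random
```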

ancientmooner commented 2 years ago

> SimMIM pre-training backbone + segmentation fine-tune (red) vs. Initialization weight backbone + segmentation fine-tune (blue)

Thank you for the clarification. In general, a pre-trained model will converge much faster.

Yes, it is probably because the head is heavy compared to the backbone. Another possible explanation is that this task is relatively easy, so both runs converge very quickly.
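
One quick way to check the "heavy head" explanation is to compare parameter counts. The sketch below assumes the segmentation model exposes its encoder as `model.backbone`, which is a naming assumption, not a guarantee:

```python
def backbone_vs_head_params(model):
    """Report how much of the model sits in the backbone vs. the head/decoder."""
    count = lambda m: sum(p.numel() for p in m.parameters())
    backbone = count(model.backbone)  # assumed attribute name
    total = count(model)
    head = total - backbone
    print(f"backbone: {backbone / 1e6:.1f}M params, "
          f"head/decoder: {head / 1e6:.1f}M params "
          f"({100 * head / total:.1f}% of the total)")
```

If the head holds a large share of the total, early training loss is dominated by optimizing the randomly initialized head in both runs, which would make the two curves look similar at the start.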

Asers387 commented 2 years ago

> Thanks for sharing and asking. Our finding is that if you want good results on your own downstream tasks, a second stage of supervised pre-training after SimMIM (or similar approaches such as MAE) is highly recommended. This second-stage supervised pre-training introduces additional semantics that are helpful for other downstream tasks. This is what we did for our 3B Swin V2 training: SimMIM + supervised (classification).

Would it be possible to explain what exactly you mean by "second-stage supervised pretraining"? Is there any documentation you could link concerning this? Thanks!

ywdong commented 2 years ago

> Thanks for sharing and asking. Our finding is that if you want good results on your own downstream tasks, a second stage of supervised pre-training after SimMIM (or similar approaches such as MAE) is highly recommended. This second-stage supervised pre-training introduces additional semantics that are helpful for other downstream tasks. This is what we did for our 3B Swin V2 training: SimMIM + supervised (classification).

> Thank you for taking time out of your busy schedule to reply, but it does not quite answer my question. I am curious why, after SimMIM pre-training, the training loss on downstream tasks such as segmentation converges at about the same speed as with random initialization (part of the reason may be that roughly half of the segmentation network's parameters sit in the decoder). I have previously tried self-supervised methods such as DINO, and their downstream training loss converged very quickly, so this confuses me; I am also checking whether there is a problem in my own setup.

I also have the same problem. @834799106 Did you manage to solve it?