keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

reducing pre-training to 200 epochs #66

Closed bollossom closed 8 months ago

bollossom commented 8 months ago

Hello, what a nice job SparK is!!

  1. I am trying to reduce the SparK pre-training schedule from 1600 epochs to 200. Will this have a big impact on accuracy?
  2. At the same time, I would also like to ask whether SparK can be applied to hybrid CNN-Transformer backbones like MC-MAE.

keyu-tian commented 8 months ago

Thank you! 200 epochs of pre-training can be a bit insufficient. You may need to adjust some hyperparameters accordingly, e.g., double the learning rate, decrease the drop path rate, etc.
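
As a rough illustration of that kind of rescaling (a minimal sketch; the variable names and baseline values below are placeholders for illustration, not this repo's actual config):

```python
# Hedged sketch: rescaling a few hyperparameters when shortening the schedule.
# The baseline values here are assumed placeholders, NOT SparK's real defaults.
base = dict(epochs=1600, base_lr=2e-4, drop_path=0.05)

short = dict(
    epochs=200,
    base_lr=base['base_lr'] * 2,        # double the learning rate, as suggested above
    drop_path=base['drop_path'] * 0.5,  # weaker stochastic-depth regularization for fewer epochs
)
print(short)
```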

For hybrid CNN-Transformer backbones, SparK can be applied directly, because SparK does not change the model architecture or parameters. You can refer to our SparK code to use sparse layers on the CNN part (as we do in https://github.com/keyu-tian/SparK/blob/main/pretrain/encoder.py#L165) and use a multi-scale decoder to reconstruct images (as in https://github.com/keyu-tian/SparK/blob/main/pretrain/decoder.py).
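
For intuition, here is a minimal sketch of the masking idea behind such sparse layers (illustrative names only, assuming a class-level mask attribute; the actual implementation is in pretrain/encoder.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the masking trick behind SparK-style sparse layers
# (names are illustrative; see pretrain/encoder.py for the real code).
class SparseConv2d(nn.Conv2d):
    """Conv2d that re-zeroes its output at masked positions."""
    mask = None  # binary patch mask (1 = visible), set externally before forward

    def forward(self, x):
        out = super().forward(x)
        # Resize the patch mask to this feature map's resolution and apply it,
        # so masked regions stay zero and never leak into visible ones.
        m = F.interpolate(SparseConv2d.mask, size=out.shape[-2:], mode='nearest')
        return out * m

# Usage: mask roughly 60% of 7x7 patches, then run a sparse conv on a 224x224 image.
SparseConv2d.mask = (torch.rand(1, 1, 7, 7) > 0.6).float()
conv = SparseConv2d(3, 64, kernel_size=3, stride=2, padding=1)
y = conv(torch.randn(1, 3, 224, 224))  # output is zero wherever the mask is 0
```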

bollossom commented 8 months ago

OK, thank you very much!

bollossom commented 8 months ago

Hello, during pre-training I found that with the sparse convolution you provide, one epoch takes about 100 minutes. Is there anything I can do to speed this up? The training setup: a model about 70M in total size, trained on 8 A100s with 60G memory. For example, should I use the sparse convolution implementation in MinkowskiEngine, or write a custom CUDA operator?

keyu-tian commented 8 months ago

@bollossom can you provide details on the model type, dataset size, input size, batch size, and GPU utilization? BTW, what does 60G video memory mean?

Generally, I don't believe MinkowskiEngine would speed things up much, because 1) masked images are far denser than 3D point clouds, and 2) it lacks optimized kernels for sparse depthwise convolution, sparse group norm, etc.

bollossom commented 8 months ago

OK. First, sorry for my earlier inaccurate description. Our model is MC-MAE [base] with SparK, batch size 512, pre-trained on ImageNet using 8 GPUs with 64 GB each. Currently one pre-training epoch takes roughly 100 minutes. I wonder if there is a good way to speed it up.

keyu-tian commented 8 months ago

That training speed suggests there is probably a bug somewhere, so it's worth investigating: e.g., check GPU utilization and memory usage, or use the PyTorch profiler to log where the time goes. For reference, ConvNeXt-Base with bs=4096 on 32 A100s takes about 5 minutes per epoch.
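
A minimal profiling sketch (standard torch.profiler usage, not code from this repo; the tiny model and batch below are stand-ins for the real training step, and it assumes a CUDA-capable machine):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in model and batch so the sketch runs anywhere with a GPU;
# replace them with the real SparK model and dataloader when profiling for real.
model = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
batch = torch.randn(8, 3, 224, 224, device='cuda')

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):                # profile only a few steps
        model(batch).sum().backward()

# Sort ops by total CUDA time to spot the bottleneck.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```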

bollossom commented 8 months ago

OK, thank you for the patient guidance. May I ask: for SparK with ResNet-101, bs=4096 on 32 A100s, roughly how many minutes does one pre-training epoch take?

keyu-tian commented 8 months ago

About the same as ConvNeXt-Base.

bollossom commented 8 months ago

OK, thank you!