keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Self-supervised training time is too long compared to MAE #50

Closed · HIT-SIRS closed this issue 11 months ago

HIT-SIRS commented 1 year ago

Thanks for your work. We compared SparK with MAE (ConvNeXt-Base vs. Swin-Base), and the training time of SparK is about 6.5 times that of MAE. Is there any way to improve training efficiency? Is the long training time caused by insufficient hardware optimization of sparse convolution?

powermano commented 1 year ago

You can try the self-supervised pretraining in ConvNeXt V2, which may be more efficient.

keyu-tian commented 11 months ago

@DZ1533 @powermano MAE is more efficient because of the sparsity that the Transformer and its attention allow: the encoder can simply drop the masked tokens. ConvNeXt V2 uses the same masked convolution as SparK, so the efficiency ordering would be MAE > SparK == ConvNeXt V2. More efficient pretraining on CNNs thus remains a challenge still to be explored.
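
To illustrate the point, here is a rough PyTorch sketch (not code from the SparK repo; the shapes and the 0.6 mask ratio are just examples): an MAE-style Transformer encoder can gather the visible tokens and run on a shorter sequence, so its attention/MLP cost drops with the mask ratio, whereas a masked convolution still slides a dense kernel over the full feature map.

```python
import torch
import torch.nn as nn

B, L, C = 8, 196, 768                      # batch, patch tokens, channels (illustrative)
mask_ratio = 0.6
num_visible = int(L * (1 - mask_ratio))

# MAE-style: keep only the visible tokens -> shorter sequence -> fewer FLOPs.
tokens = torch.randn(B, L, C)
ids_keep = torch.rand(B, L).argsort(dim=1)[:, :num_visible]
visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
print(visible.shape)                       # torch.Size([8, 78, 768])

# Masked convolution (SparK / ConvNeXt V2 style): the feature map keeps its
# full spatial size; masked positions are zeroed, but the dense conv still
# visits every location, so the FLOPs do not shrink with the mask ratio.
feat = torch.randn(B, C, 14, 14)
mask = (torch.rand(B, 1, 14, 14) > mask_ratio).float()   # 1 = visible, 0 = masked
out = nn.Conv2d(C, C, 3, padding=1)(feat * mask) * mask  # re-mask after the conv
print(out.shape)                           # torch.Size([8, 768, 14, 14])
```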

yxchng commented 11 months ago

@keyu-tian shouldn't ConvNeXt V2 be more efficient due to its use of the Minkowski Engine, which has optimized sparse operations?

keyu-tian commented 11 months ago

@yxchng ConvNeXt V2 uses masked (dense) conv, which would be faster than the Minkowski Engine; see https://github.com/facebookresearch/ConvNeXt-V2/blob/main/TRAINING.md#implementing-fcmae-with-masked-convolution-in-jax. I feel their implementation based on the Minkowski Engine is for reference only. If you test it, you may find that in this masked-pretraining scenario masked conv is faster. P.S.: the Minkowski Engine is largely optimized for 3D sparse voxels, which differ a lot from 2D masked images: it is 3D vs. 2D, and 2D masked images have lower and constant sparsity ratios, which may make standard conv faster than sparse Minkowski conv.
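
For concreteness, here is a minimal sketch of what "masked (dense) convolution" means here (illustrative PyTorch, not code from SparK or ConvNeXt V2): run an ordinary dense conv on the zeroed-out input and re-apply the binary mask afterwards, so masked positions stay empty just as a sparse conv would keep them, while the heavy lifting still goes through the regular dense conv kernels.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Dense conv that emulates sparse conv by re-masking its output."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)

    def forward(self, x, mask):
        # mask: (B, 1, H, W) with 1 = visible patch, 0 = masked patch
        x = self.conv(x * mask)   # plain dense conv over the full grid (cuDNN-friendly)
        return x * mask           # re-mask so features never leak into masked regions

x = torch.randn(2, 64, 56, 56)
mask = (torch.rand(2, 1, 56, 56) > 0.6).float()
y = MaskedConv2d(64, 128)(x, mask)
print(y.shape)                    # torch.Size([2, 128, 56, 56])
```

With a fixed 2D mask like this, the dense path can beat a general-purpose sparse engine, which is tuned for the far more extreme sparsity of 3D voxel grids.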