keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; Pytorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Self-supervised training time is too long compared to MAE #50

Closed: HIT-SIRS closed this issue 1 year ago

HIT-SIRS commented 1 year ago

Thanks for your work. We compared SparK with MAE (ConvNeXt-Base vs. Swin-Base), and the pretraining time of SparK is about 6.5 times that of MAE. Is there any way to improve training efficiency? Is the long training time caused by insufficient hardware optimization of sparse convolution?

powermano commented 1 year ago

You can try the self-supervised pretraining in ConvNeXt V2 (FCMAE), which may be more efficient.

keyu-tian commented 1 year ago

@DZ1533 @powermano MAE is more efficient because its Transformer encoder exploits sparsity: it drops the masked tokens and runs attention only on the visible ones. ConvNeXt V2 uses the same masked convolution as SparK, so the efficiency ranking would be MAE > SparK == ConvNeXt V2. More efficient pretraining on CNNs thus remains a challenge and is still to be explored.
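To make the cost gap concrete, here is a minimal, illustrative PyTorch sketch (not code from this repo): an MAE-style Transformer can simply drop masked tokens and encode only the visible ~40%, while a masked convolution still sweeps the full dense H x W grid. The shapes, the 60% mask ratio, and all variable names below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

B, N, C = 8, 196, 768          # batch, number of patch tokens, embed dim (illustrative)
H = W = 14                      # spatial grid for the CNN view
mask_ratio = 0.6
keep = int(N * (1 - mask_ratio))

tokens = torch.randn(B, N, C)

# MAE-style: drop masked tokens -> the encoder only sees ~40% of them,
# so attention/MLP compute shrinks roughly in proportion.
ids = torch.rand(B, N).argsort(dim=1)[:, :keep]                      # random subset to keep
visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, C))
encoder_block = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)
out_vit = encoder_block(visible)                                     # runs on (B, keep, C)

# Masked-conv style (the SparK / ConvNeXt V2 idea): the feature map stays dense,
# masked positions are zeroed before and after the conv, so the conv still
# processes the full H x W grid and FLOPs are not reduced.
feat = torch.randn(B, C, H, W)
mask = (torch.rand(B, 1, H, W) > mask_ratio).float()                 # 1 = visible
conv = nn.Conv2d(C, C, kernel_size=3, padding=1)
out_cnn = conv(feat * mask) * mask                                   # dense compute at every position
```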

yxchng commented 1 year ago

@keyu-tian Shouldn't ConvNeXt V2 be more efficient due to its use of the Minkowski Engine, which has optimized sparse operations?

keyu-tian commented 1 year ago

@yxchng ConvNeXt V2 uses masked (dense) conv, which would be faster than the Minkowski Engine; see https://github.com/facebookresearch/ConvNeXt-V2/blob/main/TRAINING.md#implementing-fcmae-with-masked-convolution-in-jax. I feel their Minkowski Engine implementation is for reference only. If you test it, you may find that in this masked-pretraining scenario masked conv is faster. P.S.: the Minkowski Engine is largely optimized for 3D sparse voxels, which differ a lot from 2D masked images: besides the 3D-vs-2D difference, 2D masked images have smaller and constant sparsity ratios, which can make a standard (masked) conv faster than a sparse Minkowski conv.
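For reference, a hedged sketch of the masked (dense) convolution pattern described above, not the exact sparse-conv implementation used in this repo: keep the feature map dense, zero the masked positions, run a standard conv, then re-apply the mask so information cannot leak from masked into visible positions. The `MaskedConv2d` name and the 60% mask ratio are made up for the example.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Standard dense conv that ignores masked positions by re-masking its output."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x, mask):
        # x: (B, C, H, W) dense features; mask: (B, 1, H, W), 1 = visible patch
        x = self.conv(x * mask)   # plain cuDNN conv over the full grid
        return x * mask           # re-mask so masked positions stay zero

# Usage: at a ~60% 2D mask ratio the grid is still dense enough that one plain
# conv over all H*W positions tends to beat gather/scatter sparse kernels that
# are tuned for the far higher sparsity of 3D voxel data.
x = torch.randn(2, 64, 56, 56)
mask = (torch.rand(2, 1, 56, 56) > 0.6).float()
y = MaskedConv2d(64, 64)(x, mask)
```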