You can try the self-supervised pretraining in ConvNeXt V2, which may be more efficient.
@DZ1533 @powermano MAE is more efficient due to the sparsity of the Transformer and its attention: the masked tokens can simply be dropped before the encoder. ConvNeXt V2 uses the same masked convolution as SparK, so the efficiency ranking would be MAE > SparK == ConvNeXt V2. More efficient pretraining for CNNs is thus a challenge and remains to be explored.
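To make that concrete, here is a minimal PyTorch sketch of the token-dropping trick that gives MAE its edge. The shapes and mask ratio are toy values for illustration only, not MAE's or SparK's actual configuration.

```python
import torch
import torch.nn as nn

# Toy sketch: MAE simply drops the masked tokens, so the Transformer encoder
# only attends over the visible ~25% of patches. Shapes are illustrative.
B, N, D, mask_ratio = 8, 196, 768, 0.75
tokens = torch.randn(B, N, D)
keep = int(N * (1 - mask_ratio))                       # 49 visible tokens

ids_keep = torch.rand(B, N).argsort(dim=1)[:, :keep]   # random subset to keep
visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

block = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
out = block(visible)   # attention cost scales with (0.25*N)^2 instead of N^2

# A masked convolution (SparK / ConvNeXt V2) cannot shrink its input this way:
# the feature map stays a dense H x W grid and the kernel slides over all of
# it, so the FLOPs stay roughly constant regardless of the mask ratio.
```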
@keyu-tian Shouldn't ConvNeXt V2 be more efficient due to its use of the Minkowski Engine, which has optimized sparse operations?
@yxchng ConvNeXt V2 uses masked (dense) convolution, which would be faster than the Minkowski Engine; see https://github.com/facebookresearch/ConvNeXt-V2/blob/main/TRAINING.md#implementing-fcmae-with-masked-convolution-in-jax. I feel their Minkowski-Engine-based implementation is for reference only. If you benchmark it, you may find that in this masked-pretraining scenario masked convolution is faster. PS: the Minkowski Engine is largely optimized for 3D sparse voxels, which differ a lot from 2D masked images: 3D vs. 2D, and 2D masked images have smaller, constant sparsity ratios, which may make standard (dense) convolution faster than sparse Minkowski convolution.
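For reference, a minimal sketch of the masked-(dense)-conv idea in PyTorch: run an ordinary dense convolution and re-apply the mask afterwards so no information leaks out of masked patches. The module and variable names here are illustrative, not the actual ConvNeXt-V2 or SparK code.

```python
import torch
import torch.nn as nn

class MaskedConvBlock(nn.Module):
    """Illustrative masked dense convolution: compute a normal (hardware-
    friendly) conv over the full grid, then zero the masked positions again
    so the result matches what a sparse conv would produce."""
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) with 1 = visible patch, 0 = masked patch
        x = x * mask          # zero masked inputs before the conv
        x = self.conv(x)      # dense conv: computes everywhere, but runs fast on GPUs
        return x * mask       # re-mask so masked positions stay exactly zero

# Toy usage: 60% of a 14x14 patch grid masked.
x = torch.randn(2, 96, 14, 14)
mask = (torch.rand(2, 1, 14, 14) > 0.6).float()
y = MaskedConvBlock(96)(x, mask)
```

The trade-off is exactly the one described above: dense convs waste compute on masked positions, but at 2D image scales and constant mask ratios they can still outrun a general-purpose sparse engine tuned for 3D voxels.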
Thanks for your work. We compared SparK with MAE (ConvNeXt-Base vs. Swin-Base), and the training time of SparK is about 6.5 times that of MAE. Is there any way to improve training efficiency? Is the lengthy training time caused by insufficient hardware optimization of sparse convolution?