Thanks @xubin04. Doing masked pretraining on video data is challenging but valuable. I think your depth-wise idea is worth trying, and you might also refer to [1] or [2] for their masking strategies in ViT-based video pretraining.
If you want to pretrain a 3D-CNN-like network, I think our codebase can work for you with only a few modifications (data processing, masking, etc.; a rough sketch of the masking part is below the references). Feel free to comment here again if you have questions.
[1] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[2] Masked Autoencoders As Spatiotemporal Learners
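Not the actual API of this repo, just a minimal PyTorch sketch of the kind of masking change meant above. The function name `per_frame_patch_mask` and the `patch` / `mask_ratio` values are hypothetical; it draws an independent random patch mask for every frame (per-frame random masking, as opposed to VideoMAE-style tube masking, where one mask is shared across all frames):

```python
import torch

def per_frame_patch_mask(B, T, H, W, patch=16, mask_ratio=0.6, device="cpu"):
    """Boolean keep-mask of shape (B, T, H, W); True = visible, False = masked.

    Each (sample, frame) pair gets its own random set of visible patches,
    with an exact keep ratio of (1 - mask_ratio).
    """
    h, w = H // patch, W // patch
    num_patches = h * w
    num_keep = int(num_patches * (1 - mask_ratio))

    # Rank random noise per frame; the `num_keep` lowest ranks are kept.
    noise = torch.rand(B, T, num_patches, device=device)
    keep = noise.argsort(-1).argsort(-1) < num_keep          # (B, T, num_patches)

    # Expand the patch-level mask back to pixel resolution.
    keep = keep.view(B, T, h, w)
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return keep                                              # (B, T, H, W)
```

How you consume this mask (zeroing pixels, dropping patches before a sparse conv, etc.) depends on the rest of your pipeline; the sketch only covers generating a different mask per timestamp.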
I want to know whether I can effectively transfer this pretraining approach to video prediction tasks. My data has dimensions (batch_size, time_stamp*channel, height, width). Can I mask different regions for different timestamps, and can depth-wise processing handle this masking setup effectively? And thank you for your great work, this is a big step for CNN pretraining methods.
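As a concrete illustration of the question above, a self-contained sketch under assumed shapes and hypothetical values (the "depth-wise" layer here is simply `nn.Conv2d` with `groups = T*C`): each timestamp gets its own random patch mask, and the grouped convolution filters every time-channel slice independently, so masks from different timestamps never mix inside that layer.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 224, 224
patch, mask_ratio = 16, 0.6                              # hypothetical values
x = torch.randn(B, T * C, H, W)                          # layout: (batch, time*channel, H, W)

# A different random patch mask per timestamp (shared by that frame's C channels).
h, w = H // patch, W // patch
keep = (torch.rand(B, T, h, w) > mask_ratio)             # roughly 40% of patches stay visible
keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)   # (B, T, H, W)
keep = keep.unsqueeze(2).expand(B, T, C, H, W).reshape(B, T * C, H, W).float()

x_masked = x * keep                                      # zero out masked regions frame by frame

# Depth-wise conv (groups = T*C): each time-channel slice is filtered on its own,
# so the per-timestamp masks cannot leak into each other in this layer.
dw = nn.Conv2d(T * C, T * C, kernel_size=3, padding=1, groups=T * C)
y = dw(x_masked)
print(y.shape)                                           # torch.Size([2, 24, 224, 224])
```

Assuming the channels are ordered frame-major (t0's channels, then t1's, ...), using `groups=T` instead of `groups=T*C` would let the conv mix the C channels within a frame while still keeping timestamps separate.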