keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

How to transfer this method to 3D situation. #74

Closed xubin04 closed 4 months ago

xubin04 commented 6 months ago

I want to know whether I can effectively transfer this pretraining approach to video prediction tasks. My data has dimensions (batch_size, time_stamp*channel, height, width). Can I mask different regions for different timestamps, and can depth-wise processing handle this masking effectively? And thank you for your great work — this is a big step for CNN pretraining methods.
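The per-timestamp masking idea above can be sketched in PyTorch. This is a minimal illustration, not code from the SparK repo: the function name, the (batch, time*channel, H, W) layout, and the patch size are assumptions taken from the question.

```python
import torch

def mask_per_timestamp(x, t, mask_ratio=0.6, patch=16):
    # Hypothetical helper: x is (B, T*C, H, W) with frames flattened into the
    # channel axis, as described in the question; t is the number of timestamps.
    B, TC, H, W = x.shape
    C = TC // t
    hp, wp = H // patch, W // patch
    # Draw an independent random patch-level mask for every sample and timestamp
    # (True = visible patch), then upsample it to pixel resolution.
    keep = torch.rand(B, t, hp, wp) >= mask_ratio
    mask = keep.repeat_interleave(patch, dim=-2).repeat_interleave(patch, dim=-1)
    # Broadcast each frame's mask over that frame's C channels.
    mask = mask.unsqueeze(2).expand(B, t, C, H, W).reshape(B, TC, H, W)
    return x * mask.to(x.dtype), keep
```

Whether a depth-wise convolution "handles" this depends on the layer: a depth-wise conv never mixes channels, so masked frames stay zero in their own channel group, but any pointwise (1x1) conv afterwards will mix masked and visible frames — which is exactly why SparK-style sparse computation matters for conv nets.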

keyu-tian commented 6 months ago

Thanks @xubin04. Doing masked pretraining on video data is challenging but valuable. I think your depth-wise idea is worth trying, and you could also refer to [1] or [2] for the masking strategies they use in ViT video pretraining.

If you want to pretrain a 3D CNN-like network, I think our codebase can work for you with only a few modifications (data processing, masking, etc.). Feel free to comment here again if you have questions.

[1] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[2] Masked Autoencoders As Spatiotemporal Learners
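For contrast with the per-frame masking proposed in the question, VideoMAE [1] uses "tube" masking: one random spatial mask per sample, shared by every frame, so no patch is visible at some timestamps and hidden at others. A minimal sketch (function name and signature are assumptions, not the SparK or VideoMAE API):

```python
import torch

def tube_mask(batch, frames, hp, wp, mask_ratio=0.9):
    # Hypothetical tube-masking generator: hp x wp is the spatial patch grid.
    # Returns a bool mask of shape (batch, frames, hp, wp); True = masked patch.
    n = hp * wp
    n_mask = int(n * mask_ratio)
    # Pick n_mask patches per sample via a random score argsort.
    idx = torch.rand(batch, n).argsort(dim=1)[:, :n_mask]
    m = torch.zeros(batch, n, dtype=torch.bool)
    m.scatter_(1, idx, True)
    # Share the same spatial mask across the whole time axis ("tube").
    return m.view(batch, 1, hp, wp).expand(batch, frames, hp, wp)
```

Tube masking prevents the model from trivially copying a masked patch from a neighboring frame where it happens to be visible, which is why VideoMAE can afford very high mask ratios (~90%).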