Thanks @xubin04. Doing masked pretraining on video data is challenging but valuable. I think your depth-wise idea is worth trying, and you might also refer to [1] or [2] for their masking strategies in ViT-based video pretraining.
If you want to pretrain a 3D-CNN-like network, I think our codebase can work for you with only a few modifications (data processing, masking, etc.; a rough sketch of the masking part is below the references). Feel free to comment here again if you have questions.
[1] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
[2] Masked Autoencoders As Spatiotemporal Learners
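Not the actual API of this repo, just a minimal PyTorch sketch of the kind of masking change meant above. The function name `per_frame_patch_mask` and the `patch` / `mask_ratio` values are hypothetical; it draws an independent random patch mask for every frame (per-frame random masking, as opposed to VideoMAE-style tube masking, where one mask is shared across all frames):

```python
import torch

def per_frame_patch_mask(B, T, H, W, patch=16, mask_ratio=0.6, device="cpu"):
    """Boolean keep-mask of shape (B, T, H, W); True = visible, False = masked.

    Each (sample, frame) pair gets its own random set of visible patches,
    with an exact keep ratio of (1 - mask_ratio).
    """
    h, w = H // patch, W // patch
    num_patches = h * w
    num_keep = int(num_patches * (1 - mask_ratio))

    # Rank random noise per frame; the `num_keep` lowest ranks are kept.
    noise = torch.rand(B, T, num_patches, device=device)
    keep = noise.argsort(-1).argsort(-1) < num_keep          # (B, T, num_patches)

    # Expand the patch-level mask back to pixel resolution.
    keep = keep.view(B, T, h, w)
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return keep                                              # (B, T, H, W)
```

How you consume this mask (zeroing pixels, dropping patches before a sparse conv, etc.) depends on the rest of your pipeline; the sketch only covers generating a different mask per timestamp.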
I want to know whether I can effectively transfer this pretraining approach to video prediction tasks. My data has dimensions (batch_size, time_stamp*channel, height, width). Can I mask different regions for different timestamps, and can depth-wise processing handle this masking setup effectively? And thank you for your great work, this is a big step for CNN pretraining methods.
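As a concrete illustration of the question above, a self-contained sketch under assumed shapes and hypothetical values (the "depth-wise" layer here is simply `nn.Conv2d` with `groups = T*C`): each timestamp gets its own random patch mask, and the grouped convolution filters every time-channel slice independently, so masks from different timestamps never mix inside that layer.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 224, 224
patch, mask_ratio = 16, 0.6                              # hypothetical values
x = torch.randn(B, T * C, H, W)                          # layout: (batch, time*channel, H, W)

# A different random patch mask per timestamp (shared by that frame's C channels).
h, w = H // patch, W // patch
keep = (torch.rand(B, T, h, w) > mask_ratio)             # roughly 40% of patches stay visible
keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)   # (B, T, H, W)
keep = keep.unsqueeze(2).expand(B, T, C, H, W).reshape(B, T * C, H, W).float()

x_masked = x * keep                                      # zero out masked regions frame by frame

# Depth-wise conv (groups = T*C): each time-channel slice is filtered on its own,
# so the per-timestamp masks cannot leak into each other in this layer.
dw = nn.Conv2d(T * C, T * C, kernel_size=3, padding=1, groups=T * C)
y = dw(x_masked)
print(y.shape)                                           # torch.Size([2, 24, 224, 224])
```

Assuming the channels are ordered frame-major (t0's channels, then t1's, ...), using `groups=T` instead of `groups=T*C` would let the conv mix the C channels within a frame while still keeping timestamps separate.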