Video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). Inspired by the recent ImageMAE, VideoMAE proposes customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, encouraging more effective video representations to be learned during pre-training. Some highlights of VideoMAE:
This is an unofficial Keras implementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. The official PyTorch implementation can be found here.
git clone https://github.com/innat/VideoMAE.git
cd VideoMAE
pip install -e .
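As a quick sanity check of the editable install, the fine-tuned ViT-S builder used later in this README can be instantiated directly (a minimal smoke test, not part of the official examples):

```python
# Minimal smoke test: build the fine-tuned ViT-S model used in the examples below.
from videomae import VideoMAE_ViTS16FT

model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
print(model.count_params())  # compare against the ViT-S row in MODEL_ZOO.md
```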
There are many variants of the VideoMAE model available, i.e. `small`, `base`, `large`, and `huge`, as well as benchmark-specific checkpoints for Kinetics-400, SSV2, and UCF101. Check the release and model zoo pages for details.
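For example, assuming the benchmark-specific builders are used the same way as the `VideoMAE_ViTS16PT` / `VideoMAE_ViTS16FT` classes shown below, switching benchmarks is mostly a matter of the class count (Kinetics-400 has 400 classes, SSV2 has 174, UCF101 has 101); a brief sketch:

```python
from videomae import VideoMAE_ViTS16FT  # ViT-S fine-tuned builder shown in this README

# Same backbone, different benchmark heads (class counts per dataset).
model_k400 = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
model_ssv2 = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=174)
model_ucf101 = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=101)
```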
Only the inference part is provided for the pre-trained VideoMAE models. Using a trained checkpoint, it is possible to reconstruct the input sample even with a high mask ratio. For the end-to-end workflow, check the reconstruction.ipynb notebook. Some highlights:
import tensorflow as tf

from videomae import VideoMAE_ViTS16PT
# TubeMaskingGenerator, read_video and frame_sampling are utility helpers
# shipped with this repository / its notebooks.

# pre-trained self-supervised model
>>> model = VideoMAE_ViTS16PT(img_size=224, patch_size=16)
>>> model.load_weights('TFVideoMAE_S_K400_16x224_PT.h5')  # checkpoint must match the backbone (ViT-S here)
# tube masking: the token grid is (16 frames / tubelet 2, 224 / 16, 224 / 16)
>>> window_size = (8, 14, 14)
>>> tube_mask = TubeMaskingGenerator(
        input_size=window_size,
        mask_ratio=0.75
    )
>>> make_bool = tube_mask()
>>> bool_masked_pos_tf = tf.constant(make_bool, dtype=tf.int32)
>>> bool_masked_pos_tf = tf.expand_dims(bool_masked_pos_tf, axis=0)
>>> bool_masked_pos_tf = tf.cast(bool_masked_pos_tf, tf.bool)
# running
>>> container = read_video('sample.mp4')
>>> frames = frame_sampling(container, num_frames=16)
>>> pred_tf = model(frames, bool_masked_pos_tf)
>>> pred_tf.shape  # masked tokens only: 1176 = 0.75 * 8 * 14 * 14; 1536 = 2 * 16 * 16 * 3
TensorShape([1, 1176, 1536])
Reconstructed results on a sample from SSV2 with `mask_ratio=0.8`.
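For reference, the tube masking used above draws a single random spatial mask and repeats it across every temporal token, so the same patch locations stay hidden in all frames. A minimal NumPy sketch of the idea (an illustration, not the repo's exact `TubeMaskingGenerator`):

```python
import numpy as np

def tube_mask(input_size, mask_ratio=0.9, seed=None):
    """Boolean mask over T*H*W tokens; True marks a masked token."""
    T, H, W = input_size                      # e.g. (8, 14, 14) for 16 frames @ 224px
    rng = np.random.default_rng(seed)
    num_patches = H * W
    num_masked = int(mask_ratio * num_patches)

    # One spatial mask, shuffled once ...
    spatial = np.hstack([
        np.zeros(num_patches - num_masked, dtype=bool),
        np.ones(num_masked, dtype=bool),
    ])
    rng.shuffle(spatial)

    # ... then repeated along the temporal axis: the "tube".
    return np.tile(spatial, T)

mask = tube_mask((8, 14, 14), mask_ratio=0.9)
print(mask.shape, int(mask.sum()))  # (1568,) 1408
```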
With a fine-tuned VideoMAE checkpoint, it is possible to evaluate on the benchmark datasets and also to retrain on a custom dataset. For the end-to-end workflow, check this quick [retraining.ipynb]() notebook, which supports both multi-GPU and TPU-VM retraining and evaluation; a distributed-training sketch also follows the classification example below. Some highlights:
import numpy as np
import tensorflow as tf

from videomae import VideoMAE_ViTS16FT

# fine-tuned model; load a matching checkpoint from MODEL_ZOO.md before inference
>>> model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
>>> container = read_video('sample.mp4')
>>> frames = frame_sampling(container, num_frames=16)
>>> y = model(frames)
>>> y.shape
TensorShape([1, 400])
# logits -> class probabilities
>>> probabilities = tf.nn.softmax(y)
>>> probabilities = probabilities.numpy().squeeze(0)

# label_map_inv: a dict mapping class index -> Kinetics-400 label name
>>> confidences = {
        label_map_inv[i]: float(probabilities[i])
        for i in np.argsort(probabilities)[::-1]
    }
>>> confidences
Classification results on a sample from [Kinetics-400]().
Video | Top-5 |
---|---|
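The retraining notebook mentioned above targets multi-GPU and TPU-VM runs; a minimal sketch of how such a fine-tuning job could be wired up with `tf.distribute` (the optimizer, learning rate, and data pipeline here are placeholders, not the notebook's exact settings):

```python
import tensorflow as tf
from videomae import VideoMAE_ViTS16FT

# MirroredStrategy for a multi-GPU host; use tf.distribute.TPUStrategy on a TPU VM.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # placeholder learning rate
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_ds / val_ds: tf.data pipelines yielding
# (clips of shape [batch, 16, 224, 224, 3], integer labels).
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```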
The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. The following are some highlights.
For Kinetics-400, VideoMAE is trained for around 1600 epochs without any extra data. The following checkpoints are available in both TensorFlow `SavedModel` and `h5` formats.
Backbone | #Frame | Top-1 | Top-5 | Params (M) [FT] | Params (M) [PT] | FLOPs |
---|---|---|---|---|---|---|
ViT-S | 16x5x3 | 79.0 | 93.8 | 22 | 24 | 57G |
ViT-B | 16x5x3 | 81.5 | 95.1 | 87 | 94 | 181G |
ViT-L | 16x5x3 | 85.2 | 96.8 | 304 | 343 | - |
ViT-H | 16x5x3 | 86.6 | 97.1 | 632 | ? | - |
? The official ViT-H backbone of VideoMAE has a weight issue in its pre-trained checkpoint; see https://github.com/MCG-NJU/VideoMAE/issues/89 for details.
Only the FLOPs of the encoder (FT) models are reported.
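Assuming the parameter columns are in millions, the counts can be cross-checked directly from the Keras model objects (shown here for the ViT-S builders used earlier; the other backbones follow the same pattern):

```python
from videomae import VideoMAE_ViTS16PT, VideoMAE_ViTS16FT

ft = VideoMAE_ViTS16FT(img_size=224, patch_size=16, num_classes=400)  # encoder + classification head
pt = VideoMAE_ViTS16PT(img_size=224, patch_size=16)                   # encoder + decoder

# Compare against the ViT-S row of the table above.
print(f"FT params: {ft.count_params() / 1e6:.0f}M")
print(f"PT params: {pt.count_params() / 1e6:.0f}M")
```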
For SSv2, VideoMAE is trained for around 2400 epochs without any extra data.
Backbone | #Frame | Top-1 | Top-5 | Params (M) [FT] | Params (M) [PT] | FLOPs |
---|---|---|---|---|---|---|
ViT-S | 16x2x3 | 66.8 | 90.3 | 22 | 24 | 57G |
ViT-B | 16x2x3 | 70.8 | 92.4 | 86 | 94 | 181G |
For UCF101, VideoMAE is trained for around 3200 epochs without any extra data.
Backbone | #Frame | Top-1 | Top-5 | Params (M) [FT] | Params (M) [PT] | FLOPs |
---|---|---|---|---|---|---|
ViT-B | 16x5x3 | 91.3 | 98.5 | 86 | 94 | 181G |
Some reconstructed video samples produced by VideoMAE with different mask ratios.
Kinetics-400-testset | mask |
---|---|
 | 0.8 |
 | 0.8 |
 | 0.9 |
 | 0.9 |
SSv2-testset | mask |
---|---|
 | 0.9 |
 | 0.9 |
UCF101-testset | mask |
---|---|
 | 0.8 |
 | 0.9 |
TODO: Keras V3 to support a multi-framework backend.

If you use this VideoMAE implementation in your research, please cite it using the metadata from our CITATION.cff file.
@inproceedings{tong2022videomae,
title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}