kkahatapitiya / X3D-Multigrid

PyTorch implementation of X3D models with Multigrid training.
MIT License

X3D No Multigrid #4

Closed RaivoKoot closed 3 years ago

RaivoKoot commented 3 years ago

I am planning to use your implementation in x3d.py in my own training environment to train X3D with a constant batch size. I don't want to use any multigrid features, and I will be using my own dataloaders and datasets. In the model instantiation snippet below, I am unsure about one parameter:

x3d = resnet_x3d.generate_model(x3d_version=X3D_VERSION, n_classes=400, n_input_channels=3,
                                dropout=0.5, base_bn_splits=BASE_BS_PER_GPU//CONST_BN_SIZE)

What is base_bn_splits? If I use a single GPU and a constant batch size, what value do I need to give this parameter? Thanks a lot! @kkahatapitiya

kkahatapitiya commented 3 years ago

In case you don't want to use multigrid training, please follow the training script for charades: train_x3d_charades.py. It will be more convenient.

base_bn_splits=BASE_BS_PER_GPU//CONST_BN_SIZE keeps the batch size seen by the batchnorm operations constant while the input batch size changes. You can simply use base_bn_splits=1 if you want batchnorm to operate on the full input batch.
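To make the arithmetic concrete, here is a minimal sketch. It assumes the value is the number of equal groups that the per-GPU batch is divided into for batchnorm statistics. The helper name bn_splits and the example sizes are hypothetical and are not taken from the repository. Raising an error on a non-divisible batch is also a choice made for this sketch; the original expression uses plain floor division.

```python
def bn_splits(batch_size_per_gpu: int, bn_group_size: int) -> int:
    """Number of equal groups a per-GPU batch is split into for batchnorm.

    Hypothetical helper mirroring BASE_BS_PER_GPU // CONST_BN_SIZE.
    """
    if bn_group_size <= 0:
        raise ValueError("bn_group_size must be positive")
    if batch_size_per_gpu % bn_group_size != 0:
        # Assumption of this sketch: insist on equal-sized groups.
        raise ValueError("batch size must be a multiple of bn_group_size")
    return batch_size_per_gpu // bn_group_size


# Constant batch size on a single GPU: one group, so batchnorm
# normalizes over the whole input batch.
single_gpu_splits = bn_splits(batch_size_per_gpu=8, bn_group_size=8)

# A larger batch with the same group size yields more groups, each of
# which still holds 8 samples.
larger_batch_splits = bn_splits(batch_size_per_gpu=32, bn_group_size=8)
```

With a fixed batch size and no multigrid, the first case applies, which matches the suggestion to pass base_bn_splits=1.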

RaivoKoot commented 3 years ago

Ah, that's perfect. Thank you for the help :)

stalagmite7 commented 3 years ago

Stumbled onto this question, which raises another one for me: does train_x3d_charades.py train without multigrid? I.e., if I wanted to train on a Charades-like dataset using multigrid, would I need to adapt the Kinetics multigrid training script?

kkahatapitiya commented 3 years ago

Yes, that's correct. However, multigrid training is more useful in large-scale datasets such as Kinetics. Even without multigrid, one can train on Charades in a couple of hours.

stalagmite7 commented 3 years ago

Ah, I see, that makes sense. I want to use multigrid because my Charades-like dataset is still pretty large (almost as large as Kinetics). Looking more closely, it seems you assume a sample duration of 16 seconds for Kinetics (with multigrid). Is there a reason why? I'm asking because I'm trying to add the same functionality for my custom dataset.

kkahatapitiya commented 3 years ago

I think Kinetics videos are 25 fps, so let's talk about input duration in number of frames rather than seconds. Yes, the X3D-M architecture takes 16 frames at an input stride of 5 (i.e., it covers a range of 16x5=80 frames). For shorter training schedules, a longer stride is suggested, so in our setting we use 16 frames with a stride of 10 (covering 160 frames). The temporal range of your training input depends on the architecture/dataset and on how much augmentation you provide by randomly sampling clips. Generally, I have seen networks use 64-frame (I3D, SlowFast, ...) or 160-frame (X3D-M) clips for training on Kinetics (~10 s videos at 25 fps = ~250 frames). If your data has similar stats, you can use the same setting. Otherwise, change the stride/number of frames accordingly to provide enough frames at the input while leaving enough temporal padding for randomization. Pre-trained models work best at the given input frame rate (25 fps/stride of 10).
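As a rough illustration of the frame arithmetic and random clip sampling described above, here is a minimal sketch. The function names temporal_span and sample_clip_indices are hypothetical and are not the repository's dataloader. The sketch counts the covered range as frames x stride, as in the comment, and clamps indices for videos shorter than that range, which is an assumption of this example.

```python
import random


def temporal_span(num_frames: int, stride: int) -> int:
    """Range of source frames covered by a clip, counted as frames x stride."""
    return num_frames * stride


def sample_clip_indices(video_len, num_frames=16, stride=10, rng=None):
    """Pick a random clip start and return the strided frame indices.

    Assumption of this sketch: for videos shorter than the covered range,
    the clip starts at frame 0 and indices are clamped to the last frame.
    """
    rng = rng or random.Random()
    span = temporal_span(num_frames, stride)
    # Frames left over after the clip act as temporal padding for randomization.
    max_start = max(video_len - span, 0)
    start = rng.randint(0, max_start)
    return [min(start + i * stride, video_len - 1) for i in range(num_frames)]


# X3D-M with stride 5 covers 80 frames; with stride 10 it covers 160 frames.
span_stride_5 = temporal_span(16, 5)
span_stride_10 = temporal_span(16, 10)

# A ~250-frame video (about 10 s at 25 fps) leaves ~90 frames of padding
# for the random start when the clip covers 160 frames.
indices = sample_clip_indices(250, num_frames=16, stride=10, rng=random.Random(0))
```

For a dataset with a different frame rate or typical video length, the same check applies: the covered range should fit inside a typical video with some frames to spare for the random start.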

stalagmite7 commented 3 years ago

Perfect, this is great insight, thanks very much!