kkahatapitiya / X3D-Multigrid

PyTorch implementation of X3D models with Multigrid training.
MIT License

Why does eval mode degenerate? #8

Open WUSHUANGPPP opened 3 years ago

WUSHUANGPPP commented 3 years ago

Thanks for your clean implementation! @kkahatapitiya, I have two questions for you:

  1. After finishing training X3D on the Kinetics-200 dataset, the predictions in eval mode are always the same, but inference in model.train() is normal. I failed to find the reason. (base_bn_splits=8 or 1 gives the same observation; I trained the model in the normal way.)
  2. Why do some layerx.x.bnx.split_bn.running_var and running_mean buffers stay unchanged throughout the whole training process? [screenshot of the buffer values omitted] As the chart above shows, running_mean and running_var stay the same across all of training. Appreciate it.
kkahatapitiya commented 3 years ago

During training, the split_bn parameters (e.g., self.split_bn.running_mean.data) inside SubBatchNorm are updated, and they are copied to the bn parameters (e.g., self.bn.running_mean.data) before eval, by running https://github.com/kkahatapitiya/X3D-Multigrid/blob/d63d8fe6210d2b38aa26d71b0062b569687d6be2/train_x3d_kinetics_multigrid.py#L205
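
For reference, the aggregation that this step performs is roughly the following (a minimal sketch of the standard sub-batch-norm stat folding; aggregate_stats and the buffer shapes are illustrative assumptions, not the repo's exact code):

    import torch

    def aggregate_stats(split_mean, split_var, num_splits):
        # split_mean/split_var: flattened split_bn buffers, shape (num_splits * C,)
        m = split_mean.view(num_splits, -1)   # per-split means, (k, C)
        v = split_var.view(num_splits, -1)    # per-split variances, (k, C)
        mean = m.mean(dim=0)                  # aggregated mean over the splits
        # law of total variance: E[v_i + m_i^2] - (E[m_i])^2
        var = (v + m ** 2).mean(dim=0) - mean ** 2
        return mean, var

    # the aggregated stats are then copied into the eval-time buffers, e.g.:
    # self.bn.running_mean.data.copy_(mean)
    # self.bn.running_var.data.copy_(var)

If this copy never runs, self.bn keeps its initial running stats, which explains the degenerate eval-mode predictions.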

Are you doing this? If so, things should work properly. Also, what batch size per GPU and how many BN splits are you using?

WUSHUANGPPP commented 3 years ago

Appreciate it. I left out the code you mentioned. I just use your x3d.py and train it as a standard video classification task in my project code, with a batch size of 128 on 8 GPUs (16 per GPU), without setting any multigrid training details.

# ...other backbone...
elif opt.model == 'x3d':
    model = x3d3.generate_model('M', n_classes=opt.classes)
# ...other backbone...

I just use the generate_model(x3d_version, **kwargs) interface to build the X3D model, and then I'd like to modify the backbone to test other training tricks.

WUSHUANGPPP commented 3 years ago

So each epoch we have to run x3d.module.aggregate_sub_bn_stats(), otherwise the bn parameters stay at their initial values? @kkahatapitiya Is there any other configuration like this?

kkahatapitiya commented 3 years ago

You have to run aggregate_sub_bn_stats() before validation (i.e., every time you put the model in eval() mode).
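
A typical epoch loop would look like this (a sketch under assumptions: run_training and its loss/accuracy bookkeeping are placeholders, not code from this repo, and model is assumed to be wrapped in nn.DataParallel, hence .module):

    import torch

    def run_training(model, train_loader, val_loader, optimizer, criterion, num_epochs):
        for epoch in range(num_epochs):
            model.train()
            for clips, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(clips), labels)
                loss.backward()
                optimizer.step()

            # fold per-split BN stats into the eval-time BN buffers
            # BEFORE switching to eval mode
            model.module.aggregate_sub_bn_stats()

            model.eval()
            correct = total = 0
            with torch.no_grad():
                for clips, labels in val_loader:
                    preds = model(clips).argmax(dim=1)
                    correct += (preds == labels).sum().item()
                    total += labels.numel()
            print(f'epoch {epoch}: val acc {correct / total:.4f}')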

WUSHUANGPPP commented 3 years ago

Hi @kkahatapitiya, could you please tell me whether you tested the performance in the normal training setup (not in multigrid training mode)? I trained and tested with a constant batch size of 128 for 350 epochs on Kinetics-200 (a smaller 200-class dataset, so accuracy should be higher), and got 64.0% accuracy, which is similar to ResNet-18's performance on this dataset. (I ran aggregate_sub_bn_stats() each epoch, skipping validation for faster training.) Initial lr: 0.05; optimizer schedule: cosine decay.

kkahatapitiya commented 1 year ago

Sorry about the long delay in responding. Since the data split and multiple training hyperparameters are different, I am not sure what the expected performance would look like. If you train with the given hyperparameters and the default K400 split, you'll get a number closer to what's reported.