kkahatapitiya / X3D-Multigrid

PyTorch implementation of X3D models with Multigrid training.
MIT License

How to set the hyperparameters when doing validation? #6

Closed wanghao14 closed 3 years ago

wanghao14 commented 3 years ago

Hi, thanks a lot for sharing your implementation! I want to use your pretrained model for validation. If I only have one GPU, how should I modify the hyperparameters, especially base_bn_splits used in generate_model? Also, is the model named "x3d_multigrid_kinetics_fb_pretrained.pt" converted from the model provided by Facebook? Looking forward to your reply.

kkahatapitiya commented 3 years ago

You can try BS_UPSCALE=4 and GPUS=1. If it still doesn't fit in memory, reduce BS_UPSCALE further; but in that case, I think you also have to reduce CONST_BN_SIZE by the same factor to reuse the pretrained weights for the batchnorm layers. Otherwise you'll have to initialize the batchnorm layers from scratch.
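The rule above (shrink the BN split count by the same factor as the batch-size upscale) can be sketched as follows. The training-time defaults here are placeholder assumptions for illustration, not the repo's actual values:

```python
# Hedged sketch of the scaling rule: if you reduce BS_UPSCALE by some factor,
# reduce the number of split-BN groups by the same factor so the split-BN
# running-statistic shapes still match the checkpoint.
# TRAIN_BS_UPSCALE and TRAIN_BN_SPLITS are assumed values, not the repo's.
TRAIN_BS_UPSCALE = 16
TRAIN_BN_SPLITS = 8

def bn_splits_for(bs_upscale):
    """Return a BN split count scaled down by the same factor as BS_UPSCALE."""
    factor = TRAIN_BS_UPSCALE // bs_upscale
    return max(1, TRAIN_BN_SPLITS // factor)

print(bn_splits_for(4))  # 4x smaller batch -> 4x fewer BN splits: 2
```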

Yes, x3d_multigrid_kinetics_fb_pretrained.pt contains the weights ported from the FAIR implementation, which was trained with a longer schedule and gives better pretrained accuracy.

wanghao14 commented 3 years ago

@kkahatapitiya Thanks for your reply! I tried BS_UPSCALE=4 and GPUS=1, but got this error:

```
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for bn1.split_bn.running_mean: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for bn1.split_bn.running_var: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for layer1.0.bn1.split_bn.running_mean: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
size mismatch for layer1.0.bn1.split_bn.running_var: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
size mismatch for layer1.0.bn2.split_bn.running_mean: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
size mismatch for layer1.0.bn2.split_bn.running_var: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
size mismatch for layer1.0.bn3.split_bn.running_mean: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for layer1.0.bn3.split_bn.running_var: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for layer1.0.downsample.1.split_bn.running_mean: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for layer1.0.downsample.1.split_bn.running_var: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([96]).
size mismatch for layer1.1.bn1.split_bn.running_mean: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
size mismatch for layer1.1.bn1.split_bn.running_var: copying a param with shape torch.Size([54]) from checkpoint, the shape in current model is torch.Size([216]).
......
```

So I changed BS_UPSCALE to 1 and the error disappeared.
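The 24-vs-96 (and 54-vs-216) mismatches are consistent with a split batchnorm layer that stores running statistics per split, so its buffers have size num_features × num_splits. Here is a minimal sketch (not the repo's actual SubBatchNorm3d implementation) of why the buffer size, and therefore the checkpoint shape, scales with the split count:

```python
import torch
import torch.nn as nn

# Minimal split-BN sketch: each of num_splits sub-batches gets its own
# running statistics, so the underlying BatchNorm buffer holds
# num_features * num_splits entries. This is why a checkpoint saved with
# 1 split (running_mean of size 24) cannot load into a model built with
# 4 splits (which expects size 96).
class SubBatchNorm(nn.Module):
    def __init__(self, num_features, num_splits):
        super().__init__()
        self.num_splits = num_splits
        self.split_bn = nn.BatchNorm1d(num_features * num_splits, affine=False)

    def forward(self, x):
        n, c = x.shape
        # Fold the split dimension into the channel dimension so each
        # split normalizes only its own sub-batch.
        x = x.view(n // self.num_splits, c * self.num_splits)
        x = self.split_bn(x)
        return x.view(n, c)

bn = SubBatchNorm(24, 4)
print(bn.split_bn.running_mean.shape)  # torch.Size([96])
```

Matching the split count to the one used when the checkpoint was saved (here, effectively 1) makes the buffer shapes agree, which is why lowering BS_UPSCALE to 1 resolved the error.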