Using --resume-from raises an error: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

eoozbq commented 2 years ago

Before raising a question, you may need to check the following listed items.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.

INFO - Environment info

sys.platform: linux
Python: 3.8.8 (default, Apr 13 2021, 19:58:26) [GCC 7.3.0]
CUDA available: True
GPU 0: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.1+cu101
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.2-Product Build 20210312 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 10.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
CuDNN 7.6.3
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.2+cu101
OpenCV: 4.5.4
MMCV: 1.4.6
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMAction2: 0.22.0+ccd88e5

2022-08-17 11:06:55,772 - mmaction - INFO - Distributed training: False
2022-08-17 11:06:56,553 - mmaction - INFO - Config:
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='ResNet3dSlowFast',
        pretrained=None,
        resample_rate=8,  # tau
        speed_ratio=8,  # alpha
        channel_ratio=8,  # beta_inv
        slow_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=True,
            conv1_kernel=(1, 7, 7),
            dilations=(1, 1, 1, 1),
            conv1_stride_t=1,
            pool1_stride_t=1,
            inflate=(0, 0, 1, 1),
            norm_eval=False),
        fast_pathway=dict(
            type='resnet3d',
            depth=50,
            pretrained=None,
            lateral=False,
            base_channels=8,
            conv1_kernel=(5, 7, 7),
            conv1_stride_t=1,
            pool1_stride_t=1,
            norm_eval=False)),
    cls_head=dict(
        type='SlowFastHead',
        in_channels=2304,
        num_classes=4,
        spatial_type='avg',
        dropout_ratio=0.5),
    train_cfg=None,
    test_cfg=dict(average_clips='prob', max_testing_views=1))
dataset_type = 'RawframeDataset'
data_root = 'data/ucf101/rawframes/'
data_root_val = 'data/ucf101/rawframes/'
ann_file_train = 'data/ucf101/ucf101_train_split_1_rawframes.txt'
ann_file_val = 'data/ucf101/ucf101_val_split_1_rawframes.txt'
ann_file_test = 'data/ucf101/ucf101_val_split_1_rawframes.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
val_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=2,
        num_clips=1,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
test_pipeline = [
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=2,
        num_clips=10,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='ThreeCrop', crop_size=256),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_bgr=False),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
data = dict(
    videos_per_gpu=8,
    workers_per_gpu=8,
    val_dataloader=dict(workers_per_gpu=1, videos_per_gpu=1),
    test_dataloader=dict(workers_per_gpu=1, videos_per_gpu=1),
    train=dict(
        num_work=0,
        type='RawframeDataset',
        ann_file='data/ucf101/ucf101_train_split_1_rawframes.txt',
        data_prefix='data/ucf101/rawframes/',
        pipeline=[
            dict(
                type='SampleFrames',
                clip_len=32,
                frame_interval=2,
                num_clips=1),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(type='RandomResizedCrop'),
            dict(type='Resize', scale=(224, 224), keep_ratio=False),
            dict(type='Flip', flip_ratio=0.5),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCTHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs', 'label'])
        ]),
    val=dict(
        type='RawframeDataset',
        ann_file='data/ucf101/ucf101_val_split_1_rawframes.txt',
        data_prefix='data/ucf101/rawframes/',
        pipeline=[
            dict(
                type='SampleFrames',
                clip_len=32,
                frame_interval=2,
                num_clips=1,
                test_mode=True),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(type='CenterCrop', crop_size=224),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCTHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]),
    test=dict(
        type='RawframeDataset',
        ann_file='data/ucf101/ucf101_val_split_1_rawframes.txt',
        data_prefix='data/ucf101/rawframes/',
        pipeline=[
            dict(
                type='SampleFrames',
                clip_len=32,
                frame_interval=2,
                num_clips=10,
                test_mode=True),
            dict(type='RawFrameDecode'),
            dict(type='Resize', scale=(-1, 256)),
            dict(type='ThreeCrop', crop_size=256),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_bgr=False),
            dict(type='FormatShape', input_format='NCTHW'),
            dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
            dict(type='ToTensor', keys=['imgs'])
        ]))
optimizer = dict(type='SGD', lr=0.0125, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=40, norm_type=2))
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='linear',
    warmup_by_epoch=True,
    warmup_iters=34)
total_epochs = 256
checkpoint_config = dict(interval=4)
workflow = [('train', 1)]
evaluation = dict(
    interval=5, metrics=['top_k_accuracy', 'mean_class_accuracy'])
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'

I ran this config file after adding an attention module to the resnet3d.py file, and I got the following error:

2022-08-17 11:07:04,887 - mmaction - INFO - load checkpoint from local path: work_dirs/x3dfast-GC/latest.pth
2022-08-17 11:07:06,459 - mmaction - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.slow_path.layer1.0.ContextBlock.conv_mask.weight, backbone.slow_path.layer1.0.ContextBlock.conv_mask.bias, backbone.slow_path.layer1.0.ContextBlock.channel_add_conv.0.weight, backbone.slow_path.layer1.0.ContextBlock.channeladd~~~~~~

Traceback (most recent call last):
  File "tools/train.py", line 216, in <module>
    main()
  File "tools/train.py", line 204, in main
    train_model(
  File "/data/mmaction2/mmaction/apis/train.py", line 222, in train_model
    runner.resume(cfg.resume_from)
  File "/root/anaconda3/envs/mmaction/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 389, in resume
    self.optimizer.load_state_dict(checkpoint['optimizer'])
  File "/root/anaconda3/envs/mmaction/lib/python3.8/site-packages/torch/optim/optimizer.py", line 124, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

I can train this model again, but I can't resume this model from checkpoint file

I tried running this config with the attention module, but when I use --resume-from to continue training, I get the error above. I want to know how to solve this problem.
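
For reference, the ValueError in the traceback above can be reproduced outside MMAction2 with plain PyTorch. The sketch below is only an illustration with toy `nn.Linear` layers, not the SlowFast model: an optimizer state saved for a model with more parameter tensors is loaded into an optimizer built for a model with fewer. The SGD settings are copied from the config above; the model and function names are hypothetical.

```python
import torch
from torch import nn


def build_sgd(model):
    # Same optimizer settings as in the config above.
    return torch.optim.SGD(
        model.parameters(), lr=0.0125, momentum=0.9, weight_decay=0.0001)


# Hypothetical stand-ins: the "saved" model has an extra layer (4 tensors),
# the "current" model does not (2 tensors).
saved_model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
current_model = nn.Linear(4, 2)

saved_optimizer_state = build_sgd(saved_model).state_dict()
current_optimizer = build_sgd(current_model)

try:
    current_optimizer.load_state_dict(saved_optimizer_state)
    error_message = None
except ValueError as exc:
    # Raised because the parameter group sizes differ (4 vs 2).
    error_message = str(exc)

print(error_message)
```

This suggests the check is about how many parameters each optimizer group holds, so any difference in the set of model parameters between saving and resuming can trigger it.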

hukkai commented 2 years ago

@eoozbq What is the weight key "backbone.slow_path.layer1.0.ContextBlock.conv_mask.weight"? It seems that there is no such key in the slowfast model? Did you modify the network?

eoozbq commented 2 years ago

Yes, I changed the SlowFast model. The new model can be trained, but when I use --resume-from, I get this error. I want to know why this error happens.

hukkai commented 2 years ago

The reason might be that your module is added after self.init_weights(). So when you resume your model at init_weights, the model finds unexpected keys. How did you add your module?
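
One way to check this, sketched under assumptions: compare the parameter names and optimizer group sizes stored in the checkpoint with those of the model being resumed. The helper name `compare_with_checkpoint` and the toy models are hypothetical. The `'optimizer'` key appears in the traceback above, while `'state_dict'` is assumed to be the key under which the runner stores the model weights.

```python
import torch
from torch import nn


def compare_with_checkpoint(model, optimizer, checkpoint):
    """Report key and parameter-count differences (hypothetical helper)."""
    ckpt_keys = set(checkpoint['state_dict'])
    model_keys = set(model.state_dict())
    return {
        # Keys saved in the checkpoint that the current model lacks.
        'unexpected': sorted(ckpt_keys - model_keys),
        # Keys the current model has that the checkpoint lacks.
        'missing': sorted(model_keys - ckpt_keys),
        'saved_group_sizes': [
            len(g['params']) for g in checkpoint['optimizer']['param_groups']],
        'current_group_sizes': [
            len(g['params']) for g in optimizer.param_groups],
    }


# Toy stand-ins: the checkpoint was written by a model with an extra layer.
with_attention = nn.Sequential(nn.Conv2d(3, 8, 1), nn.Conv2d(8, 1, 1))
plain = nn.Sequential(nn.Conv2d(3, 8, 1))

checkpoint = {
    'state_dict': with_attention.state_dict(),
    'optimizer': torch.optim.SGD(
        with_attention.parameters(), lr=0.0125, momentum=0.9).state_dict(),
}
plain_optimizer = torch.optim.SGD(plain.parameters(), lr=0.0125, momentum=0.9)

report = compare_with_checkpoint(plain, plain_optimizer, checkpoint)
print(report)
```

For a real run, the in-memory `checkpoint` dict would be replaced by the result of `torch.load(path, map_location='cpu')` on the checkpoint file. A non-empty `unexpected` list together with differing group sizes would match the warning and the ValueError reported above.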

eoozbq commented 2 years ago

I modified my module according to mmaction2's official tutorial and then added it, without paying attention to the order of the initialization operations. Can it run normally after I modify the initialization code?
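
As a rough way to test whether a change to the initialization code helps, the sketch below saves and restores a toy model whose attention branch is registered in `__init__`, so the same parameters exist both when the checkpoint is written and when it is loaded. `ToyBlock`, `build_optimizer` and `round_trip_state_count` are hypothetical names for illustration only; this is not the ContextBlock or the MMAction2 backbone, and it does not go through the MMCV runner.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a backbone block with an attention branch."""

    def __init__(self, channels=8):
        super().__init__()
        self.conv = nn.Conv3d(3, channels, kernel_size=1)
        # Registered here, so its parameters exist before any weights load.
        self.attention = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x):
        feat = self.conv(x)
        return feat * torch.sigmoid(self.attention(feat))


def build_optimizer(model):
    return torch.optim.SGD(
        model.parameters(), lr=0.0125, momentum=0.9, weight_decay=0.0001)


def round_trip_state_count():
    model = ToyBlock()
    optimizer = build_optimizer(model)
    # One training step so the optimizer holds momentum buffers.
    model(torch.randn(1, 3, 2, 4, 4)).mean().backward()
    optimizer.step()
    checkpoint = {
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }

    # "Resume": rebuild the same architecture, then load both states.
    resumed = ToyBlock()
    resumed_optimizer = build_optimizer(resumed)
    resumed.load_state_dict(checkpoint['state_dict'], strict=True)
    resumed_optimizer.load_state_dict(checkpoint['optimizer'])
    return len(resumed_optimizer.state_dict()['state'])


print(round_trip_state_count())
```

If the rebuilt model had a different set of parameters from the saved one, the strict `load_state_dict` call or the optimizer load would be expected to fail instead.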

open-mmlab / mmaction2