How to replace the backbone and head, I replaced the cls_head and then I got an error

Branch

main branch (1.x version, such as v1.0.0, or dev-1.x branch)

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] I have read the documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version.

Environment

System environment: sys.platform: linux Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0] CUDA available: True numpy_random_seed: 812247719 GPU 0: NVIDIA GeForce RTX 3090 CUDA_HOME: /usr/local/cuda NVCC: Cuda compilation tools, release 11.0, V11.0.221 GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 PyTorch: 1.7.0 PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.0
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
CuDNN 8.0.3
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.1 OpenCV: 4.8.0 MMEngine: 0.8.3

Runtime environment: cudnn_benchmark: False mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0} dist_cfg: {'backend': 'nccl'} seed: 812247719 diff_rank_seed: False deterministic: False Distributed launcher: none Distributed training: False GPU number: 1

Describe the bug

I replaced the cls_head in the config file of the swin_transformer.py, replaced the original I3Dhead with Timesformerhead, and then I got an error——RuntimeError: mat1 dim 1 must match mat2 dim 0

Reproduces the problem - code sample

1.code ann_file_test = '/mmaction2/classvideo/val1.txt' ann_file_train = '/mmaction2/classvideo/train1.txt' ann_file_val = '/mmaction2/classvideo/val1.txt' auto_scale_lr = dict(base_batch_size=8, enable=False) data_root = '/mmaction2/classvideo/train' data_root_val = '/mmaction2/classvideo/val' dataset_type = 'VideoDataset' default_hooks = dict( checkpoint=dict( interval=3, max_keep_ckpts=3, save_best='auto', type='CheckpointHook'), logger=dict(ignore_last=False, interval=20, type='LoggerHook'), param_scheduler=dict(type='ParamSchedulerHook'), runtime_info=dict(type='RuntimeInfoHook'), sampler_seed=dict(type='DistSamplerSeedHook'), sync_buffers=dict(type='SyncBuffersHook'), timer=dict(type='IterTimerHook')) default_scope = 'mmaction' env_cfg = dict( cudnn_benchmark=False, dist_cfg=dict(backend='nccl'), mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0)) file_client_args = dict(io_backend='disk') launcher = 'none' load_from = None log_level = 'INFO' log_processor = dict(by_epoch=True, type='LogProcessor', window_size=20) model = dict( backbone=dict( arch='base', attn_drop_rate=0.0, drop_path_rate=0.3, drop_rate=0.0, mlp_ratio=4.0, patch_norm=True, patch_size=( 2, 4, 4, ), pretrained= 'https://download.openmmlab.com/mmaction/v1.0/recognition/swin/swin_base_patch4_window7_224.pth', pretrained2d=True, qk_scale=None, qkv_bias=True, type='SwinTransformer3D', window_size=( 8, 7, 7, )), cls_head=dict( average_clips='prob', in_channels=1024, num_classes=7, type='TimeSformerHead'), data_preprocessor=dict( format_shape='NCTHW', mean=[ 123.675, 116.28, 103.53, ], std=[ 58.395, 57.12, 57.375, ], type='ActionDataPreprocessor'), type='Recognizer3D') optim_wrapper = dict( constructor='SwinOptimWrapperConstructor', optimizer=dict( betas=( 0.9, 0.999, ), lr=0.0001, type='AdamW', weight_decay=0.05), paramwise_cfg=dict( absolute_pos_embed=dict(decay_mult=0.0), backbone=dict(lr_mult=0.1), norm=dict(decay_mult=0.0), relative_position_bias_table=dict(decay_mult=0.0)), type='AmpOptimWrapper') param_scheduler = [ dict( begin=0, by_epoch=True, convert_to_iter_based=True, end=2.5, start_factor=0.1, type='LinearLR'), dict( T_max=30, begin=0, by_epoch=True, end=30, eta_min=0, type='CosineAnnealingLR'), ] randomness = dict(deterministic=False, diff_rank_seed=False, seed=None) resume = False test_cfg = dict(type='TestLoop') test_dataloader = dict( batch_size=1, dataset=dict( ann_file='/mmaction2/classvideo/val1.txt', data_prefix=dict(video='/mmaction2/classvideo/val'), pipeline=[ dict(io_backend='disk', type='DecordInit'), dict( clip_len=32, frame_interval=2, num_clips=4, test_mode=True, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 224, ), type='Resize'), dict(crop_size=224, type='ThreeCrop'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ], test_mode=True, type='VideoDataset'), num_workers=8, persistent_workers=True, sampler=dict(shuffle=False, type='DefaultSampler')) test_evaluator = dict(type='AccMetric') test_pipeline = [ dict(io_backend='disk', type='DecordInit'), dict( clip_len=32, frame_interval=2, num_clips=4, test_mode=True, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 224, ), type='Resize'), dict(crop_size=224, type='ThreeCrop'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ] train_cfg = dict( max_epochs=30, type='EpochBasedTrainLoop', val_begin=1, val_interval=1) train_dataloader = dict( batch_size=2, dataset=dict( ann_file='/mmaction2/classvideo/train1.txt', data_prefix=dict(video='/mmaction2/classvideo/train'), pipeline=[ dict(io_backend='disk', type='DecordInit'), dict( clip_len=32, frame_interval=2, num_clips=1, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 256, ), type='Resize'), dict(type='RandomResizedCrop'), dict(keep_ratio=False, scale=( 224, 224, ), type='Resize'), dict(flip_ratio=0.5, type='Flip'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ], type='VideoDataset'), num_workers=8, persistent_workers=True, sampler=dict(shuffle=True, type='DefaultSampler')) train_pipeline = [ dict(io_backend='disk', type='DecordInit'), dict(clip_len=32, frame_interval=2, num_clips=1, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 256, ), type='Resize'), dict(type='RandomResizedCrop'), dict(keep_ratio=False, scale=( 224, 224, ), type='Resize'), dict(flip_ratio=0.5, type='Flip'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ] val_cfg = dict(type='ValLoop') val_dataloader = dict( batch_size=2, dataset=dict( ann_file='/mmaction2/classvideo/val1.txt', data_prefix=dict(video='/mmaction2/classvideo/val'), pipeline=[ dict(io_backend='disk', type='DecordInit'), dict( clip_len=32, frame_interval=2, num_clips=1, test_mode=True, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 256, ), type='Resize'), dict(crop_size=224, type='CenterCrop'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ], test_mode=True, type='VideoDataset'), num_workers=8, persistent_workers=True, sampler=dict(shuffle=False, type='DefaultSampler')) val_evaluator = dict(type='AccMetric') val_pipeline = [ dict(io_backend='disk', type='DecordInit'), dict( clip_len=32, frame_interval=2, num_clips=1, test_mode=True, type='SampleFrames'), dict(type='DecordDecode'), dict(scale=( -1, 256, ), type='Resize'), dict(crop_size=224, type='CenterCrop'), dict(input_format='NCTHW', type='FormatShape'), dict(type='PackActionInputs'), ] vis_backends = [ dict(type='LocalVisBackend'), ] visualizer = dict( type='ActionVisualizer', vis_backends=[ dict(type='LocalVisBackend'), ]) work_dir = './work_dirs/swin-base-test'

2.change something I replaced the cls_head in the config file of the swin_transformer.py, replaced the original I3Dhead with Timesformerhead, and then I got an error. The following code is the original I3Dhead and the replaced head： cls_head=dict( type='I3DHead', in_channels=768, num_classes=400, spatial_type='avg', dropout_ratio=0.5, average_clips='prob') replaced head： cls_head=dict( type='TimeSformerHead', num_classes=7, in_channels=1024, average_clips='prob')

Reproduces the problem - command or script

python tools/train.py checkpiont/swin-base-test.py

Reproduces the problem - error message

Traceback (most recent call last): File "tools/train.py", line 135, in main() File "tools/train.py", line 131, in main runner.train() File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1745, in train model = self.train_loop.run() # type: ignore File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run self.run_epoch() File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch self.run_iter(idx, data_batch) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter outputs = self.runner.model.train_step( File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 114, in train_step losses = self._run_forward(data, mode='loss') # type: ignore File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmengine/model/base_model/base_model.py", line 340, in _run_forward results = self(data, mode=mode) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(input, kwargs) File "/mmaction2/mmaction/models/recognizers/base.py", line 262, in forward return self.loss(inputs, data_samples, kwargs) File "/mmaction2/mmaction/models/recognizers/base.py", line 176, in loss loss_cls = self.cls_head.loss(feats, data_samples, loss_kwargs) File "/mmaction2/mmaction/models/heads/base.py", line 99, in loss cls_scores = self(feats, kwargs) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(input, kwargs) File "/mmaction2/mmaction/models/heads/timesformer_head.py", line 60, in forward cls_score = self.fc_cls(x) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward return F.linear(input, self.weight, self.bias) File "/usr/local/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear output = input.matmul(weight.t()) RuntimeError: mat1 dim 1 must match mat2 dim 0

Additional information

I use videodataset and I'm using swin.py's config for action recognition, and I want to make the new model of different components work by replacing backbone, head, etc So I would like to know how to replace components such as backbone and head in the model, as well as the things and processes that need to be paid attention to.

open-mmlab / mmaction2