open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0
5.69k stars 1.23k forks

Error during evaluation. #1910

Open amartincirera opened 1 year ago

amartincirera commented 1 year ago

Hello,

I'm training a DEKR pose model on my own dataset. After preparing all the config files and adding a new dataset class, I started training, but after a few epochs I got this error during evaluation:

2022-12-31 16:27:32,881 - mmpose - INFO - Epoch [2][850/884] lr: 1.000e-03, eta: 18:10:46, time: 0.548, data_time: 0.000, memory: 11201, loss_hms: 0.0002, loss_ofs: 0.0001, loss: 0.0003
2022-12-31 16:27:51,362 - mmpose - INFO - Saving checkpoint at 2 epochs
[ ] 0/780, elapsed: 0s, ETA:/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/functional.py:4003: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn(
Traceback (most recent call last):
  File "tools/train.py", line 201, in <module>
    main()
  File "tools/train.py", line 190, in main
    train_model(
  File "/home/martina/mmpose/mmpose/apis/train.py", line 213, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
    self.call_hook('after_train_epoch')
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 275, in _do_evaluate
    results = self.test_fn(runner.model, self.dataloader)
  File "/home/martina/mmpose/mmpose/apis/test.py", line 33, in single_gpu_test
    result = model(return_loss=False, **data)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 51, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
    return old_func(*args, **kwargs)
  File "/home/martina/mmpose/mmpose/models/detectors/one_stage.py", line 136, in forward
    return self.forward_test(
  File "/home/martina/mmpose/mmpose/models/detectors/one_stage.py", line 373, in forward_test
    re_scores = self.rescore_net(np.stack(preds, axis=0), skeleton)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/martina/mmpose/mmpose/models/utils/rescore.py", line 70, in forward
    x = self.relu(self.l1(feature))
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/martina/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x33 and 74x256)

Also, when I use this model to run inference on a random image, I get the same error. Am I doing something wrong when I'm training the model? It seems that at some point the shapes do not match, but I don't know what I'm doing wrong.

Please, I would appreciate any help. Thanks

ly015 commented 1 year ago

Hi, sorry for the late reply. Could you please provide the MMPose version and the config file?

amartincirera commented 1 year ago

Dear all

_base_ = ['_base_/default_runtime.py', '_base_/datasets/horse.py']
checkpoint_config = dict(interval=20)
evaluation = dict(interval=20, metric='mAP', save_best='AP')

optimizer = dict(
    type='Adam',
    lr=0.001,
)
optimizer_config = dict(grad_clip=None)

# learning policy
lr_config = dict(policy='step', step=[90, 120])
total_epochs = 140
channel_cfg = dict(
    dataset_joints=17,
    dataset_channel=[
        [0, 1, 2, 3, 4, 5, 6, 7, 8],
    ],
    inference_channel=[0, 1, 2, 3, 4, 5, 6, 7, 8])

data_cfg = dict(
    image_size=512,
    base_size=256,
    base_sigma=2,
    heatmap_size=[128],
    num_joints=channel_cfg['dataset_joints'],
    dataset_channel=channel_cfg['dataset_channel'],
    inference_channel=channel_cfg['inference_channel'],
    num_scales=1,
    scale_aware_sigma=False,
    with_bbox=True,
    use_nms=True,
    soft_nms=False,
    oks_thr=0.8,
)

# model settings
model = dict(
    type='CID',
    pretrained='https://download.openmmlab.com/mmpose/'
    'pretrain_models/hrnet_w32-36af842e.pth',
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(32, 64)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(32, 64, 128)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(32, 64, 128, 256),
                multiscale_output=True)),
    ),
    keypoint_head=dict(
        type='CIDHead',
        in_channels=480,
        gfd_channels=32,
        num_joints=9,
        multi_hm_loss_factor=1.0,
        single_hm_loss_factor=4.0,
        contrastive_loss_factor=1.0,
        max_train_instances=200,
        prior_prob=0.01),
    train_cfg=dict(),
    test_cfg=dict(
        num_joints=channel_cfg['dataset_joints'],
        flip_test=True,
        max_num_people=2,
        detection_threshold=0.01,
        center_pool_kernel=3))

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='BottomUpRandomAffine',
        rot_factor=30,
        scale_factor=[0.75, 1.5],
        scale_type='short',
        trans_factor=40),
    dict(type='BottomUpRandomFlip', flip_prob=0.5),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='CIDGenerateTarget',
        max_num_people=2,
    ),
    dict(
        type='Collect',
        keys=[
            'img', 'multi_heatmap', 'multi_mask', 'instance_coord',
            'instance_heatmap', 'instance_mask', 'instance_valid'
        ],
        meta_keys=[]),
]

val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='BottomUpGetImgSize', test_scale_factor=[1]),
    dict(
        type='BottomUpResizeAlign',
        transforms=[
            dict(type='ToTensor'),
            dict(
                type='NormalizeTensor',
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
        ]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'aug_data', 'test_scale_factor', 'base_size',
            'center', 'scale', 'flip_index'
        ]),
]

test_pipeline = val_pipeline

data_root = '/home/martina/samba_shareGPU/KeyPointsHorseDataset/'
data = dict(
    workers_per_gpu=2,
    train_dataloader=dict(samples_per_gpu=20),
    val_dataloader=dict(samples_per_gpu=1),
    test_dataloader=dict(samples_per_gpu=1),
    train=dict(
        type='BottomUpHorseKeyPoints',
        ann_file=f'{data_root}/train.json',
        img_prefix=f'{data_root}/images/',
        data_cfg=data_cfg,
        pipeline=train_pipeline,
        dataset_info={{_base_.dataset_info}}),
    val=dict(
        type='BottomUpHorseKeyPoints',
        ann_file=f'{data_root}/val.json',
        img_prefix=f'{data_root}/images/',
        data_cfg=data_cfg,
        pipeline=val_pipeline,
        dataset_info={{_base_.dataset_info}}),
    test=dict(
        type='BottomUpHorseKeyPoints',
        ann_file=f'{data_root}/val.json',
        img_prefix=f'{data_root}/images/',
        data_cfg=data_cfg,
        pipeline=test_pipeline,
        dataset_info={{_base_.dataset_info}}),
)

And the version is 0.29.

Thank you for your help. I don't know what I'm doing wrong.

BR, Albert

Jaykob commented 1 year ago

I can confirm this as I'm facing the same problem...

ly015 commented 1 year ago

@Ben-Louis Could you please have a look at this issue?

Ben-Louis commented 1 year ago

Hello everyone, I have a question regarding the specific model that you are currently training. Is it the DEKR or CID model?

If you are training the DEKR model and are encountering the following error message: RuntimeError: mat1 and mat2 shapes cannot be multiplied, the issue may be due to a mismatch between the pose definition of your dataset and COCO.

In the DEKR model, the rescore_cfg configuration sets up a rescore net that estimates OKS during post-processing. However, this rescore net only works for poses defined as in COCO and cannot be used with custom datasets. It is therefore advisable to remove the rescore_cfg item when training DEKR on your own dataset.
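The two numbers in the error message (33 and 74) can be accounted for with some simple arithmetic. The sketch below is only an illustration and rests on an assumption about how the rescore feature is laid out: two offset values and one length per skeleton link, plus one score per keypoint. With the COCO definition (17 keypoints, 19 skeleton links) this gives 74. The 9 keypoints come from the config posted above. The 8 links are a hypothetical value chosen because it reproduces the 33 in the traceback, as the thread does not state the skeleton. The function names are made up for this example.

```python
def rescore_feature_dim(num_keypoints: int, num_links: int) -> int:
    """Length of the rescore-net input vector under the assumed layout."""
    offsets = 2 * num_links  # (dx, dy) per skeleton link
    lengths = num_links      # one length per skeleton link
    scores = num_keypoints   # one confidence score per keypoint
    return offsets + lengths + scores


def can_multiply(mat1_shape, mat2_shape) -> bool:
    """mat1 @ mat2 is defined only if mat1's columns equal mat2's rows."""
    return mat1_shape[1] == mat2_shape[0]


coco_dim = rescore_feature_dim(num_keypoints=17, num_links=19)
custom_dim = rescore_feature_dim(num_keypoints=9, num_links=8)  # 8 links: assumption

print(coco_dim, custom_dim)                       # 74 33
print(can_multiply((1, custom_dim), (74, 256)))   # False
print(can_multiply((1, coco_dim), (74, 256)))     # True
```

Under these assumptions, a first layer pre-trained for a 74-value input cannot accept a 33-value feature, which matches the `1x33 and 74x256` mismatch.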

Jaykob commented 1 year ago

Thanks for your help! I’m trying to train a DEKR model with my own dataset, which is based on COCO but has only 3 keypoints instead of 17. Any advice on how to solve this error in my case? I tried to play around with the rescore_cfg but didn’t know what to put in for in_channels.

I think I have already fixed another problem concerning the normalization indexes, which of course need to be lower when training only 3 joints.

Ben-Louis commented 1 year ago

To ensure consistency with the authors' approach, we utilize the pre-trained rescore net without updating it during our training. However, if modifications are made to the in_channels, the rescore net will be initialized randomly instead of using the pre-trained weights. This can lead to a degradation in the accuracy of our model predictions, as the randomly initialized rescore net may not perform as well as the pre-trained version. Therefore, if you plan to train models using your own dataset, we recommend removing the rescore_cfg option, as this net does not significantly enhance model accuracy.
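As a rough illustration of that recommendation, the sketch below removes every `rescore_cfg` entry from a config expressed as nested Python dicts. The `model` dict is a hypothetical stand-in and not a real MMPose config. Only the key names `rescore_cfg`, `in_channels` and `norm_indexes` and the values 74, 5 and 6 appear in this thread. Where `rescore_cfg` actually sits may differ between MMPose versions, which is why the helper searches recursively. In practice, deleting the entry by hand from the config file achieves the same thing.

```python
import copy


def drop_rescore_cfg(cfg: dict) -> dict:
    """Return a copy of cfg with every 'rescore_cfg' key removed, at any depth."""
    cleaned = {}
    for key, value in cfg.items():
        if key == 'rescore_cfg':
            continue  # skip the rescore net settings entirely
        if isinstance(value, dict):
            cleaned[key] = drop_rescore_cfg(value)
        else:
            cleaned[key] = copy.deepcopy(value)
    return cleaned


# Hypothetical stand-in for the relevant part of a DEKR config.
model = dict(
    test_cfg=dict(
        num_joints=3,
        flip_test=True,
        rescore_cfg=dict(in_channels=74, norm_indexes=(5, 6)),
    ),
)

model_without_rescore = drop_rescore_cfg(model)
print(model_without_rescore)
```

The original `model` dict is left untouched, so the two variants can be compared side by side.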

Jaykob commented 1 year ago

Ok! That’s what I tried first but then I got an index out of bounds error because of the normalization, where the torso keypoints are expected to be at index 5 and 6 (if I remember correctly) and the array is only of length 3 in my case. That would be in keypoint_eval.py in _calc_distances()

So I tried to change norm_indexes using the rescore_cfg to prevent this.
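An out-of-bounds error of this kind can be caught before evaluation with a simple range check. The helper below is a hypothetical sketch and not part of MMPose. It assumes only that the normalization indexes address an array of known length, 3 in the case described here.

```python
def check_norm_indexes(norm_indexes, length):
    """Raise IndexError if any normalization index is outside range(length)."""
    bad = [i for i in norm_indexes if not 0 <= i < length]
    if bad:
        raise IndexError(
            f'norm_indexes {tuple(norm_indexes)} out of range '
            f'for an array of length {length}: {bad}')
    return tuple(norm_indexes)


print(check_norm_indexes((0, 1), 3))  # valid for 3 entries

try:
    check_norm_indexes((5, 6), 3)     # the values mentioned above
except IndexError as err:
    print(err)
```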

Jaykob commented 1 year ago

BTW: I'm using the evaluators below. Is that probably why I'm running into these problems? I don't have COCO-format validation files for my dataset, which is why I'm using these metrics; they work fine with other networks.

val_evaluator = [
    dict(type='PCKAccuracy', thr=0.2),
    dict(type='AUC'),
    dict(type='EPE'),
]
test_evaluator = val_evaluator