open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Assertion error after training #2383

Closed kdavidlp123 closed 1 year ago

kdavidlp123 commented 1 year ago

Environment

OrderedDict([('sys.platform', 'win32'), ('Python', '3.8.16 (default, Mar 2 2023, 03:18:16) [MSC v.1916 64 bit (AMD64)]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1'), ('NVCC', 'Cuda compilation tools, release 11.1, V11.1.105'), ('MSVC', 'Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30148 for x64'), ('GCC', 'n/a'), ('PyTorch', '1.9.1+cu111'), ('PyTorch compiling details', 'PyTorch built with:\n - C++ Version: 199711\n - MSVC 192829337\n - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)\n - OpenMP 2019\n - CPU capability usage: AVX2\n - CUDA Runtime 11.1\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.0.5\n - Magma 2.5.4\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=C:/w/b/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/w/b/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, \n'), ('TorchVision', '0.10.1+cu111'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.3'), ('MMPose', '1.0.0+2c4a60e')])

Reproduces the problem - code sample

default_scope = 'mmpose'
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(
        type='CheckpointHook',
        interval=10,
        save_best='coco/AP',
        rule='greater'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='PoseVisualizationHook', enable=False))
custom_hooks = [dict(type='SyncBuffersHook')]
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='PoseLocalVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_processor = dict(
    type='LogProcessor', window_size=50, by_epoch=True, num_digits=6)
log_level = 'INFO'
load_from = None
resume = False
backend_args = dict(backend='local')
train_cfg = dict(by_epoch=True, max_epochs=210, val_interval=10)
val_cfg = dict()
test_cfg = dict()
dataset_info = dict(
    dataset_name='custom',
    paper_info=dict(
        author=
        "Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\\'a}r, Piotr and Zitnick, C Lawrence",
        title='Microsoft coco: Common objects in context',
        container='European conference on computer vision',
        year='2014',
        homepage='http://cocodataset.org/'),
    keypoint_info=dict({
        0:
        dict(name='0', id=0, color=[255, 0, 0], type='', swap=''),
        1:
        dict(name='1', id=1, color=[255, 0, 0], type='', swap=''),
        2:
        dict(name='2', id=2, color=[255, 0, 0], type='', swap=''),
        3:
        dict(name='3', id=3, color=[255, 0, 0], type='', swap=''),
        4:
        dict(name='4', id=4, color=[255, 0, 0], type='', swap=''),
        5:
        dict(name='5', id=5, color=[255, 0, 0], type='', swap=''),
        6:
        dict(name='6', id=6, color=[255, 0, 0], type='', swap=''),
        7:
        dict(name='7', id=7, color=[255, 0, 0], type='', swap=''),
        8:
        dict(name='8', id=8, color=[255, 0, 0], type='', swap=''),
        9:
        dict(name='9', id=9, color=[255, 0, 0], type='', swap=''),
        10:
        dict(name='10', id=10, color=[255, 0, 0], type='', swap=''),
        11:
        dict(name='11', id=11, color=[255, 0, 0], type='', swap=''),
        12:
        dict(name='12', id=12, color=[255, 0, 0], type='', swap=''),
        13:
        dict(name='13', id=13, color=[255, 0, 0], type='', swap=''),
        14:
        dict(name='14', id=14, color=[255, 0, 0], type='', swap=''),
        15:
        dict(name='15', id=15, color=[255, 0, 0], type='', swap=''),
        16:
        dict(name='16', id=16, color=[255, 0, 0], type='', swap=''),
        17:
        dict(name='17', id=17, color=[255, 0, 0], type='', swap=''),
        18:
        dict(name='18', id=18, color=[255, 0, 0], type='', swap=''),
        19:
        dict(name='19', id=19, color=[255, 0, 0], type='', swap=''),
        20:
        dict(name='20', id=20, color=[255, 0, 0], type='', swap='')
    }),
    skeleton_info=dict(),
    joint_weights=[
        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
        1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
    ],
    sigmas=[
        0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047,
        0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047, 0.047,
        0.047
    ])
optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005))
param_scheduler = [
    dict(
        type='LinearLR', begin=0, end=500, start_factor=0.001, by_epoch=False),
    dict(
        type='MultiStepLR',
        begin=0,
        end=210,
        milestones=[170, 200],
        gamma=0.1,
        by_epoch=True)
]
auto_scale_lr = dict(base_batch_size=512)
codec = dict(
    type='MSRAHeatmap', input_size=(288, 384), heatmap_size=(72, 96), sigma=3)
channel_cfg = dict(
    num_output_channels=21,
    dataset_joints=21,
    dataset_channel=[[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20
    ]],
    inference_channel=[
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
        20
    ])
model = dict(
    type='TopdownPoseEstimator',
    data_preprocessor=dict(
        type='PoseDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True),
    backbone=dict(
        type='ResNet',
        depth=152,
        init_cfg=dict(type='Pretrained',
                      checkpoint='torchvision://resnet152')),
    head=dict(
        type='HeatmapHead',
        in_channels=2048,
        out_channels=21,
        loss=dict(type='KeypointMSELoss', use_target_weight=True),
        decoder=dict(
            type='MSRAHeatmap',
            input_size=(288, 384),
            heatmap_size=(72, 96),
            sigma=3)),
    test_cfg=dict(flip_test=True, flip_mode='heatmap', shift_heatmap=True))
dataset_type = 'CocoDataset'
data_mode = 'topdown'
data_root = 'data/'
train_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='RandomFlip', direction='horizontal'),
    dict(type='RandomHalfBody'),
    dict(type='RandomBBoxTransform'),
    dict(type='TopdownAffine', input_size=(288, 384)),
    dict(
        type='GenerateTarget',
        encoder=dict(
            type='MSRAHeatmap',
            input_size=(288, 384),
            heatmap_size=(72, 96),
            sigma=3)),
    dict(type='PackPoseInputs')
]
val_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=(288, 384)),
    dict(type='PackPoseInputs')
]
train_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='CocoDataset',
        data_root='data/',
        data_mode='topdown',
        ann_file='new_train_json/train.json',
        data_prefix=dict(img='new_train/'),
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='RandomFlip', direction='horizontal'),
            dict(type='RandomHalfBody'),
            dict(type='RandomBBoxTransform'),
            dict(type='TopdownAffine', input_size=(288, 384)),
            dict(
                type='GenerateTarget',
                encoder=dict(
                    type='MSRAHeatmap',
                    input_size=(288, 384),
                    heatmap_size=(72, 96),
                    sigma=3)),
            dict(type='PackPoseInputs')
        ],
        metainfo=dict(from_file='configs/_base_/datasets/custom.py')))
val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type='CocoDataset',
        data_root='data/',
        data_mode='topdown',
        ann_file='new_test_json/test.json',
        data_prefix=dict(img='new_test/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='TopdownAffine', input_size=(288, 384)),
            dict(type='PackPoseInputs')
        ]))
test_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type='CocoDataset',
        data_root='data/',
        data_mode='topdown',
        ann_file='new_test_json/test.json',
        data_prefix=dict(img='new_test/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='TopdownAffine', input_size=(288, 384)),
            dict(type='PackPoseInputs')
        ]))
val_evaluator = dict(
    type='CocoMetric', ann_file='data/new_test_json/test.json')
test_evaluator = dict(
    type='CocoMetric', ann_file='data/new_test_json/test.json')
launcher = 'none'
work_dir = 'training_log'

Reproduces the problem - command or script

This is the training command I used:

(openmmlab) PS D:\mmpose> python tools/train.py  configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_res152_8xb32-210e_coco-384x288.py --work-dir training_log

Reproduces the problem - error message

05/21 16:34:36 - mmengine - INFO - Checkpoints will be saved to D:\mmpose\training_log.
05/21 16:35:11 - mmengine - INFO - Epoch(train)   [1][50/91]  lr: 4.954910e-05  eta: 3:42:15  time: 0.699657  data_time: 0.125292  memory: 15535  loss: 0.003113  loss_kpt: 0.003113  acc_pose: 0.695949
05/21 16:35:32 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:35:59 - mmengine - INFO - Epoch(train)   [2][50/91]  lr: 1.406403e-04  eta: 3:04:39  time: 0.530763  data_time: 0.021162  memory: 15535  loss: 0.000698  loss_kpt: 0.000698  acc_pose: 0.967511
05/21 16:36:21 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:36:47 - mmengine - INFO - Epoch(train)   [3][50/91]  lr: 2.317315e-04  eta: 2:57:37  time: 0.536121  data_time: 0.020744  memory: 15535  loss: 0.000409  loss_kpt: 0.000409  acc_pose: 0.983145
05/21 16:37:09 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:37:36 - mmengine - INFO - Epoch(train)   [4][50/91]  lr: 3.228226e-04  eta: 2:54:00  time: 0.548706  data_time: 0.020879  memory: 15535  loss: 0.000316  loss_kpt: 0.000316  acc_pose: 0.998413
05/21 16:37:57 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:38:23 - mmengine - INFO - Epoch(train)   [5][50/91]  lr: 4.139138e-04  eta: 2:50:52  time: 0.530613  data_time: 0.019982  memory: 15535  loss: 0.000258  loss_kpt: 0.000258  acc_pose: 1.000000
05/21 16:38:45 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:39:12 - mmengine - INFO - Epoch(train)   [6][50/91]  lr: 5.000000e-04  eta: 2:49:28  time: 0.544714  data_time: 0.021153  memory: 15535  loss: 0.000238  loss_kpt: 0.000238  acc_pose: 0.993846
05/21 16:39:33 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:40:00 - mmengine - INFO - Epoch(train)   [7][50/91]  lr: 5.000000e-04  eta: 2:47:37  time: 0.534784  data_time: 0.021027  memory: 15535  loss: 0.000230  loss_kpt: 0.000230  acc_pose: 0.992089
05/21 16:40:22 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:40:49 - mmengine - INFO - Epoch(train)   [8][50/91]  lr: 5.000000e-04  eta: 2:46:34  time: 0.535202  data_time: 0.021158  memory: 15535  loss: 0.000166  loss_kpt: 0.000166  acc_pose: 0.998464
05/21 16:41:10 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:41:37 - mmengine - INFO - Epoch(train)   [9][50/91]  lr: 5.000000e-04  eta: 2:45:03  time: 0.531518  data_time: 0.021564  memory: 15535  loss: 0.000148  loss_kpt: 0.000148  acc_pose: 0.995440
05/21 16:41:58 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:42:25 - mmengine - INFO - Epoch(train)  [10][50/91]  lr: 5.000000e-04  eta: 2:43:56  time: 0.538365  data_time: 0.020905  memory: 15535  loss: 0.000131  loss_kpt: 0.000131  acc_pose: 0.995228
05/21 16:42:46 - mmengine - INFO - Exp name: td-hm_res152_8xb32-210e_coco-384x288_20230521_163419
05/21 16:42:46 - mmengine - INFO - Saving checkpoint at 10 epochs
Traceback (most recent call last):
  File "tools/train.py", line 160, in <module>
    main()
  File "tools/train.py", line 156, in main
    model = self.train_loop.run()  # type: ignore
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\mmengine\runner\loops.py", line 102, in run
    self.runner.val_loop.run()
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\mmengine\runner\loops.py", line 363, in run
    self.run_iter(idx, data_batch)
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\mmengine\runner\loops.py", line 383, in run_iter
    outputs = self.runner.model.val_step(data_batch)
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\mmengine\model\base_model\base_model.py", line 133, in val_step
    return self._run_forward(data, mode='predict')  # type: ignore
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\mmengine\model\base_model\base_model.py", line 340, in _run_forward
    results = self(**data, mode=mode)
  File "C:\Users\user\anaconda3\envs\openmmlab\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "d:\mmpose\mmpose\models\pose_estimators\base.py", line 140, in forward
    return self.predict(inputs, data_samples)
  File "d:\mmpose\mmpose\models\pose_estimators\topdown.py", line 109, in predict
    preds = self.head.predict(feats, data_samples, test_cfg=self.test_cfg)
  File "d:\mmpose\mmpose\models\heads\heatmap_heads\heatmap_head.py", line 261, in predict
    _batch_heatmaps_flip = flip_heatmaps(
  File "d:\mmpose\mmpose\models\utils\tta.py", line 39, in flip_heatmaps
    assert len(flip_indices) == heatmaps.shape[1]

Training stopped with this assertion error right after the epoch-10 checkpoint was saved, when the first validation run started.

Additional information

I hit the assertion error shown above during training. Is this bug the reason no test or evaluation results were produced, or did I miss some arguments needed to visualize the test results? Thank you!

Tau-J commented 1 year ago

Thanks for using MMPose. You need to specify swap for each keypoint in the keypoint_info of your dataset info to define the flip pairs that RandomFlip requires. Please refer to the docs for more details.
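
To make the reply above concrete, here is a minimal sketch of how swap entries translate into a list of flip indices, and why that list has to be as long as the number of heatmap channels (the condition checked by the failing assertion in the traceback). The five-keypoint layout, the keypoint names, and the helper build_flip_indices are hypothetical illustrations, not MMPose's own code; only the keypoint_info field names (name, id, color, type, swap) are taken from the config posted in this issue.

```python
# Hypothetical 5-keypoint layout. Each left/right keypoint names its
# mirrored partner in `swap`; keypoints on the symmetry axis keep swap=''.
keypoint_info = {
    0: dict(name='nose', id=0, color=[255, 0, 0], type='upper', swap=''),
    1: dict(name='left_eye', id=1, color=[255, 0, 0], type='upper',
            swap='right_eye'),
    2: dict(name='right_eye', id=2, color=[255, 0, 0], type='upper',
            swap='left_eye'),
    3: dict(name='left_wrist', id=3, color=[255, 0, 0], type='upper',
            swap='right_wrist'),
    4: dict(name='right_wrist', id=4, color=[255, 0, 0], type='upper',
            swap='left_wrist'),
}


def build_flip_indices(info):
    """Illustrative helper (not part of MMPose).

    Returns a list where entry i is the index of the keypoint that
    keypoint i becomes after a horizontal flip.
    """
    name2id = {kpt['name']: kpt_id for kpt_id, kpt in info.items()}
    flip_indices = []
    for kpt_id in sorted(info):
        kpt = info[kpt_id]
        # An empty swap means the keypoint maps to itself when flipped.
        partner = kpt.get('swap', '') or kpt['name']
        if partner not in name2id:
            raise KeyError(f"swap target '{partner}' is not a keypoint name")
        flip_indices.append(name2id[partner])
    return flip_indices


flip_indices = build_flip_indices(keypoint_info)

# Stand-in for the head's out_channels (21 in the config above, 5 here).
num_heatmap_channels = 5

# Same condition as the assertion in the traceback: one flip index
# per heatmap channel.
assert len(flip_indices) == num_heatmap_channels
print(flip_indices)  # [0, 2, 1, 4, 3]
```

When debugging a mismatch like this, it may also help to confirm which keypoint definition each dataset actually loads. In the config posted above, only the dataset under train_dataloader sets metainfo=dict(from_file='configs/_base_/datasets/custom.py'); the datasets under val_dataloader and test_dataloader do not, so it is worth checking whether they pick up the same 21-keypoint definition.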