open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

training on a custom dataset mmpose V1.0.0 #1941

Closed LjIA26 closed 1 year ago

LjIA26 commented 1 year ago

Thanks for your error report and we appreciate it a lot. If you feel we have helped you, give us a STAR! :satisfied:

Checklist

  1. I have searched related issues but cannot get the expected help. Yes
  2. The bug has not been fixed in the latest version. Yes

Describe the bug

I switched to the new version of mmpose, from 0.28 to 1.0, and used the same steps to create a config file and datasets as in the previous version, making sure to follow the new configuration format. However, I get the following error.

Reproduction

python tools/train.py configs/plantsv4.py --work-dir test_R/
Heatmap/Coco

Environment

Python: 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.8
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30133 for x64
GCC: n/a
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1+cu113
OpenCV: 4.7.0
MMEngine: 0.5.0
MMPose: 1.0.0rc0+c6de39c

Error traceback


01/25 16:37:13 - mmengine - INFO - Checkpoints will be saved to D:\toolkit\mmpose\test_R.
Traceback (most recent call last):
  File "D:\toolkit\mmpose\tools\train.py", line 161, in <module>
    main()
  File "D:\toolkit\mmpose\tools\train.py", line 157, in main
    runner.train()
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\runner\runner.py", line 1686, in train
    model = self.train_loop.run()  # type: ignore
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\runner\loops.py", line 90, in run
    self.run_epoch()
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\runner\loops.py", line 106, in run_epoch
    self.run_iter(idx, data_batch)
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\runner\loops.py", line 122, in run_iter
    outputs = self.runner.model.train_step(
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\model\base_model\base_model.py", line 114, in train_step
    losses = self._run_forward(data, mode='loss')  # type: ignore
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\mmengine\model\base_model\base_model.py", line 326, in _run_forward
    results = self(**data, mode=mode)
  File "C:\Users\LJ\anaconda3\envs\toolkit\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "d:\toolkit\mmpose\mmpose\models\pose_estimators\base.py", line 74, in forward
    return self.loss(inputs, data_samples)
  File "d:\toolkit\mmpose\mmpose\models\pose_estimators\topdown.py", line 109, in loss
    self.head.loss(feats, data_samples, train_cfg=self.train_cfg))
  File "d:\toolkit\mmpose\mmpose\models\heads\heatmap_heads\heatmap_head.py", line 322, in loss
    gt_heatmaps = torch.stack(
RuntimeError: stack expects each tensor to be equal size, but got [17, 64, 48] at entry 0 and [100, 64, 48] at entry 1
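
For context, the stack failure itself is just torch.stack refusing to combine tensors of different shapes, here a 17-keypoint heatmap target against a 100-keypoint one. A minimal standalone reproduction (shapes taken from the traceback above):

import torch

a = torch.zeros(17, 64, 48)    # heatmap target of an instance with 17 keypoints
b = torch.zeros(100, 64, 48)   # heatmap target of an instance with 100 keypoints
torch.stack([a, b])            # RuntimeError: stack expects each tensor to be equal size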

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Ben-Louis commented 1 year ago

Hi @LjIA26, thanks for using MMPose. Do all instances in your dataset have the same number of keypoints?

LjIA26 commented 1 year ago

Hello, yes! All of them have 100 keypoints.

Ben-Louis commented 1 year ago

The reported error indicates that some instances have 17 keypoints while others have 100 keypoints. Could you provide more details about the config file you're using so we can help find where the problem is?

LjIA26 commented 1 year ago

The base config I am using is built for the COCO dataset, which has 17 keypoints. My custom dataset has 100 keypoints; all of the images have 100 keypoints, though some of them are zero.

_base_ = [
    './_base_/default_runtime.py',
    './_base_/datasets/plants4.py'
]
# runtime
train_cfg = dict(max_epochs=5, val_interval=2)

# optimizer
optim_wrapper = dict(optimizer=dict(
    type='Adam',
    lr=5e-4,
))

# learning policy
param_scheduler = [
    dict(
        type='LinearLR', begin=0, end=500, start_factor=0.001,
        by_epoch=False),  # warm-up
    dict(
        type='MultiStepLR',
        begin=0,
        end=210,
        milestones=[170, 200],
        gamma=0.1,
        by_epoch=True)
]

# automatically scaling LR based on the actual training batch size
auto_scale_lr = dict(base_batch_size=512)

# hooks
default_hooks = dict(checkpoint=dict(save_best='coco/AP', rule='greater'))

# codec settings
codec = dict(
    type='MSRAHeatmap', input_size=(192, 256), heatmap_size=(48, 64), sigma=2)

# model settings
model = dict(
    type='TopdownPoseEstimator',
    data_preprocessor=dict(
        type='PoseDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True),
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(32, 64)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(32, 64, 128)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(32, 64, 128, 256))),
        init_cfg=None,
    ),
    head=dict(
        type='HeatmapHead',
        in_channels=32,
        out_channels=100,
        deconv_out_channels=None,
        loss=dict(type='KeypointMSELoss', use_target_weight=True),
        decoder=codec),
    test_cfg=dict(
        flip_test=True,
        flip_mode='heatmap',
        shift_heatmap=True,
    ))

# base dataset settings
dataset_type = 'CocoDataset'
data_mode = 'topdown'
data_root = 'data/plants/'

# pipelines
train_pipeline = [
    dict(type='LoadImage', file_client_args={{_base_.file_client_args}}),
    dict(type='GetBBoxCenterScale'),
    dict(type='RandomFlip', direction='horizontal'),
    dict(type='RandomHalfBody'),
    dict(type='RandomBBoxTransform'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='GenerateTarget', target_type='heatmap', encoder=codec),
    dict(type='PackPoseInputs')
]
val_pipeline = [
    dict(type='LoadImage', file_client_args={{_base_.file_client_args}}),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=codec['input_size']),
    dict(type='PackPoseInputs')
]

# data loaders
train_dataloader = dict(
    batch_size=64,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/converted-batch3.json',
        data_prefix=dict(img='images/'),
        pipeline=train_pipeline,
    ))
val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/converted-test-14.json',
       # bbox_file='data/plants/person_detection_results/'
       # 'test-bbox.json',
        data_prefix=dict(img='validation/'),
        test_mode=True,
        pipeline=val_pipeline,
    ))
test_dataloader = val_dataloader

# evaluators
val_evaluator = dict(
    type='CocoMetric',
    ann_file=data_root + 'annotations/converted-test-14.json')
test_evaluator = val_evaluator

LjIA26 commented 1 year ago

Edited previous comment to improve readability

Ben-Louis commented 1 year ago

I believe the issue is that CocoDataset utilizes the meta information of the COCO dataset, in which instances have 17 keypoints, and thus it adopts the flip_index defined for COCO. When an instance is flipped horizontally, only 17 keypoints in the same order as flip_index will remain.
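
As far as I can tell, the flip index is derived from the swap field of each keypoint in the dataset meta information file. For reference, here is a minimal sketch of what a 100-keypoint meta file such as configs/_base_/datasets/plants4.py could look like, following the structure of configs/_base_/datasets/coco.py; the names, colors, joint weights and sigmas below are placeholders, and with swap='' every keypoint maps to itself under horizontal flipping:

# sketch only: entries 2-99 are omitted and all values are placeholders
dataset_info = dict(
    dataset_name='plants4',
    keypoint_info={
        0: dict(name='kpt_0', id=0, color=[255, 128, 0], type='', swap=''),
        1: dict(name='kpt_1', id=1, color=[255, 128, 0], type='', swap=''),
        # ... one entry per keypoint, up to id=99 ...
    },
    skeleton_info={
        0: dict(link=('kpt_0', 'kpt_1'), id=0, color=[0, 255, 0]),
    },
    joint_weights=[1.0] * 100,
    sigmas=[0.05] * 100)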

In MMPose 1.0, CocoDataset will use dataset information defined in configs/_base_/datasets/coco.py by default. If you want to use another dataset meta information file such as configs/_base_/datasets/plants4.py, you need to specify it like:

train_dataloader = dict(
    batch_size=64,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        metainfo=dict(from_file='configs/_base_/datasets/plants4.py'),
        ann_file='annotations/converted-batch3.json',
        data_prefix=dict(img='images/'),
        pipeline=train_pipeline,
    ))
val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        data_mode=data_mode,
        ann_file='annotations/converted-test-14.json',
        metainfo=dict(from_file='configs/_base_/datasets/plants4.py'), 
       # bbox_file='data/plants/person_detection_results/'
       # 'test-bbox.json',
        data_prefix=dict(img='validation/'),
        test_mode=True,
        pipeline=val_pipeline,
    ))
test_dataloader = val_dataloader
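
As an optional sanity check (a sketch; parse_pose_metainfo is, as far as I know, the helper MMPose uses internally to load meta information, and the exact key names may differ slightly between versions), you can confirm that the meta file describes 100 keypoints and a 100-element flip index:

from mmpose.datasets.datasets.utils import parse_pose_metainfo

metainfo = parse_pose_metainfo(
    dict(from_file='configs/_base_/datasets/plants4.py'))
print(metainfo['num_keypoints'])      # expected: 100
print(len(metainfo['flip_indices']))  # expected: 100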

Thank you for pointing out this problem. We will refine the documents to ensure that the instructions are more explicit.

LjIA26 commented 1 year ago

Thank you! One more question. It is not printing the train accuracy data, only the validation result. How do I turn that on?

Ben-Louis commented 1 year ago

> It is not printing the train accuracy data, only the validation result

Does the program print the training loss? If not, you can try reducing the log interval by setting

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=1),
    checkpoint=dict(save_best='coco/AP', rule='greater')
)

in your config.

LjIA26 commented 1 year ago

That worked, thank you. Training is slower now, by the way, and for some reason every epoch is divided into 4 sub-epochs, which are printed like 1 [1/4], 1 [2/4], 1 [3/4], 1 [4/4]... before moving on to the next epoch. It didn't do that in the last version. Why is it different now?

jin-s13 commented 1 year ago

The 4 means you have 4 iterations in each epoch. In logger=dict(type='LoggerHook', interval=1), interval=1 means the log is printed every iteration. By default, the interval is larger than 4, so you never see those prints.
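
As a rough back-of-the-envelope check (the batch size comes from the config above; the dataset size is a hypothetical placeholder), the number of iterations per epoch is just the dataset size divided by the batch size, rounded up:

import math

num_train_images = 250   # hypothetical; use the number of instances in converted-batch3.json
batch_size = 64          # from train_dataloader in the config above

print(math.ceil(num_train_images / batch_size))  # -> 4, matching the [1/4] ... [4/4] log lines

So 4 iterations per epoch with batch_size=64 simply means the training set holds roughly 193-256 samples.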