open-mmlab / mmpose

OpenMMLab Pose Estimation Toolbox and Benchmark.
https://mmpose.readthedocs.io/en/latest/
Apache License 2.0

[Bug] 'PoseLocalVisualizer is not in the visualizer registry' #2518

korneliaWatson closed this issue 1 year ago

korneliaWatson commented 1 year ago

Prerequisite

Environment

/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/mmcv/utils/env.py", line 72, in collect_env
    from mmcv.ops import get_compiler_version, get_compiling_cuda_version
  File "/opt/conda/lib/python3.10/site-packages/mmcv/ops/__init__.py", line 2, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/opt/conda/lib/python3.10/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/opt/conda/lib/python3.10/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /opt/conda/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops6narrow4callERKNS_6TensorElll

mmcv 2.0.1 mmdet 3.1.0 mmengine 0.8.0

Reproduces the problem - code sample

python mmpose/tools/train.py /home/user/project/mmpose/configs/posturescan_2d_keypoint/training_config_front.py

Reproduces the problem - command or script

python mmpose/tools/train.py /home/user/project/mmpose/configs/posturescan_2d_keypoint/training_config_front.py

Reproduces the problem - error message

  return torch._C._cuda_getDeviceCount() > 0
07/05 15:14:53 - mmengine - WARNING - Failed to import `None.registry` make sure the registry.py exists in `None` package.
07/05 15:14:53 - mmengine - WARNING - Failed to search registry with scope "mmpose" in the "log_processor" registry tree. As a workaround, the current "log_processor" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpose" is a correct scope, or whether the registry is initialized.
07/05 15:14:53 - mmengine - INFO - 
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
    CUDA available: False
    numpy_random_seed: 617966571
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    PyTorch: 2.0.1+cu117
    PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.15.2+cu117
    OpenCV: 4.6.0
    MMEngine: 0.8.0

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 617966571
    Distributed launcher: none
    Distributed training: False
    GPU number: 1
------------------------------------------------------------

07/05 15:14:53 - mmengine - INFO - Config:
default_scope = 'mmpose'
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(
        type='CheckpointHook',
        interval=10,
        save_best='coco/AP',
        rule='greater'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='PoseVisualizationHook', enable=False))
custom_hooks = [
    dict(type='SyncBuffersHook'),
]
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='PoseLocalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ],
    name='visualizer')
log_processor = dict(
    type='LogProcessor', window_size=50, by_epoch=True, num_digits=6)
log_level = 'INFO'
load_from = None
resume = False
backend_args = dict(backend='local')
train_cfg = dict(by_epoch=True, max_epochs=210, val_interval=10)
val_cfg = dict()
test_cfg = dict()
optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005))
param_scheduler = [
    dict(
        type='LinearLR', begin=0, end=500, start_factor=0.001, by_epoch=False),
    dict(
        type='MultiStepLR',
        begin=0,
        end=210,
        milestones=[
            170,
            200,
        ],
        gamma=0.1,
        by_epoch=True),
]
auto_scale_lr = dict(base_batch_size=512)
codec = dict(
    type='MSRAHeatmap',
    input_size=(
        192,
        256,
    ),
    heatmap_size=(
        48,
        64,
    ),
    sigma=2)
model = dict(
    type='TopdownPoseEstimator',
    data_preprocessor=dict(
        type='PoseDataPreprocessor',
        mean=[
            123.675,
            116.28,
            103.53,
        ],
        std=[
            58.395,
            57.12,
            57.375,
        ],
        bgr_to_rgb=True),
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(
                    4,
                    4,
                ),
                num_channels=(
                    32,
                    64,
                )),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(
                    4,
                    4,
                    4,
                ),
                num_channels=(
                    32,
                    64,
                    128,
                )),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(
                    4,
                    4,
                    4,
                    4,
                ),
                num_channels=(
                    32,
                    64,
                    128,
                    256,
                ))),
        init_cfg=dict(
            type='Pretrained',
            checkpoint=
            'https://download.openmmlab.com/mmpose/pretrain_models/hrnet_w32-36af842e.pth'
        )),
    head=dict(
        type='HeatmapHead',
        in_channels=32,
        out_channels=39,
        deconv_out_channels=None,
        loss=dict(type='KeypointMSELoss', use_target_weight=True),
        decoder=dict(
            type='MSRAHeatmap',
            input_size=(
                192,
                256,
            ),
            heatmap_size=(
                48,
                64,
            ),
            sigma=2)),
    test_cfg=dict(flip_test=True, flip_mode='heatmap', shift_heatmap=True))
dataset_type = 'CocoPostureScanFront'
data_mode = 'topdown'
data_root = 'coco_ps'
train_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='RandomFlip', direction='horizontal'),
    dict(type='RandomHalfBody'),
    dict(type='RandomBBoxTransform'),
    dict(type='TopdownAffine', input_size=(
        192,
        256,
    )),
    dict(
        type='GenerateTarget',
        encoder=dict(
            type='MSRAHeatmap',
            input_size=(
                192,
                256,
            ),
            heatmap_size=(
                48,
                64,
            ),
            sigma=2)),
    dict(type='PackPoseInputs'),
]
val_pipeline = [
    dict(type='LoadImage'),
    dict(type='GetBBoxCenterScale'),
    dict(type='TopdownAffine', input_size=(
        192,
        256,
    )),
    dict(type='PackPoseInputs'),
]
train_dataloader = dict(
    batch_size=64,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='CocoPostureScanFront',
        data_root='coco_ps',
        data_mode='topdown',
        ann_file='front_train.json',
        data_prefix=dict(img='images/'),
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='RandomFlip', direction='horizontal'),
            dict(type='RandomHalfBody'),
            dict(type='RandomBBoxTransform'),
            dict(type='TopdownAffine', input_size=(
                192,
                256,
            )),
            dict(
                type='GenerateTarget',
                encoder=dict(
                    type='MSRAHeatmap',
                    input_size=(
                        192,
                        256,
                    ),
                    heatmap_size=(
                        48,
                        64,
                    ),
                    sigma=2)),
            dict(type='PackPoseInputs'),
        ]))
val_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type='CocoPostureScanFront',
        data_root='coco_ps',
        data_mode='topdown',
        ann_file='front_val.json',
        bbox_file=None,
        data_prefix=dict(img='images/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='TopdownAffine', input_size=(
                192,
                256,
            )),
            dict(type='PackPoseInputs'),
        ]))
test_dataloader = dict(
    batch_size=32,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
    dataset=dict(
        type='CocoPostureScanFront',
        data_root='coco_ps',
        data_mode='topdown',
        ann_file='front_val.json',
        bbox_file=None,
        data_prefix=dict(img='images/'),
        test_mode=True,
        pipeline=[
            dict(type='LoadImage'),
            dict(type='GetBBoxCenterScale'),
            dict(type='TopdownAffine', input_size=(
                192,
                256,
            )),
            dict(type='PackPoseInputs'),
        ]))
val_evaluator = dict(type='CocoMetric', ann_file='coco_ps/front_val.json')
test_evaluator = dict(type='CocoMetric', ann_file='coco_ps/front_val.json')
work_dir = 'mmpose/work_dirs/front/hrnet_w32_coco_256x192'
seed = 0
launcher = 'none'

07/05 15:14:53 - mmengine - WARNING - Failed to import `None.registry` make sure the registry.py exists in `None` package.
07/05 15:14:53 - mmengine - WARNING - Failed to search registry with scope "mmpose" in the "visualizer" registry tree. As a workaround, the current "visualizer" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmpose" is a correct scope, or whether the registry is initialized.
Traceback (most recent call last):
  File "/home/user/project/mmpose/tools/train.py", line 160, in <module>
    main()
  File "/home/user/project/mmpose/tools/train.py", line 153, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 443, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 397, in __init__
    self.visualizer = self.build_visualizer(visualizer)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 784, in build_visualizer
    return VISUALIZERS.build(visualizer)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 100, in build_from_cfg
    raise KeyError(
KeyError: 'PoseLocalVisualizer is not in the visualizer registry. Please check whether the value of `PoseLocalVisualizer` is correct or it was registered as expected. More details can be found at https://mmengine.readthedocs.io/en/latest/advanced_tutorials/config.html#import-the-custom-module'

Additional information

I had the training script working previously. All I did was remove my virtual environment to do a fresh install of the dependencies. Since then, I haven't been able to get the script working, and I've run into a range of issues that have me going around in circles.

I've had several issues getting the dependencies to work together. After much trial and error, these are the versions that finally installed without errors.

apache-airflow==2.6.1
black==23.3.0
fiftyone==0.20.1
ipywidgets==8.0.6
matplotlib==3.7.1
mmcv==2.0.1
mmdet==3.1.0
mmengine==0.8.0
munkres==1.1.4
numpy==1.23.5
opencv-python==4.6.0.66
opencv-python-headless==4.7.0.72
pandas==2.0.1
Pillow==9.5.0
pydantic==1.10.11
scikit-learn==1.2.2
scipy==1.10.1
seaborn==0.12.2
torch==2.0.1
torchaudio==2.0.2
torchvision==0.15.2
xtcocotools==1.13
PyYAML==6.0

Output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

I would appreciate help with getting this issue resolved. Thank you.

Ben-Louis commented 1 year ago

Hi, thanks for using MMPose. You may need to install mmpose before training via pip install -e /home/user/project/mmpose
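
The KeyError above is consistent with mmpose not being importable in the fresh environment: mmengine can only resolve PoseLocalVisualizer once the mmpose package has been imported and its classes registered. As a minimal sketch (the helper name locate_package is hypothetical, not part of any OpenMMLab API), the following standard-library check reports where each package would be imported from, without actually importing it:

```python
import importlib.util


def locate_package(name):
    """Return where a top-level package would be imported from, or None.

    Hypothetical helper for illustration; it only inspects the import
    system and does not import the package itself.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no single origin file; list their search paths.
    return ", ".join(spec.submodule_search_locations or [])


if __name__ == "__main__":
    for pkg in ("mmengine", "mmcv", "mmpose"):
        print(pkg, "->", locate_package(pkg) or "NOT IMPORTABLE")
```

After an editable install, the mmpose line would be expected to point into /home/user/project/mmpose rather than report NOT IMPORTABLE.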

korneliaWatson commented 1 year ago

Right, thank you. After running pip install -e /home/user/project/mmpose I now get another error.

/opt/conda/lib/python3.10/site-packages/mmcv/cnn/bricks/transformer.py:33: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv`` rather than ``mmcv-lite`` if you need this module. 
  warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
07/06 04:13:29 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
Traceback (most recent call last):
  File "/home/user/project/mmpose/tools/train.py", line 160, in <module>
    main()
  File "/home/user/project/mmpose/tools/train.py", line 153, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 443, in from_cfg
    runner = cls(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 412, in __init__
    self.model = self.wrap_model(
  File "/opt/conda/lib/python3.10/site-packages/mmengine/runner/runner.py", line 862, in wrap_model
    model = revert_sync_batchnorm(model)
  File "/opt/conda/lib/python3.10/site-packages/mmengine/model/utils.py", line 174, in revert_sync_batchnorm
    from mmcv.ops import SyncBatchNorm
  File "/opt/conda/lib/python3.10/site-packages/mmcv/ops/__init__.py", line 2, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/opt/conda/lib/python3.10/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/opt/conda/lib/python3.10/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /opt/conda/lib/python3.10/site-packages/mmcv/_ext.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops6narrow4callERKNS_6TensorElll

I have mmcv installed, though, version 2.0.1.

Ben-Louis commented 1 year ago

According to the mmcv FAQ, there might be a mismatch between your PyTorch version and the one mmcv was compiled against. We suggest reinstalling mmcv or building it from source.
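
For context, the undefined symbol in the ImportError demangles to at::_ops::narrow::call(at::Tensor const&, long, long, long), a PyTorch function that the compiled mmcv extension expects but the installed PyTorch does not export with that signature. A rough way to confirm the state of an environment before and after reinstalling is sketched below; the helper names installed_version and try_import are hypothetical, and the only mmcv-specific assumption is the module name mmcv._ext, which is taken from the traceback:

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def try_import(module_name):
    """Try to import a module; return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except Exception as exc:  # ImportError covers undefined-symbol failures
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    for dist in ("torch", "mmcv", "mmengine", "mmpose"):
        print(dist, installed_version(dist))
    # mmcv._ext is the compiled extension named in the traceback.
    ok, err = try_import("mmcv._ext")
    print("mmcv._ext importable:", ok, err)
```

If the last line still reports an undefined symbol after reinstalling, the mmcv build in use was most likely still compiled against a different PyTorch version.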

korneliaWatson commented 1 year ago

Hi, thank you. That solved the problem.

I'm now experiencing a new one... Training runs for the first 10 epochs, but once a checkpoint is saved, it stops with the output Killed. I looked at the training config but I'm not seeing anything that would indicate it's supposed to terminate. Any ideas why that might be happening?

07/11 11:43:15 - mmengine - INFO - Epoch(val) [10][10/10]    coco/AP: 1.000000  coco/AP .5: 1.000000  coco/AP .75: 1.000000  coco/AP (M): 1.000000  coco/AP (L): 1.000000  coco/AR: 1.000000  coco/AR .5: 1.000000  coco/AR .75: 1.000000  coco/AR (M): 1.000000  coco/AR (L): 1.000000  data_time: 0.056178  time: 6.020677
07/11 11:43:19 - mmengine - INFO - The best checkpoint with 1.0000 coco/AP at 10 epoch is saved to best_coco_AP_epoch_10.pth.
Killed

Ben-Louis commented 1 year ago

On Linux, the kernel can kill a process when the system runs out of memory, and the shell then prints Killed. To address this issue, monitor the memory usage of the training process and check whether an OOM (out of memory) condition occurred.
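
One way to do that monitoring is sketched below, assuming a Linux host where /proc/meminfo is readable. The function names parse_meminfo and available_gib are hypothetical; the file reports sizes in kB, so dividing by 1024 ** 2 gives GiB. Calling available_gib() periodically, for example from a separate terminal while training runs, would show whether free memory collapses around the validation and checkpoint step.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Size entries are reported in kB; counters such as HugePages_Total
    have no unit and are stored as plain integers.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_gib(path="/proc/meminfo"):
    """Return MemAvailable in GiB, read from a Linux meminfo file."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] / (1024 ** 2)


if __name__ == "__main__":
    sample = "MemTotal:       16384000 kB\nMemAvailable:    8388608 kB\n"
    print(parse_meminfo(sample))
```

After a run ends with Killed, the kernel log (for example via dmesg) usually records whether the OOM killer terminated the process.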