Closed WinnerMeat closed 1 year ago
06/14 10:13:13 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: win32
Python: 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 15:53:35) [MSC v.1929 64 bit (AMD64)]
CUDA available: True
numpy_random_seed: 1034294422
GPU 0,1: NVIDIA GeForce RTX 3090
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.55
MSVC: Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30148 for x64
GCC: n/a
PyTorch: 1.13.1+cu116
PyTorch compiling details: PyTorch built with:
- C++ Version: 199711
- MSVC 192829337
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 2019
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX2
- CUDA Runtime 11.6
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.14.1+cu116
OpenCV: 4.7.0
MMEngine: 0.7.3
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
Distributed launcher: none
Distributed training: False
GPU number: 1
------------------------------------------------------------
06/14 10:13:14 - mmengine - INFO - Config:
default_scope = 'mmpose'
default_hooks = dict(
timer=dict(type='IterTimerHook'),
logger=dict(type='LoggerHook', interval=50),
param_scheduler=dict(type='ParamSchedulerHook'),
checkpoint=dict(
type='CheckpointHook',
interval=10,
save_best='crowdpose/AP',
rule='greater'),
sampler_seed=dict(type='DistSamplerSeedHook'),
visualization=dict(type='PoseVisualizationHook', enable=False))
custom_hooks = [dict(type='SyncBuffersHook')]
env_cfg = dict(
cudnn_benchmark=False,
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
dist_cfg=dict(backend='nccl'))
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='PoseLocalVisualizer',
vis_backends=[dict(type='LocalVisBackend')],
name='visualizer')
log_processor = dict(
type='LogProcessor', window_size=50, by_epoch=True, num_digits=6)
log_level = 'INFO'
load_from = None
resume = False
backend_args = dict(backend='local')
train_cfg = dict(by_epoch=True, max_epochs=100, val_interval=1)
val_cfg = dict()
test_cfg = dict()
optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005))
param_scheduler = [
dict(
type='LinearLR', begin=0, end=500, start_factor=0.001, by_epoch=False),
dict(
type='MultiStepLR',
begin=0,
end=100,
milestones=[170, 200],
gamma=0.1,
by_epoch=True)
]
auto_scale_lr = dict(base_batch_size=512, enable=True)
codec = dict(
type='RegressionLabel',
input_size=(256, 256),
heatmap_size=(64, 64),
sigma=2.0,
normalize=True)
model = dict(
type='TopdownPoseEstimator',
data_preprocessor=dict(
type='PoseDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True),
backbone=dict(type='ResNet', depth=50),
head=dict(
type='DSNTHead',
in_channels=2048,
in_featuremap_size=(8, 8),
num_joints=14,
loss=dict(
type='MultipleLossWrapper',
losses=[
dict(type='SmoothL1Loss', use_target_weight=True),
dict(type='JSDiscretLoss', use_target_weight=True)
]),
decoder=dict(
type='RegressionLabel',
input_size=(256, 256),
heatmap_size=(64, 64),
sigma=2.0,
normalize=True)),
test_cfg=dict(flip_test=True, shift_coords=True, shift_heatmap=True),
init_cfg=dict(
type='Pretrained',
checkpoint=
'https://download.openmmlab.com/mmpose/pretrain_models/td-hm_res50_8xb64-210e_coco-256x192.pth'
))
dataset_type = 'CrowdPoseDataset'
data_mode = 'topdown'
data_root = 'data/crowdpose/'
train_pipeline = [
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='RandomFlip', direction='horizontal'),
dict(type='RandomHalfBody'),
dict(type='RandomBBoxTransform'),
dict(type='TopdownAffine', input_size=(256, 256)),
dict(
type='GenerateTarget',
encoder=dict(
type='RegressionLabel',
input_size=(256, 256),
heatmap_size=(64, 64),
sigma=2.0,
normalize=True)),
dict(type='PackPoseInputs')
]
test_pipeline = [
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='TopdownAffine', input_size=(256, 256)),
dict(type='PackPoseInputs')
]
train_dataloader = dict(
batch_size=2,
num_workers=2,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='CrowdPoseDataset',
data_root='../data_set/crowdpose/',
data_mode='topdown',
ann_file='annotations/mmpose_crowdpose_train.json',
data_prefix=dict(img='images/'),
pipeline=[
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='RandomFlip', direction='horizontal'),
dict(type='RandomHalfBody'),
dict(type='RandomBBoxTransform'),
dict(type='TopdownAffine', input_size=(256, 256)),
dict(
type='GenerateTarget',
encoder=dict(
type='RegressionLabel',
input_size=(256, 256),
heatmap_size=(64, 64),
sigma=2.0,
normalize=True)),
dict(type='PackPoseInputs')
]))
val_dataloader = dict(
batch_size=2,
num_workers=2,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
dataset=dict(
type='CrowdPoseDataset',
data_root='../data_set/crowdpose/',
data_mode='topdown',
ann_file='annotations/mmpose_crowdpose_test.json',
bbox_file=
'../data_set/crowdpose/annotations/det_for_crowd_test_0.1_0.5.json',
data_prefix=dict(img='images/'),
test_mode=True,
pipeline=[
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='TopdownAffine', input_size=(256, 256)),
dict(type='PackPoseInputs')
]))
test_dataloader = dict(
batch_size=32,
num_workers=2,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
dataset=dict(
type='CrowdPoseDataset',
data_root='../data_set/crowdpose/',
data_mode='topdown',
ann_file='annotations/mmpose_crowdpose_test.json',
bbox_file=
'../data_set/crowdpose/annotations/det_for_crowd_test_0.1_0.5.json',
data_prefix=dict(img='images/'),
test_mode=True,
pipeline=[
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='TopdownAffine', input_size=(256, 256)),
dict(type='PackPoseInputs')
]))
val_evaluator = dict(
type='CocoMetric',
ann_file='../data_set/crowdpose/annotations/mmpose_crowdpose_test.json')
test_evaluator = dict(
type='CocoMetric',
ann_file='../data_set/crowdpose/annotations/mmpose_crowdpose_test.json')
launcher = 'none'
work_dir = './myresult/ipr_DSNT_crowdpose/'
Since you use a very small batch size, we suggest lowering the learning rate accordingly. Note that a small batch size will also make the BN layers unstable.
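Concretely, the linear scaling rule that `auto_scale_lr` applies can be sketched as follows (a minimal illustration; the batch sizes are examples, while `base_lr=5e-4` and `base_batch_size=512` match the config above):

```python
def scaled_lr(base_lr, base_batch_size, actual_batch_size):
    """Linear LR scaling rule: shrink or grow the LR with the batch size."""
    return base_lr * actual_batch_size / base_batch_size

# With the config's base_lr=5e-4 and base_batch_size=512:
print(scaled_lr(5e-4, 512, 2))   # batch_size=2  -> about 1.95e-06
print(scaled_lr(5e-4, 512, 64))  # batch_size=64 -> 6.25e-05
```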
@Ben-Louis I followed your suggestions and tried a larger batch size (batch_size=64) with a correspondingly adjusted learning rate, but still got the same result.
06/25 16:44:25 - mmengine - INFO - Epoch(train) [1][ 50/282] lr: 6.193637e-06 eta: 5:18:16 time: 0.678371 data_time: 0.236657 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:44:47 - mmengine - INFO - Epoch(train) [1][100/282] lr: 1.244990e-05 eta: 4:22:11 time: 0.441295 data_time: 0.075030 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:45:08 - mmengine - INFO - Epoch(train) [1][150/282] lr: 1.870616e-05 eta: 4:00:07 time: 0.421244 data_time: 0.040561 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:45:29 - mmengine - INFO - Epoch(train) [1][200/282] lr: 2.496242e-05 eta: 3:48:34 time: 0.418337 data_time: 0.027817 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:45:50 - mmengine - INFO - Epoch(train) [1][250/282] lr: 3.121869e-05 eta: 3:42:06 time: 0.424828 data_time: 0.046276 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:46:03 - mmengine - INFO - Exp name: ipr_res50_dsnt-8xb64-210e_coco-256x256_20230625_164329
06/25 16:46:20 - mmengine - INFO - Epoch(val) [1][ 50/1008] eta: 0:05:20 time: 0.334142 data_time: 0.205168 memory: 8585
06/25 16:48:56 - mmengine - INFO - Epoch(val) [1][1000/1008] eta: 0:00:01 time: 0.168518 data_time: 0.047292 memory: 1005
06/25 16:49:00 - mmengine - INFO - Evaluating CocoMetric...
Loading and preparing results...
DONE (t=0.20s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *keypoints_crowd*
DONE (t=2.57s).
Accumulating evaluation results...
DONE (t=0.08s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.277
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.277
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.277
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.272
Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.272
Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.272
Average Precision (AP) @[ IoU=0.50:0.95 | type= easy | maxDets= 20 ] = 0.416
Average Precision (AP) @[ IoU=0.50:0.95 | type=medium | maxDets= 20 ] = 0.257
Average Precision (AP) @[ IoU=0.50:0.95 | type= hard | maxDets= 20 ] = 0.247
06/25 16:49:06 - mmengine - INFO - Epoch(val) [1][1008/1008] crowdpose/AP: 0.277228 crowdpose/AP .5: 0.277228 crowdpose/AP .75: 0.277228 crowdpose/AR: 0.272014 crowdpose/AR .5: 0.272014 crowdpose/AR .75: 0.272014 crowdpose/AP(E): 0.415800 crowdpose/AP(M): 0.257400 crowdpose/AP(H): 0.247500 data_time: 0.048809 time: 0.172888
06/25 16:49:09 - mmengine - INFO - The best checkpoint with 0.2772 crowdpose/AP at 1 epoch is saved to best_crowdpose_AP_epoch_1.pth.
06/25 16:49:31 - mmengine - INFO - Epoch(train) [2][ 50/282] lr: 4.147896e-05 eta: 3:35:30 time: 0.440419 data_time: 0.074551 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
06/25 16:49:52 - mmengine - INFO - Epoch(train) [2][100/282] lr: 4.773522e-05 eta: 3:32:49 time: 0.426083 data_time: 0.047690 memory: 8585 loss: nan loss_kpt: nan acc_pose: 0.000000
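A generic way to locate where the NaN first appears (a debugging sketch, not part of the original run) is PyTorch's autograd anomaly detection, which makes the backward pass raise an error at the operation that produces a NaN gradient:

```python
import torch

# Enable before training. Every backward pass is then checked, and the
# operation that yields a NaN gradient raises a RuntimeError naming it
# (this is slow, so use it for debugging only).
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
try:
    torch.sqrt(x).sum().backward()  # sqrt of a negative value -> NaN gradient
except RuntimeError as e:
    print("anomaly detected:", e)
```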
Sorry, but I tried DSNT on CrowdPose with batch size 64 on both 1 GPU and 8 GPUs, and the training loss was normal in both cases. Did you modify any code?
@Ben-Louis OrderedDict([('sys.platform', 'win32'), ('Python', '3.8.16 (default, Jun 12 2023, 21:00:42) [MSC v.1916 64 bit (AMD64)]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1', 'NVIDIA GeForce RTX 3080'), ('CUDA_HOME', 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4'), ('NVCC', 'Cuda compilation tools, release 11.4, V11.4.120'), ('MSVC', 'Microsoft (R) C/C++ Optimizing Compiler Version 19.33.31630 for x64'), ('GCC', 'n/a'), ('PyTorch', '1.13.1+cu116'), ('PyTorch compiling details', 'PyTorch built with:\n - C++ Version: 199711\n - MSVC 192829337\n - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 2019\n - LAPACK is enabled (usually provided by MKL)\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.5.4\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, 
USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.14.1+cu116'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.4'), ('MMPose', '1.0.0+unknown')])
_base_ = ['../../../_base_/default_runtime.py']
# runtime
train_cfg = dict(max_epochs=100, val_interval=1)
# optimizer
optim_wrapper = dict(optimizer=dict(
type='Adam',
lr=5e-4,
))
# learning policy
param_scheduler = [
dict(
type='LinearLR', begin=0, end=500, start_factor=0.001,
by_epoch=False), # warm-up
dict(
type='MultiStepLR',
begin=0,
end=train_cfg['max_epochs'],
milestones=[170, 200],
gamma=0.1,
by_epoch=True)
]
# automatically scaling LR based on the actual training batch size
auto_scale_lr = dict(base_batch_size=512)
# codec settings
codec = dict(
type='IntegralRegressionLabel',
input_size=(256, 256),
heatmap_size=(64, 64),
sigma=2.0,
normalize=True)
# model settings
model = dict(
type='TopdownPoseEstimator',
data_preprocessor=dict(
type='PoseDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True),
backbone=dict(
type='ResNet',
depth=50,
),
head=dict(
type='DSNTHead',
in_channels=2048,
in_featuremap_size=(8, 8),
num_joints=14,
loss=dict(
type='MultipleLossWrapper',
losses=[
dict(type='SmoothL1Loss', use_target_weight=True),
dict(type='JSDiscretLoss', use_target_weight=True)
]),
decoder=codec),
test_cfg=dict(
flip_test=True,
shift_coords=True,
shift_heatmap=True,
),
init_cfg=dict(
type='Pretrained',
checkpoint='https://download.openmmlab.com/mmpose/'
'pretrain_models/td-hm_res50_8xb64-210e_coco-256x192.pth'))
# base dataset settings
dataset_type = 'CrowdPoseDataset'
data_mode = 'topdown'
data_root = '../data_set/crowdpose/'
# pipelines
train_pipeline = [
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='RandomFlip', direction='horizontal'),
dict(type='RandomHalfBody'),
dict(type='RandomBBoxTransform'),
dict(type='TopdownAffine', input_size=codec['input_size']),
dict(type='GenerateTarget', encoder=codec),
dict(type='PackPoseInputs')
]
test_pipeline = [
dict(type='LoadImage'),
dict(type='GetBBoxCenterScale'),
dict(type='TopdownAffine', input_size=codec['input_size']),
dict(type='PackPoseInputs')
]
# data loaders
train_dataloader = dict(
batch_size=64,
num_workers=2,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type=dataset_type,
data_root=data_root,
data_mode=data_mode,
ann_file='annotations/mmpose_crowdpose_train.json',
data_prefix=dict(img='images/'),
pipeline=train_pipeline,
))
val_dataloader = dict(
batch_size=32,
num_workers=2,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False, round_up=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
data_mode=data_mode,
ann_file='annotations/mmpose_crowdpose_test.json',
bbox_file='../data_set/crowdpose/annotations/det_for_crowd_test_0.1_0.5.json',
data_prefix=dict(img='images/'),
test_mode=True,
pipeline=test_pipeline,
))
test_dataloader = val_dataloader
# hooks
default_hooks = dict(checkpoint=dict(save_best='crowdpose/AP', rule='greater'))
# evaluators
val_evaluator = dict(
type='CocoMetric',
ann_file='../data_set/crowdpose/annotations/mmpose_crowdpose_test.json',
use_area=False,
iou_type='keypoints_crowd',
prefix='crowdpose')
test_evaluator = val_evaluator
I didn't change the code. I assumed it was a machine problem, but when I tested again on a new machine I got the same result. The above is my machine environment and configuration file.
Not sure if there is another workaround.
Maybe you could try replacing 'annotations/mmpose_crowdpose_train.json' with 'annotations/mmpose_crowdpose_trainval.json'? We generally use the latter to train models on CrowdPose.
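It may also be worth scanning the annotation file for degenerate bounding boxes, since a zero-width or zero-height box can lead to division by zero in the affine transform and then to NaN losses. A minimal sketch, assuming the COCO-style `bbox = [x, y, w, h]` layout that CrowdPose annotations use:

```python
import json

def find_degenerate_bboxes(ann_path):
    """Return ids of annotations whose bbox has non-positive width or height."""
    with open(ann_path) as f:
        anns = json.load(f)["annotations"]
    return [a["id"] for a in anns
            if a.get("bbox") and (a["bbox"][2] <= 0 or a["bbox"][3] <= 0)]

# e.g. find_degenerate_bboxes('annotations/mmpose_crowdpose_train.json')
```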
Prerequisite
Environment
OrderedDict([('sys.platform', 'win32'), ('Python', '3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 15:53:35) [MSC v.1929 64 bit (AMD64)]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6'), ('NVCC', 'Cuda compilation tools, release 11.6, V11.6.55'), ('MSVC', 'Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30148 for x64'), ('GCC', 'n/a'), ('PyTorch', '1.13.1+cu116'), ('PyTorch compiling details', 'PyTorch built with:\n - C++ Version: 199711\n - MSVC 192829337\n - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 2019\n - LAPACK is enabled (usually provided by MKL)\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.5.4\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, 
USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.14.1+cu116'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.3'), ('MMPose', '1.0.0+')])
Reproduces the problem - code sample
Reproduces the problem - command or script
Reproduces the problem - error message
Additional information
I tried to train the model (configs/body_2d_keypoint/integral_regression/crowdpose/ipr_res50_dsnt-8xb64-210e_coco-256x256.py) on the CrowdPose dataset, but the training losses were all invalid (NaN). In addition, training the same model on the COCO 2017 dataset yields the same result. While looking for the cause of the error, I found that after the first gradient update, all of the model's weight values are NaN. I didn't find a solution to this problem, so I filed this bug report, hoping it will be answered.
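To confirm at which point the weights become NaN, one generic check (the model and optimizer below are stand-ins for illustration, not the mmpose runner) is to scan the parameters right after each `optimizer.step()`:

```python
import torch
import torch.nn as nn

def nan_parameters(model: nn.Module):
    """Names of parameters that contain at least one NaN value."""
    return [name for name, p in model.named_parameters()
            if torch.isnan(p).any()]

# Stand-in model/optimizer to show the usage pattern:
model = nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=5e-4)
model(torch.randn(8, 4)).sum().backward()
opt.step()
print(nan_parameters(model))  # empty list while the weights are still finite
```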