hustvl / VMA

A general map auto annotation framework based on MapTR, with high flexibility in terms of spatial scale and element type
MIT License

KeyError: Caught KeyError in DataLoader worker process 0. #3

Closed: 363546178 closed this issue 12 months ago

363546178 commented 1 year ago

Hello! I am using the SD dataset to train the line model. After preparing the dataset, the training command I run is: ./tools/dist_train.sh ./projects/configs/vma_res152_e80_line.py 1

The error is: KeyError: Caught KeyError in DataLoader worker process 0

I tried regenerating the SD line dataset following docs/prepare_dataset.md, but the problem still occurs. How can I troubleshoot the cause of this error? Thank you! (A minimal sketch of the collate pattern that fails is included after the log below.)

The full error log is:

2023-09-11 16:55:43,906 - mmdet - INFO - Saving checkpoint at 5 epochs
[ ] 0/7, elapsed: 0s, ETA:
Traceback (most recent call last):
  File "./tools/train.py", line 261, in <module>
    main()
  File "./tools/train.py", line 250, in main
    custom_train_model(
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
    custom_train_detector(
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 212, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 78, in _do_evaluate
    results = custom_multi_gpu_test(
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/test.py", line 71, in custom_multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in iCurb_collate
    data['seq'] = [x[0] for x in batch]
  File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in <listcomp>
    data['seq'] = [x[0] for x in batch]
KeyError: 0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 82570) of binary: /home/pc01/anaconda3/envs/vma/bin/python
Traceback (most recent call last):
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


    ./tools/train.py FAILED

=======================================
Root Cause:
[0]:
  time: 2023-09-11_16:55:52
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 82570)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"

Other Failures:

***************************************
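For context, here is a minimal sketch of why `data['seq'] = [x[0] for x in batch]` can raise `KeyError: 0`: if each sample in `batch` is a dict rather than a tuple or list, integer indexing fails. The sample structure and key names below are guesses for illustration, not the actual output of the SD line dataset.

```python
# Minimal reproduction of the failing collate pattern (illustrative only).
batch = [
    {"seq": "frame_000", "img": None},   # hypothetical dict-style sample
    {"seq": "frame_001", "img": None},
]

# What iCurb_collate does at builder.py:173 -- integer indexing on a dict
# sample raises KeyError: 0, because 0 is not a key of the dict.
try:
    seqs = [x[0] for x in batch]
except KeyError as e:
    print("KeyError:", e)                # -> KeyError: 0

# A dict-aware variant would index by key instead of by position:
seqs = [x["seq"] for x in batch]
print(seqs)                              # -> ['frame_000', 'frame_001']
```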
zyc10ud commented 1 year ago

Thank you for pointing out the issue in our code. I have now fixed the problem, and you can pull the latest code and give it a try.

363546178 commented 1 year ago

Hello! I am running the updated code, and the crash from before no longer occurs. However, there seems to be a problem with the training results: the evaluation metrics are all 0. What could be the reason for this? I am training on a single RTX 3090. The config changes, system information, and log are as follows:

Config changes (a short note on learning-rate scaling follows the snippet):

data = dict(
    samples_per_gpu=1,  # 2
    workers_per_gpu=1,  # 8

optimizer = dict(
    type='AdamW',
    lr=1.225e-5,
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.1),
        }),
    weight_decay=0.005)  # 0.01

# learning policy

lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=1000,  # 500
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-3)
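For reference, reducing samples_per_gpu and the learning rate together follows the usual linear scaling rule (learning rate proportional to effective batch size). The sketch below only illustrates that rule; the base learning rate and the 8-GPU x 2-sample baseline are assumptions, not the actual defaults of vma_res152_e80_line.py.

```python
# Sketch of the linear learning-rate scaling rule (base values are assumptions).
def scale_lr(base_lr: float, base_total_batch: int, gpus: int, samples_per_gpu: int) -> float:
    """Scale the learning rate proportionally to the effective batch size."""
    effective_batch = gpus * samples_per_gpu
    return base_lr * effective_batch / base_total_batch

# Hypothetical baseline tuned for 8 GPUs x 2 samples, re-scaled for
# a single RTX 3090 running with samples_per_gpu=1.
base_lr = 2e-4                       # assumed base learning rate, for illustration
print(scale_lr(base_lr, base_total_batch=16, gpus=1, samples_per_gpu=1))  # 1.25e-05
```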

system information:

2023-09-12 16:11:31,322 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
PyTorch: 1.9.1+cu111
PyTorch compiling details: PyTorch built with:
TorchVision: 0.10.1+cu111
OpenCV: 4.8.0
MMCV: 1.4.0
MMCV Compiler: GCC 9.5
MMCV CUDA Compiler: 11.1
MMDetection: 2.14.0
MMSegmentation: 0.14.1
MMDetection3D: 0.17.2+607562c

The log is as follows (a brief note on the chamfer matching thresholds follows it):

Cls data formatting done in 1.046210s!! with /home/pc01/code/VMA/val/work_dirs/vma_res152_e80_line/Tue_Sep_12_16_11_37_2023/pts_bbox/cls_formatted.pkl
----------use metric:chamfer----------
----------threshhold:6----------
cls:lane done in 0.963024s!!
cls:curb done in 0.016371s!!
cls:stopline done in 0.001682s!!

+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+

+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
----------threshhold:15----------
cls:lane done in 0.964571s!!
cls:curb done in 0.016190s!!
cls:stopline done in 0.001796s!!

+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+

+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
----------threshhold:30----------
cls:lane done in 0.962664s!!
cls:curb done in 0.017725s!!
cls:stopline done in 0.001489s!!

+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+

+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
lane: 0.0
curb: 0.0
stopline: 0.0
map: 0.0
2023-09-12 16:36:08,917 - mmdet - INFO - Exp name: vma_res152_e80_line.py
2023-09-12 16:36:08,917 - mmdet - INFO - Epoch(val) [80][7] SD_Line_Map_chamfer/lane_AP: 0.0000, SD_Line_Map_chamfer/curb_AP: 0.0000, SD_Line_Map_chamfer/stopline_AP: 0.0000, SD_Line_Map_chamfer/mAP: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_6: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_15: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_30: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_6: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_15: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_30: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_6: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_15: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_30: 0.0000
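For readers unfamiliar with the metric: the values 6/15/30 in the log are chamfer-distance matching thresholds, and a prediction counts as a true positive when its chamfer distance to a ground-truth instance falls below the threshold. The sketch below is a simplified illustration of that matching rule, not the exact implementation used by the VMA evaluation code; the point coordinates and units are assumptions.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric mean nearest-neighbour distance between two point sets of shape (N, 2) and (M, 2)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # (N, M) pairwise distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

pred = np.array([[0.0, 0.0], [10.0, 0.5], [20.0, 1.0]])   # hypothetical predicted lane points
gt   = np.array([[0.0, 1.0], [10.0, 1.0], [20.0, 1.0]])   # hypothetical ground-truth lane points

dist = chamfer_distance(pred, gt)
for thr in (6, 15, 30):
    print(f"threshhold {thr}: match = {dist < thr}")       # TP if chamfer distance is below the threshold
```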

EchoQiHeng commented 1 year ago

I have also encountered this situation. Have you resolved it?

+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 58   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 25   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+

all 0

zyc10ud commented 1 year ago

It looks like there are some bugs in the training code, and I will fix them as soon as possible. Please be patient.

zyc10ud commented 1 year ago

I think the reason all the metrics from the SD dataset evaluation are zero is that the dataset is too small, so the model lacks strong generalization capability. I recommend training on the icurb dataset or your own dataset instead. @363546178 @EchoQiHeng

363546178 commented 1 year ago

Hello, when I trained on the icurb dataset, I found that a core dump occurred at lines 164 and 165 of projects/mmdet3d_plugin/datasets/icurb_dataset.py. After commenting out these two lines, training continues normally. (A guarded variant of the normalization is sketched after the snippet below.)

```python
# icurb_dataset.py, L164-L165 -- commenting these two lines out avoids the core dump:
# multi_shifts_pts_tensor[:,:,0] /= self.max_x # normalize
# multi_shifts_pts_tensor[:,:,1] /= self.max_y

        if shifts_num > final_shift_num:
            index = np.random.choice(multi_shifts_pts.shape[0], final_shift_num, replace=False)
            multi_shifts_pts = multi_shifts_pts[index]

        multi_shifts_pts_tensor = to_tensor(multi_shifts_pts)
        multi_shifts_pts_tensor = multi_shifts_pts_tensor.to(
                        dtype=torch.float32)

        multi_shifts_pts_tensor[:,:,0] /= self.max_x # normalize
        multi_shifts_pts_tensor[:,:,1] /= self.max_y
```
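In case it helps with debugging, here is a guarded variant of those two normalization lines. It is only a sketch under two unverified assumptions: that the crash is triggered by an empty multi_shifts_pts_tensor or by max_x / max_y being zero. The normalize_shift_pts helper is hypothetical and not part of the repository.

```python
import torch

def normalize_shift_pts(pts: torch.Tensor, max_x: float, max_y: float) -> torch.Tensor:
    """Normalize (K, N, 2) shift points along x and y, guarding degenerate inputs."""
    if pts.numel() == 0:
        return pts                       # nothing to normalize
    if max_x == 0 or max_y == 0:
        raise ValueError(f"invalid normalization bounds: max_x={max_x}, max_y={max_y}")
    pts = pts.clone()                    # avoid in-place modification of shared storage
    pts[:, :, 0] /= max_x
    pts[:, :, 1] /= max_y
    return pts
```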

zyc10ud commented 1 year ago

Thank you for pointing out the problem! I have fixed the bug.