Closed 363546178 closed 12 months ago
Thank you for pointing out the issue in our code. I have now fixed the problem, and you can pull the latest code and give it a try.
Hello! I ran the updated code and it no longer crashes. However, the evaluation results after training are all 0. What might be the reason? I am training on a single RTX 3090. The config changes, system information, and log are as follows:
config change:
data = dict(
    samples_per_gpu=1,  #2
    workers_per_gpu=1,  #8

optimizer = dict(
    type='AdamW',
    lr=1.225e-5,
    paramwise_cfg=dict(
        custom_keys={
            'img_backbone': dict(lr_mult=0.1),
        }),
    weight_decay=0.005)  #0.01

lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=1000,  #500
    warmup_ratio=1.0 / 3,
    min_lr_ratio=1e-3)
system information:
sys.platform: linux
Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (Ubuntu 9.5.0-1ubuntu1~22.04) 9.5.0
PyTorch: 1.9.1+cu111
PyTorch compiling details: PyTorch built with:
the log is this:
Cls data formatting done in 1.046210s!! with /home/pc01/code/VMA/val/work_dirs/vma_res152_e80_line/Tue_Sep_12_16_11_37_2023/pts_bbox/cls_formatted.pkl
----------use metric:chamfer----------
----------threshhold:6----------
cls:lane done in 0.963024s!!
cls:curb done in 0.016371s!!
cls:stopline done in 0.001682s!!
+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+
+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
----------threshhold:15----------
cls:lane done in 0.964571s!!
cls:curb done in 0.016190s!!
cls:stopline done in 0.001796s!!
+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+
+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
----------threshhold:30----------
cls:lane done in 0.962664s!!
cls:curb done in 0.017725s!!
cls:stopline done in 0.001489s!!
+-----------------+-----------+--------+----------+
| class           | precision | recall | f1_score |
+-----------------+-----------+--------+----------+
| lane_direction  | 0.0       | 0.0    | 0.0      |
| lane_type       | 0.0       | 0.0    | 0.0      |
| lane_properties | 0.0       | 0.0    | 0.0      |
| lane_flag       | 0.0       | 0.0    | 0.0      |
| curb_type       | 0.0       | 0.0    | 0.0      |
+-----------------+-----------+--------+----------+
+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 41   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 51   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
lane: 0.0
curb: 0.0
stopline: 0.0
map: 0.0
2023-09-12 16:36:08,917 - mmdet - INFO - Exp name: vma_res152_e80_line.py
2023-09-12 16:36:08,917 - mmdet - INFO - Epoch(val) [80][7] SD_Line_Map_chamfer/lane_AP: 0.0000, SD_Line_Map_chamfer/curb_AP: 0.0000, SD_Line_Map_chamfer/stopline_AP: 0.0000, SD_Line_Map_chamfer/mAP: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_6: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_15: 0.0000, SD_Line_Map_chamfer/lane_AP_thr_30: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_6: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_15: 0.0000, SD_Line_Map_chamfer/curb_AP_thr_30: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_6: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_15: 0.0000, SD_Line_Map_chamfer/stopline_AP_thr_30: 0.0000
I have also encountered this situation. Have you resolved it?
+----------+-----+------+-----------+--------+-------+
| class    | gts | dets | precision | recall | ap    |
+----------+-----+------+-----------+--------+-------+
| lane     | 22  | 58   | 0.000     | 0.000  | 0.000 |
| curb     | 18  | 25   | 0.000     | 0.000  | 0.000 |
| stopline | 0   | 0    | 0.000     | 0.000  | 0.000 |
+----------+-----+------+-----------+--------+-------+
| mAP      |     |      |           |        | 0.000 |
+----------+-----+------+-----------+--------+-------+
all 0
It looks like there are some bugs in the training code, and I will fix them as soon as possible. Please be patient.
I think all the metrics from the SD dataset evaluation are zero because the dataset is too small, so the model does not generalize well. I recommend training on the icurb dataset or your own dataset instead. @363546178 @EchoQiHeng
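A zero AP with non-zero `dets` (for example 41 detections against 22 ground-truth lanes) means that no predicted line was matched to a ground-truth line at any of the chamfer thresholds 6, 15, or 30. To see how far off the predictions are, one could compute the distance from each ground-truth line to its closest prediction. The sketch below is only an illustration: the helper names `chamfer_distance`, `best_match_distances`, and `count_within` are hypothetical, it uses a symmetric mean chamfer distance, and the repository's evaluation code may define and threshold the metric differently.

```python
import math


def _directed(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred, gt):
    """Symmetric chamfer distance between two polylines given as (x, y) lists.

    Assumption: average of the two directed mean distances.
    """
    return 0.5 * (_directed(pred, gt) + _directed(gt, pred))


def best_match_distances(preds, gts):
    """For each ground-truth line, the distance to its closest prediction."""
    if not preds:
        return [math.inf for _ in gts]
    return [min(chamfer_distance(p, g) for p in preds) for g in gts]


def count_within(distances, thresholds=(6, 15, 30)):
    """Count how many ground-truth lines have a prediction within each threshold."""
    return {t: sum(d <= t for d in distances) for t in thresholds}


# Toy data: one ground-truth line, one far prediction and one near prediction.
gts = [[(0, 0), (10, 0)]]
preds = [[(0, 100), (10, 100)], [(0, 3), (10, 3)]]
distances = best_match_distances(preds, gts)
print(distances, count_within(distances))
```

If every best-match distance is far above 30, the predictions are not close to the ground truth at all, which would point to either an undertrained model or a coordinate-scale mismatch between predictions and annotations rather than a borderline threshold issue.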
Hello, when training on the icurb dataset, I hit a core dump at lines 164 and 165 of projects/mmdet3d_plugin/datasets/icurb_dataset.py. After commenting out these two lines, training continues normally.
`
L164 # multi_shifts_pts_tensor[:,:,0] /= self.max_x # normalize
L165 # multi_shifts_pts_tensor[:,:,1] /= self.max_y

if shifts_num > final_shift_num:
    index = np.random.choice(multi_shifts_pts.shape[0], final_shift_num, replace=False)
    multi_shifts_pts = multi_shifts_pts[index]
multi_shifts_pts_tensor = to_tensor(multi_shifts_pts)
multi_shifts_pts_tensor = multi_shifts_pts_tensor.to(
    dtype=torch.float32)
multi_shifts_pts_tensor[:,:,0] /= self.max_x # normalize
multi_shifts_pts_tensor[:,:,1] /= self.max_y
`
Thank you for pointing out the problem! I have fixed the bug.
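The thread does not say what caused the crash in the two normalization lines. When localizing a problem like this, it can help to validate the inputs before dividing, so that a bad extent or an unexpected point layout fails with a readable error. The following is a minimal sketch on plain Python lists, not the fix applied in the repository; `normalize_points` is a hypothetical helper, and the `[shift][point][x, y]` layout is an assumption based on the `[:,:,0]` and `[:,:,1]` indexing in the snippet above.

```python
import math


def normalize_points(shifts, max_x, max_y):
    """Return a normalized copy of nested [shift][point][x, y] coordinates.

    Raises ValueError if an extent is not a positive finite number or if a
    point does not have exactly two coordinates.
    """
    for name, value in (("max_x", max_x), ("max_y", max_y)):
        if not isinstance(value, (int, float)) or not math.isfinite(value) or value <= 0:
            raise ValueError(f"{name} must be a positive finite number, got {value!r}")

    normalized = []
    for s, points in enumerate(shifts):
        row = []
        for p, point in enumerate(points):
            if len(point) != 2:
                raise ValueError(
                    f"point [{s}][{p}] has {len(point)} coordinates, expected 2")
            x, y = point
            # Out-of-place division: the input is left untouched.
            row.append((x / max_x, y / max_y))
        normalized.append(row)
    return normalized


print(normalize_points([[(50, 100), (100, 200)]], max_x=100, max_y=200))
```

Returning a new structure instead of dividing in place keeps the original coordinates available for inspection if the check fails.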
Hello! I am using the SD dataset to train the line model. After preparing the dataset, I ran this command:
./tools/dist_train.sh ./projects/configs/vma_res152_e80_line.py 1
the error log is this:
KeyError: Caught KeyError in DataLoader worker process 0
I regenerated the SD line dataset following docs/prepare_dataset.md, but the problem persists. How can I troubleshoot the cause? Thank you!
The full error log is:
2023-09-11 16:55:43,906 - mmdet - INFO - Saving checkpoint at 5 epochs
[ ] 0/7, elapsed: 0s, ETA:
Traceback (most recent call last):
File "./tools/train.py", line 261, in
main()
File "./tools/train.py", line 250, in main
custom_train_model(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
custom_train_detector(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 212, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 78, in _do_evaluate
results = custom_multi_gpu_test(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/test.py", line 71, in custom_multi_gpu_test
for i, data in enumerate(data_loader):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in iCurb_collate
data['seq'] = [x[0] for x in batch]
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in
data['seq'] = [x[0] for x in batch]
KeyError: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 82570) of binary: /home/pc01/anaconda3/envs/vma/bin/python
Traceback (most recent call last):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
  time: 2023-09-11_16:55:52
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 82570)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
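A note on reading this traceback: `KeyError: 0` raised by `x[0]` in `iCurb_collate` most likely means that each `x` in the batch is a mapping (such as a dict) with no key `0`, rather than a tuple or list. The call path also goes through `after_train_epoch` and `_do_evaluate`, so the failure happens while iterating the validation data loader after the epoch-5 checkpoint, not during training itself. The sketch below reproduces that error pattern and shows a check that reports what each sample actually is. It is an illustration only: `collate_seq`, `describe_sample`, and `checked_collate` are hypothetical names, and `collate_seq` merely mimics the single line shown in the traceback, not the repository's full collate function.

```python
def collate_seq(batch):
    """Hypothetical stand-in for a collate step that indexes samples by position."""
    return {"seq": [x[0] for x in batch]}


def describe_sample(sample):
    """Summarize a sample's container type and keys or length for debugging."""
    if isinstance(sample, dict):
        return f"dict with keys {sorted(map(str, sample))}"
    if isinstance(sample, (list, tuple)):
        return f"{type(sample).__name__} of length {len(sample)}"
    return type(sample).__name__


def checked_collate(batch):
    """Fail early with a descriptive error if a sample is not a sequence."""
    for i, sample in enumerate(batch):
        if not isinstance(sample, (list, tuple)):
            raise TypeError(
                f"sample {i} is a {describe_sample(sample)}; expected a list or tuple")
    return collate_seq(batch)


# Tuple-style samples collate fine.
tuple_batch = [("seq_a", "extra"), ("seq_b", "extra")]
print(collate_seq(tuple_batch))

# Dict-style samples reproduce the error pattern from the log.
dict_batch = [{"img": 1, "img_metas": 2}]
try:
    collate_seq(dict_batch)
except KeyError as err:
    print("KeyError:", err)

# The checked version says what the sample actually was.
try:
    checked_collate(dict_batch)
except TypeError as err:
    print("TypeError:", err)
```

If the validation samples turn out to be dicts, the next thing to compare would be the dataset class and pipeline used for the `val` split against those used for `train` in the config, since the two splits would then be producing differently shaped samples for the same collate function.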