Hi, this looks like a torch.distributed error during training. Did you use a single GPU or multiple GPUs? More details are needed. Also, please make sure your conda environment is installed correctly and update mmdet to the latest version.
I used the distributed training command and it ran well, but after one day it suddenly stopped and showed the error above. When I ran the command again, it showed the same error.
(openmmlab) lbc@prust-System-3:~/mmdetection-master$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/deformable_detr/deformable_detr_twostage_refine_r50_16x2_50e_coco.py 4 --resume-from ./work_dirs/deformable_detr_twostage_refine_r50_16x2_50e_coco/latest.pth
Maybe you can re-install your conda environment. Also, please check whether some important CUDA files are missing from your machine.
libpng error: IDAT: CRC error
Traceback (most recent call last):
  File "./tools/train.py", line 189, in <module>
    main()
  File "./tools/train.py", line 178, in main
    train_detector(
  File "/home/lbc/mmdetection-master/mmdet/apis/train.py", line 211, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/lbc/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lbc/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lbc/mmdetection-master/mmdet/datasets/custom.py", line 195, in __getitem__
    data = self.prepare_train_img(idx)
  File "/home/lbc/mmdetection-master/mmdet/datasets/custom.py", line 218, in prepare_train_img
    return self.pipeline(results)
  File "/home/lbc/mmdetection-master/mmdet/datasets/pipelines/compose.py", line 41, in __call__
    data = t(data)
  File "/home/lbc/mmdetection-master/mmdet/datasets/pipelines/loading.py", line 73, in __call__
    results['img_shape'] = img.shape
AttributeError: 'NoneType' object has no attribute 'shape'
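Note the `libpng error: IDAT: CRC error` printed just before the traceback: it suggests that one of the training images is corrupt on disk, so the image loader returns `None` and the pipeline then fails on `img.shape` (or later on `.astype`, depending on which step touches the image first). A minimal sketch for locating the broken file, assuming a hypothetical `data/coco/train2017` image directory (adjust the path to your dataset):

```python
import os
import cv2

# Hypothetical dataset directory -- point this at your own images.
IMG_DIR = 'data/coco/train2017'

# cv2.imread returns None (rather than raising) for corrupt or
# unreadable files, which is exactly what the loading pipeline hits.
bad = []
for name in sorted(os.listdir(IMG_DIR)):
    path = os.path.join(IMG_DIR, name)
    if cv2.imread(path) is None:
        bad.append(path)

print(f'{len(bad)} unreadable image(s)')
print('\n'.join(bad))
```

Re-downloading or removing any file this reports should let training resume from the latest checkpoint.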
I installed other requirements (pip install re,,,.txt) and then it worked, but I really don't know why.
I met the same problem; it showed "AttributeError: 'NoneType' object has no attribute 'astype'", but the first 100 epochs were normal. Have you solved this?
Traceback (most recent call last):
  File "./tools/train.py", line 189, in <module>
    main()
  File "./tools/train.py", line 121, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/lbc/.local/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/lbc/.local/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 32, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: unknown error
[The same traceback is printed by each of the other three worker processes.]
Traceback (most recent call last):
  File "/home/lbc/yes/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lbc/yes/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/lbc/yes/envs/openmmlab/bin/python3.7', '-u', './tools/train.py', '--local_rank=3', 'configs/deformable_detr/deformable_detr_twostage_refine_r50_16x2_50e_coco.py', '--launcher', 'pytorch', '--resume-from', './work_dirs/deformable_detr_twostage_refine_r50_16x2_50e_coco/latest.pth']' returned non-zero exit status 1.
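Since every rank dies in the NCCL `barrier()` before training even starts, it may be worth ruling out a broken CUDA setup (for example a driver update without a reboot) before re-installing the whole environment. A minimal sanity check in plain PyTorch, independent of mmdetection:

```python
import torch
import torch.distributed as dist

# Fail-fast checks: these surface driver/runtime breakage directly.
print('torch:', torch.__version__)
print('cuda available:', torch.cuda.is_available())
print('device count:', torch.cuda.device_count())
print('nccl available:', dist.is_nccl_available())

for i in range(torch.cuda.device_count()):
    # Allocating a tensor on each device exercises the CUDA runtime
    # per GPU and surfaces "unknown error"-style failures early.
    x = torch.ones(1, device=f'cuda:{i}')
    print(i, torch.cuda.get_device_name(i), x.item())
```

If this script fails with the same "unknown error", the problem is in the driver/CUDA installation rather than in mmdetection; a reboot or driver reinstall often clears it.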