open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

subprocess.CalledProcessError: Command,,,,,,,,, returned non-zero exit status 1 #6595

Closed Williamlizl closed 2 years ago

Williamlizl commented 2 years ago

```
Traceback (most recent call last):
  File "./tools/train.py", line 189, in <module>
    main()
  File "./tools/train.py", line 121, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/home/lbc/.local/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/home/lbc/.local/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 32, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: CUDA error: unknown error
```

(the same traceback is printed by each of the four worker processes)

```
Traceback (most recent call last):
  File "/home/lbc/yes/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lbc/yes/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/lbc/.local/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/lbc/yes/envs/openmmlab/bin/python3.7', '-u', './tools/train.py', '--local_rank=3', 'configs/deformable_detr/deformable_detr_twostage_refine_r50_16x2_50e_coco.py', '--launcher', 'pytorch', '--resume-from', './work_dirs/deformable_detr_twostage_refine_r50_16x2_50e_coco/latest.pth']' returned non-zero exit status 1.
```
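
The failure happens inside `dist.init_process_group()` / `barrier()`, before any mmdetection code runs. As a rough way to isolate it, here is a minimal standalone script (my own sketch, not part of the repo; `check_dist.py` is just a placeholder name) that exercises the same NCCL init path as mmcv's `_init_dist_pytorch`. It can be launched with `python -m torch.distributed.launch --nproc_per_node=4 check_dist.py`:

```python
# check_dist.py -- minimal sketch, not part of mmdetection.
# Reproduces the failing call chain (init_process_group -> barrier) so the
# CUDA/NCCL stack can be tested without the full training pipeline.
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank=N to every worker process
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its GPU, then init the NCCL process group using the
    # MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE env vars set by the launcher.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl')

    # Same synchronization point that raised "CUDA error: unknown error" above.
    dist.barrier()
    print(f'rank {dist.get_rank()}: barrier OK on cuda:{args.local_rank}')


if __name__ == '__main__':
    main()
```

If this small script fails with the same `unknown error`, the problem is in the driver/CUDA/NCCL setup rather than in mmdetection.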

BIGWangYuDong commented 2 years ago

Hi, this looks like a torch distributed error during training. Did you use a single GPU or multiple GPUs? More details are needed. Also, please make sure your conda environment is installed correctly and update mmdet to the latest version.

Williamlizl commented 2 years ago

> Hi, this looks like a torch distributed error during training. Did you use a single GPU or multiple GPUs? More details are needed. Also, please make sure your conda environment is installed correctly and update mmdet to the latest version.

I used the distributed training command and it ran well, but after one day it suddenly stopped and showed the error above. When I run the command again, it shows the same error:

```
(openmmlab) lbc@prust-System-3:~/mmdetection-master$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/deformable_detr/deformable_detr_twostage_refine_r50_16x2_50e_coco.py 4 --resume-from ./work_dirs/deformable_detr_twostage_refine_r50_16x2_50e_coco/latest.pth
```

BIGWangYuDong commented 2 years ago

Maybe you can reinstall your conda environment. Also, please check whether any important CUDA files on your machine are missing or corrupted.
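
For example, a quick standalone check like the following (a sketch, independent of mmdetection) can tell whether each visible GPU can still allocate memory and run a kernel:

```python
# gpu_check.py -- minimal sketch to verify the CUDA runtime on each GPU.
import torch


def check_gpus():
    assert torch.cuda.is_available(), 'CUDA is not available at all'
    for i in range(torch.cuda.device_count()):
        # Allocate a tensor and force a real kernel launch on this device.
        x = torch.randn(1024, 1024, device=f'cuda:{i}')
        y = x @ x
        torch.cuda.synchronize(i)
        print(f'cuda:{i} ({torch.cuda.get_device_name(i)}): OK, '
              f'checksum={y.sum().item():.2f}')


if __name__ == '__main__':
    check_gpus()
```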

Williamlizl commented 2 years ago

```
libpng error: IDAT: CRC error
Traceback (most recent call last):
  File "./tools/train.py", line 189, in <module>
    main()
  File "./tools/train.py", line 178, in main
    train_detector(
  File "/home/lbc/mmdetection-master/mmdet/apis/train.py", line 211, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/lbc/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/lbc/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 47, in train
    for i, data_batch in enumerate(self.data_loader):
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lbc/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/lbc/mmdetection-master/mmdet/datasets/custom.py", line 195, in __getitem__
    data = self.prepare_train_img(idx)
  File "/home/lbc/mmdetection-master/mmdet/datasets/custom.py", line 218, in prepare_train_img
    return self.pipeline(results)
  File "/home/lbc/mmdetection-master/mmdet/datasets/pipelines/compose.py", line 41, in __call__
    data = t(data)
  File "/home/lbc/mmdetection-master/mmdet/datasets/pipelines/loading.py", line 73, in __call__
    results['img_shape'] = img.shape
AttributeError: 'NoneType' object has no attribute 'shape'
```

I switched to a different set of requirements (`pip install re,,,.txt`) and then it could work, but I really don't know why.
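
In case it helps anyone else: the `libpng error: IDAT: CRC error` together with `img` being `None` points at a corrupted or truncated image file on disk. Here is a rough sketch to scan a dataset for images that fail to decode (the directory `data/coco/train2017` is just a placeholder for wherever the training images live):

```python
# find_broken_images.py -- rough sketch, paths are placeholders.
import glob
import os

import cv2


def find_broken_images(img_dir, exts=('.jpg', '.jpeg', '.png')):
    """Return the paths of image files that cv2 cannot decode."""
    broken = []
    for path in glob.glob(os.path.join(img_dir, '**', '*'), recursive=True):
        if not path.lower().endswith(exts):
            continue
        img = cv2.imread(path)  # returns None for truncated/corrupted files
        if img is None:
            broken.append(path)
            print('broken:', path)
    return broken


if __name__ == '__main__':
    find_broken_images('data/coco/train2017')
```

Replacing or removing the files reported as broken (and updating the annotations accordingly) should stop the DataLoader worker from crashing.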

jewelc92 commented 2 years ago

I met the same problem; it showed "AttributeError: 'NoneType' object has no attribute 'astype'", but the first 100 epochs ran normally. Have you solved this?

Williamlizl commented 2 years ago

> I met the same problem; it showed "AttributeError: 'NoneType' object has no attribute 'astype'", but the first 100 epochs ran normally. Have you solved this?

I switched to a different set of requirements (`pip install re,,,.txt`) and then it could work, but I really don't know why.