hnuzhy / SSDA-YOLO

Codes for my paper "SSDA-YOLO: Semi-supervised Domain Adaptive YOLO for Cross-Domain Object Detection"

I only have two GPUs and set `--device 0,1` when training the model, but I get the error "insufficient CUDA devices for DDP command". How can I solve it? #5

Closed liuhaolinwen closed 1 year ago

liuhaolinwen commented 1 year ago

```
(base) liuhaolin@ubuntu18:/sdb/liuhaolin/SSDA-YOLO$ python -m torch.distributed.launch --nproc_per_node 4 ssda_yolov5_train.py --weights weights/yolov5l.pt --data yamls_sda/pascalvoc0712_clipart1k_VOC.yaml --name voc2clipart_ssda_960_yolov5l --img 960 --device 0,1 --batch-size 24 --epochs 200 --lambda_weight 0.005 --consistency_loss --alpha_weight 2.0
/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

LOCAL_RANK 1 cuda 2 1
LOCAL_RANK 2 cuda 2 2
Traceback (most recent call last):
  File "ssda_yolov5_train.py", line 833, in <module>
    main(opt)
  File "ssda_yolov5_train.py", line 816, in main
    assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'
AssertionError: insufficient CUDA devices for DDP command
LOCAL_RANK 3 cuda 2 3
Traceback (most recent call last):
  File "ssda_yolov5_train.py", line 833, in <module>
    main(opt)
  File "ssda_yolov5_train.py", line 816, in main
    assert torch.cuda.device_count() > LOCAL_RANK, 'insufficient CUDA devices for DDP command'
AssertionError: insufficient CUDA devices for DDP command
LOCAL_RANK 0
train: weights=weights/yolov5l.pt, cfg=, data=yamls_sda/pascalvoc0712_clipart1k_VOC.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=200, batch_size=24, img_size=[960], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket=, cache_images=False, image_weights=False, device=0,1, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=voc2clipart_ssda_960_yolov5l, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, teacher_alpha=0.99, conf_thres=0.5, iou_thres=0.3, max_gt_boxes=20, lambda_weight=0.005, consistency_loss=True, alpha_weight=2.0, student_weight=None, teacher_weight=None, save_dir=None
github: ⚠️ WARNING: code is out of date by 1 commit. Use 'git pull' to update or 'git clone https://github.com/hnuzhy/SSDA-YOLO' to download latest.
YOLOv5 🚀 57d8bc3 torch 1.12.1+cu102 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                     CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
Added key: store_based_barrier_key:1 to store for rank: 0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79246 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79247 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 79248) of binary: /sdb/anaconda/bin/python
Traceback (most recent call last):
  File "/sdb/anaconda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/sdb/anaconda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/sdb/anaconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
ssda_yolov5_train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-11-29_12:17:00
  host      : ubuntu18
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 79249)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-29_12:17:00
  host      : ubuntu18
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 79248)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

hnuzhy commented 1 year ago

You may change `--nproc_per_node 4` to `--nproc_per_node 2`; this argument indicates the number of GPU cards to use. With `--nproc_per_node 4`, the launcher spawns four processes with LOCAL_RANK 0 through 3, but `torch.cuda.device_count()` is only 2 on your machine, so ranks 2 and 3 fail the assertion at line 816 of `ssda_yolov5_train.py`.
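For reference, a corrected launch command for a 2-GPU machine might look like the sketch below. It reuses exactly the arguments from the log above and only changes `--nproc_per_node`; adjust paths and hyperparameters to your own setup.

```bash
# Launch 2 processes, one per visible GPU (0 and 1).
# The training script asserts torch.cuda.device_count() > LOCAL_RANK,
# so the number of launched processes must not exceed the number of GPUs.
python -m torch.distributed.launch --nproc_per_node 2 ssda_yolov5_train.py \
    --weights weights/yolov5l.pt \
    --data yamls_sda/pascalvoc0712_clipart1k_VOC.yaml \
    --name voc2clipart_ssda_960_yolov5l \
    --img 960 --device 0,1 --batch-size 24 --epochs 200 \
    --lambda_weight 0.005 --consistency_loss --alpha_weight 2.0
```

As the FutureWarning in the log notes, `torch.distributed.launch` is deprecated in favor of `torchrun`; switching launchers would presumably also require the script to read `LOCAL_RANK` from `os.environ` instead of the `--local_rank` argument, so the command above keeps the original launcher.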