Closed zhang5957 closed 3 years ago
I hit the same error too. In my torch version, dist.barrier() does not accept the keyword argument "device_ids"; it only takes the keyword arguments "group" and "async_op". I'm on torch 1.7.1 with CUDA 10.1.
Did you solve it?
-_-||| Not yet. I'll try some other released versions of v5.
You just need to upgrade your PyTorch to 1.8.0 or a higher version. This worked for me.
OK, thank you, but torch 1.8.0 requires a higher CUDA version than 10.1. I'll try.
Or maybe you can modify the original code in torch_utils.py: just change 'dist.barrier(device_ids=[0])' to 'dist.barrier()'. I'm not sure whether it will work, but you can give it a try.
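A minimal sketch of a variant of this workaround, assuming you would rather not hard-code either form: inspect the barrier function and pass device_ids only when its signature accepts it. The helper name barrier_compat and the two stub functions are hypothetical; the stubs only mimic the 1.7-style and 1.8+-style signatures described in this thread, and in torch_utils.py you would pass dist.barrier itself. This has not been tested against a real DDP run.

```python
import inspect


def barrier_compat(barrier_fn, local_rank):
    """Call barrier_fn, passing device_ids only if its signature accepts it.

    barrier_fn is expected to be torch.distributed.barrier (assumption);
    any callable with a compatible signature works for illustration.
    """
    params = inspect.signature(barrier_fn).parameters
    if "device_ids" in params:
        return barrier_fn(device_ids=[local_rank])
    return barrier_fn()


# Hypothetical stand-ins for the two signatures mentioned in the thread.
def old_style_barrier(group=None, async_op=False):
    return "called without device_ids"


def new_style_barrier(group=None, async_op=False, device_ids=None):
    return f"called with device_ids={device_ids}"


if __name__ == "__main__":
    print(barrier_compat(old_style_barrier, local_rank=0))
    print(barrier_compat(new_style_barrier, local_rank=0))
```

With real PyTorch the call would look like barrier_compat(dist.barrier, local_rank), so the same training script could run on either side of the 1.8.0 boundary.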
@zhang5957 @fosaken @Jacqueline121 I would recommend running all DDP trainings in our Docker image, which should resolve this issue. In torch 1.9, dist.barrier() returns a warning (one warning per worker) if device_ids is not passed as an argument.
We could introduce version-specific code here, but we don't do this anywhere else in the repository, so I'm not sure it's appropriate. An alternative would be to run check_requirements('torch>=1.8.0', install=False) for DDP runs.
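To illustrate what a version gate of this kind has to decide, here is a rough sketch, assuming the 1.8.0 threshold mentioned in this thread. The function names version_tuple and supports_device_ids are hypothetical and are not part of the repository or of PyTorch; the version string would normally come from torch.__version__.

```python
import re


def version_tuple(version):
    """Turn a string like '1.7.1+cu101' into (1, 7, 1).

    Only the leading numeric release part is used; local or build
    suffixes such as '+cu101' are ignored. Short versions are padded
    with zeros so that '1.8' compares equal to '1.8.0'.
    """
    release = re.match(r"\d+(?:\.\d+)*", version)
    if release is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = [int(part) for part in release.group().split(".")]
    return tuple((parts + [0, 0, 0])[:3])


def supports_device_ids(torch_version, minimum="1.8.0"):
    """Return True if torch_version is at least the assumed minimum."""
    return version_tuple(torch_version) >= version_tuple(minimum)


if __name__ == "__main__":
    for v in ("1.7.1+cu101", "1.8.0", "1.10.0"):
        print(v, supports_device_ids(v))
```

Comparing integer tuples instead of raw strings matters here: as strings, '1.10.0' sorts before '1.8.0', which would wrongly reject a newer release.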
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
Hello, sir. How can I use DDP to train on my own dataset with PyTorch 1.7? I have no permission to use Docker, and my machine supports PyTorch 1.7 at most. Can you give me some advice? I'd really appreciate your reply! @glenn-jocher
@yustaub Docker is recommended for DDP training. See Multi-GPU tutorial for details:
Traceback (most recent call last):
File "d:\KPI\main.py", line 14, in
help
@Solchanrefqialhabib you need to specify the weights without '='. Here's the correct way to initialize the YOLO model:
model = YOLO(r'D:\KPI\dataset\yolov5s.pt')  # raw string keeps the Windows backslashes literal
Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main(opt)
  File "train.py", line 500, in main
    train(opt.hyp, opt, device)
  File "train.py", line 98, in train
    with torch_distributed_zero_first(RANK):
  File "/home/zyc/miniconda3/envs/yolov5/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/zyc/competition/leaves/yolov53/utils/torch_utils.py", line 38, in torch_distributed_zero_first
    dist.barrier(device_ids=[local_rank])
TypeError: barrier() got an unexpected keyword argument 'device_ids'
And this is my command:
python -m torch.distributed.launch --nproc_per_node 2 train.py --weights /home/zyc/competition/leaves/yolov5/models/yolov5l.pt --epochs 100 --device 2,3 --linear-lr --name yolov5l_kaggle_ult --batch-size 16 --img-size 640 --label-smoothing 0.2