ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
50.76k stars, 16.34k forks

When training with the latest YOLOv5, I got an error: barrier() got an unexpected keyword argument 'device_ids' #4480

Closed zhang5957 closed 3 years ago

zhang5957 commented 3 years ago

Traceback (most recent call last):
  File "train.py", line 602, in <module>
    main(opt)
  File "train.py", line 500, in main
    train(opt.hyp, opt, device)
  File "train.py", line 98, in train
    with torch_distributed_zero_first(RANK):
  File "/home/zyc/miniconda3/envs/yolov5/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/zyc/competition/leaves/yolov53/utils/torch_utils.py", line 38, in torch_distributed_zero_first
    dist.barrier(device_ids=[local_rank])
TypeError: barrier() got an unexpected keyword argument 'device_ids'

And this is my command:

python -m torch.distributed.launch --nproc_per_node 2 train.py --weights /home/zyc/competition/leaves/yolov5/models/yolov5l.pt --epochs 100 --device 2,3 --linear-lr --name yolov5l_kaggle_ult --batch-size 16 --img-size 640 --label-smoothing 0.2

fosaken commented 3 years ago

I hit the same error too. In my torch version, dist.barrier() does not accept the keyword argument "device_ids"; it only takes the two keyword arguments "group" and "async_op". I am on torch 1.7.1 with CUDA 10.1.
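A quick way to confirm this on a given install is to inspect the function's signature. The sketch below uses two stand-in functions that mimic the signatures described in this thread (they are not the real torch.distributed.barrier), and accepts_kwarg is a hypothetical helper name. With torch installed, the same helper could be pointed at dist.barrier instead.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins (assumptions for illustration only), shaped like the signatures
# discussed above: the older one has no `device_ids`, the newer one does.
def barrier_old(group=None, async_op=False):
    return None


def barrier_new(group=None, async_op=False, device_ids=None):
    return None


old_ok = accepts_kwarg(barrier_old, "device_ids")
new_ok = accepts_kwarg(barrier_new, "device_ids")
print("old accepts device_ids:", old_ok)
print("new accepts device_ids:", new_ok)
```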

zhang5957 commented 3 years ago

Did you solve it?

fosaken commented 3 years ago

> Did you solve it?

-_-||| Not yet. I'll try some other released versions of v5.

Jacqueline121 commented 3 years ago

You just need to upgrade your PyTorch to 1.8.0 or a higher version. This worked for me.

fosaken commented 3 years ago

> You just need to upgrade your PyTorch to 1.8.0 or a higher version. This worked for me.

OK, thank you. But torch 1.8.0 requires a higher CUDA version than 10.1. I'll try.

Jacqueline121 commented 3 years ago

Or maybe you can modify the original code in torch_utils.py: just change 'dist.barrier(device_ids=[0])' to 'dist.barrier()'. I'm not sure whether it will work, but you can give it a try.
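The same edit can also be written so that one file runs on both older and newer installs. This is only a sketch, not the repository's code: barrier_compat is a hypothetical helper, and the barrier function is passed in as an argument so the example runs without torch or a process group. In real use one would pass torch.distributed.barrier.

```python
def barrier_compat(barrier_fn, local_rank):
    """Call `barrier_fn` with device_ids if it is supported, otherwise without."""
    try:
        return barrier_fn(device_ids=[local_rank])
    except TypeError as err:
        # Fall back only for the "unexpected keyword argument" failure;
        # any other TypeError is re-raised unchanged.
        if "device_ids" not in str(err):
            raise
        return barrier_fn()


# Stand-ins (assumptions for illustration only) that record how they were called.
calls = []


def old_barrier(group=None, async_op=False):
    calls.append("old: no device_ids")


def new_barrier(group=None, async_op=False, device_ids=None):
    calls.append(f"new: device_ids={device_ids}")


barrier_compat(old_barrier, 0)
barrier_compat(new_barrier, 1)
print(calls)
```

Matching on the error message is a blunt instrument; checking the signature up front is an alternative if the message format is a concern.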

glenn-jocher commented 3 years ago

@zhang5957 @fosaken @Jacqueline121 I would recommend running all DDP training in our Docker image, which should resolve this issue. In torch 1.9, dist.barrier() emits a warning (one per worker) if device_ids is not passed as an argument.

We could introduce version-specific code here, but we don't do this anywhere else in the repository, so I'm not sure if it's appropriate here. An alternative would be to run check_requirements('torch>=1.8.0', install=False) on DDP.
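One way to picture the version check mentioned here is a small guard that runs before DDP starts. The sketch below is not check_requirements: parse_version and require_min_version are hypothetical helpers, and the version strings are passed in by hand so that it runs without torch installed (in practice the string would come from torch.__version__).

```python
import re


def parse_version(text):
    """Turn '1.7.1+cu101' into (1, 7, 1), ignoring any build suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch number (as in '1.8') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def require_min_version(current, minimum, name="torch"):
    """Raise RuntimeError if `current` is older than `minimum`."""
    if parse_version(current) < parse_version(minimum):
        raise RuntimeError(f"{name}>={minimum} is required for DDP, found {current}")


require_min_version("1.9.0+cu111", "1.8.0")  # new enough, no error

try:
    require_min_version("1.7.1", "1.8.0")
except RuntimeError as err:
    print(err)
```

This simple parser ignores pre-release tags, so a real check would likely lean on a proper version-parsing library instead.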

github-actions[bot] commented 3 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

yustaub commented 2 years ago

Hello, sir. How can I use DDP to train on my own dataset with PyTorch 1.7? I do not have permission to use Docker, and my machine supports PyTorch 1.7 at most. Can you give me some advice? I would very much appreciate your reply! @glenn-jocher

glenn-jocher commented 2 years ago

@yustaub Docker is recommended for DDP training. See the Multi-GPU tutorial in the YOLOv5 Tutorials for details.

Solchanrefqialhabib commented 12 months ago

Traceback (most recent call last):
  File "d:\KPI\main.py", line 14, in <module>
    model = YOLO(weights='D:\KPI\dataset\yolov5s.pt')
TypeError: __init__() got an unexpected keyword argument 'weights'

help

glenn-jocher commented 12 months ago

@Solchanrefqialhabib pass the weights path as the first positional argument instead of using the 'weights=' keyword. Here's the correct way to initialize the YOLO model:

# The raw string (r'...') keeps the backslashes in the Windows path literal.
model = YOLO(r'D:\KPI\dataset\yolov5s.pt')