ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Multi GPU RuntimeError: Model replicas must have an equal number of parameters. #11

Closed lhwcv closed 4 years ago

lhwcv commented 4 years ago

🐛 Bug

Training on 4× 2080 Ti GPUs fails with "RuntimeError: Model replicas must have an equal number of parameters." (Training on 1 GPU works.)

To Reproduce

REQUIRED: Code to reproduce your issue below

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --device 0,1,2,3 --data coco.yaml --cfg yolov3-spp.yaml --weights '' --batch-size 64


## Expected behavior
Training should run on 4 GPUs without this error.

## Environment
 - OS: [Ubuntu 18.04]
 - GPU [4* 2080 Ti]
 - packages: match requirements.txt

lhwcv commented 4 years ago

It may be a pytorch==1.5 version problem; 1.4 works. Closed!

glenn-jocher commented 4 years ago

@lhwcv I'm not able to reproduce your issue. I tried with our docker container (with pytorch 1.5), and training operates correctly with your command with 4 GPUs:

[Image: Screen Shot 2020-06-03 at 12 23 41 AM]

glenn-jocher commented 4 years ago

Note: this may have been fixed by the fix applied for #15.

glenn-jocher commented 4 years ago

> It may be a pytorch==1.5 version problem; 1.4 works. Closed!

Closing as the original issue seems to be resolved.

lucasjinreal commented 4 years ago

Not yet; the official PyTorch 1.5 release still has this issue:

/usr/local/lib/python3.6/dist-packages/torch/serialization.py:657: SourceChangeWarning: source code of class 'models.yolo.Model' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py:303: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. NB: There is a known issue in nn.parallel.replicate that prevents a single DDP instance to operate on multiple model replicas.
  "Single-Process Multi-GPU is not the recommended mode for "
Traceback (most recent call last):
  File "train.py", line 399, in <module>
    train(hyp)
  File "train.py", line 155, in train
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 287, in __init__
    self._ddp_init_helper()
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 380, in _ddp_init_helper
    expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.
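
The UserWarning in the log above recommends one DDP instance per device instead of single-process multi-GPU mode. The following is a minimal sketch of that arrangement, not the fix applied in this repository. The names `ddp_kwargs`, `worker`, and `main`, the toy `Linear` model, and the rendezvous address and port are hypothetical and chosen only for illustration. It assumes PyTorch is installed and falls back to the `gloo` backend with two CPU processes when CUDA is unavailable.

```python
import os


def ddp_kwargs(local_rank: int, use_cuda: bool) -> dict:
    """Keyword arguments that pin one DDP instance to a single device."""
    if use_cuda:
        return {"device_ids": [local_rank], "output_device": local_rank}
    return {}  # CPU modules are wrapped without device_ids


def worker(rank: int, world_size: int) -> None:
    # Imported lazily so ddp_kwargs stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Hypothetical rendezvous settings for a single-machine run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            rank=rank, world_size=world_size)
    device = torch.device("cuda", rank) if use_cuda else torch.device("cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Toy stand-in for the real model; each process holds exactly one replica.
    model = torch.nn.Linear(8, 2).to(device)
    model = DDP(model, **ddp_kwargs(rank, use_cuda))

    loss = model(torch.randn(4, 8, device=device)).sum()
    loss.backward()  # gradients are all-reduced across the processes
    dist.destroy_process_group()


def main() -> None:
    import torch
    import torch.multiprocessing as mp

    # One process per visible GPU; two CPU processes if CUDA is unavailable.
    n = torch.cuda.device_count() if torch.cuda.is_available() else 2
    mp.spawn(worker, args=(n,), nprocs=n)
```

Calling `main()` from a script guarded by `if __name__ == "__main__":` starts the processes. Because each process wraps a single replica, this layout does not go through the multi-replica code path that the warning describes.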

mingmmq commented 4 years ago

I see the same issue with a custom dataset and the pre-trained yolov5x.pt file:

RuntimeError: Model replicas must have an equal number of parameters.

glenn-jocher commented 4 years ago

I've reopened this, as the issue appears to still be present.

@mingmmq could you supply code to reproduce your issue? Is it reproducible on the coco128.yaml dataset?

intgogo commented 4 years ago

I have the same problem with my custom dataset (24 classes).

tomjerrygithub commented 4 years ago

I have the same problem with my custom dataset (11 classes).

JierunChen commented 4 years ago

Try downgrading PyTorch from 1.5 to 1.4. It works for me.

Lornatang commented 4 years ago

Run

pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html

to fix "Model replicas must have an equal number of parameters."

Alternatively, see https://github.com/pytorch/pytorch/pull/36503. The bug was fixed in that pull request, but you must manually build PyTorch==1.5+cu102.
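
Before launching multi-GPU training, the installed version can be checked against the reports in this thread. This is a minimal sketch under stated assumptions: `reported_affected` and `check_installed` are hypothetical helper names, and every 1.5.x build is treated as suspect because the thread only establishes that 1.5 fails and 1.4 works. Whether 1.5.1 is affected is not confirmed here.

```python
import re


def reported_affected(version: str) -> bool:
    """Return True if `version` is a PyTorch 1.5.x build.

    Assumption: all 1.5.x builds are flagged, since the thread only
    establishes that 1.5 fails and 1.4 works.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (1, 5)


def check_installed() -> None:
    # Imported lazily so reported_affected works without PyTorch installed.
    import torch

    if reported_affected(torch.__version__):
        print(f"torch {torch.__version__}: multi-GPU failure reported for 1.5 in this thread")
```

The check accepts version strings with a local build suffix such as `1.4.0+cu100`, because only the leading major and minor numbers are parsed.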

panchengl commented 4 years ago

Downgrading torch from 1.5 to 1.4 works.

glenn-jocher commented 4 years ago

@panchengl does the recently released 1.5.1 fix this?
