ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com

Multiple GPU support #48

Closed: HaxThePlanet closed this issue 4 years ago

HaxThePlanet commented 4 years ago

🚀 Feature

Multiple GPU support

Motivation

Increased performance!

Pitch

I just bought a 3-way p100 box, come on please :)

Alternatives

Google Compute TPU support?

Additional context

github-actions[bot] commented 4 years ago

Hello @HaxThePlanet, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open in Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

@HaxThePlanet good news: yolov5 supports multi-gpu out of the box. Some examples:

python train.py  # will use ALL available cuda resources found on system
python train.py --device 0,1  # specify devices
python train.py --device 0  # specify 1 device 
python train.py --device cpu  # force cpu usage

test.py works exactly the same way. detect.py accepts a --device argument, but is limited to 1 gpu.
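For context on what this looks like under the hood: single-process multi-GPU in PyTorch amounts to choosing a device from the --device string and replicating the model when several GPUs are visible. Below is a rough sketch of that pattern, using nn.DataParallel as the replication mechanism; the select_device helper and the nn.Linear model are illustrative stand-ins, not YOLOv5's actual code.

import os
import torch
import torch.nn as nn

def select_device(device=''):
    # device: '' -> all GPUs, 'cpu' -> CPU, '0' or '0,1' -> specific GPUs
    if device.lower() == 'cpu':
        return torch.device('cpu')
    if device:
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # restrict which GPUs CUDA sees
    assert torch.cuda.is_available(), 'CUDA device(s) not found'
    return torch.device('cuda:0')

device = select_device('0,1')                 # mirrors --device 0,1
model = nn.Linear(10, 10).to(device)          # placeholder for the detection model
if device.type != 'cpu' and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)            # replicate across all visible GPUs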

HaxThePlanet commented 4 years ago

Excellent, thanks for the fast response and hard work. This thing is amazing!

AIFAN-Lab commented 4 years ago

when I type the command:

python train.py --data coco.yaml --cfg yolov5s.yaml --weights '' --batch-size 16

it shows the following:

{'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.58, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.014, 'hsv_s': 0.68, 'hsv_v': 0.36, 'degrees': 0.0, 'translate': 0.0, 'scale': 0.5, 'shear': 0.0}
Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov5s.yaml', data='./data/coco.yaml', device='', epochs=300, evolve=False, img_size=[640, 640], multi_scale=False, name='', nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='')
Using CUDA Apex
device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device2 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device3 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device4 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device5 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device6 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
device7 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)
Optimizer groups: 54 .bias, 60 conv.weight, 51 other

The bug report is as below:

/share/home/xx/anaconda3/envs/pt1.5.0/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:303: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES. NB: There is a known issue in nn.parallel.replicate that prevents a single DDP instance to operate on multiple model replicas.
  "Single-Process Multi-GPU is not the recommended mode for "
Traceback (most recent call last):
  File "train.py", line 400, in <module>
    train(hyp)
  File "train.py", line 152, in train
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/share/home/xx/anaconda3/envs/pt1.5.0/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 287, in __init__
    self._ddp_init_helper()
  File "/share/home/xx/anaconda3/envs/pt1.5.0/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 380, in _ddp_init_helper
    expect_sparse_gradient)
RuntimeError: Model replicas must have an equal number of parameters.
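The UserWarning in this log points at the remedy PyTorch itself recommends: one DDP process per GPU rather than a single process spanning all of them. A minimal sketch of that pattern, assuming the PyTorch 1.5-era torch.distributed.launch launcher (the nn.Linear model is a placeholder, not YOLOv5's network):

# launched as: python -m torch.distributed.launch --nproc_per_node=8 ddp_sketch.py
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # supplied by the launcher
args = parser.parse_args()

dist.init_process_group(backend='nccl')   # one process per GPU
torch.cuda.set_device(args.local_rank)

model = nn.Linear(10, 10).cuda(args.local_rank)   # placeholder model
model = DDP(model, device_ids=[args.local_rank])  # single-device replica: no scatter/gather across GPUs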

glenn-jocher commented 4 years ago

@AIFAN-Lab thanks for the bug report. I tested on two GPUs today and everything worked well. Can you try to reproduce this in our docker image to see if it's an environment issue?

AIFAN-Lab commented 4 years ago

OK, I will test with the Docker image and report back later.

HaxThePlanet commented 4 years ago

Is it still necessary to train the first 1000 or so iterations on a single GPU?

glenn-jocher commented 4 years ago

@HaxThePlanet that's never been necessary.

liangshi036 commented 3 years ago

@HaxThePlanet good news: yolov5 supports multi-gpu out of the box. Some examples:

python train.py  # will use ALL available cuda resources found on system
python train.py --device 0,1  # specify devices
python train.py --device 0  # specify 1 device 
python train.py --device cpu  # force cpu usage

test.py works exactly the same way. detect.py accepts a --device argument, but is limited to 1 gpu.

Would you please support multi-GPU inference in detect.py?

glenn-jocher commented 3 years ago

@liangshi036 we don't have the resources to implement suggestions, but you can do this yourself and submit a PR!
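Until someone contributes that PR, a workable stopgap is to shard the inputs and run one inference process per GPU. Below is a rough sketch, assuming the torch.hub entry point and illustrative image paths; it is a workaround outside detect.py, not detect.py's actual implementation.

import torch
import torch.multiprocessing as mp

def worker(rank, shards):
    # one process per GPU: load a model on this device and run its shard
    model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
    model.to(f'cuda:{rank}')
    results = model(shards[rank])   # inference on this GPU's share of the images
    results.save()                  # write annotated images under runs/

if __name__ == '__main__':
    images = ['img0.jpg', 'img1.jpg', 'img2.jpg', 'img3.jpg']  # illustrative paths
    n = torch.cuda.device_count()
    shards = [images[i::n] for i in range(n)]  # round-robin split across GPUs
    mp.spawn(worker, args=(shards,), nprocs=n)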