ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Changing to Multi-process DistributedDataParallel #264

Closed: NanoCode012 closed this issue 4 years ago

NanoCode012 commented 4 years ago

Hello, @glenn-jocher

From your advice, multi-process DistributedDataParallel should perform better than the single-process setup we have now.
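
For context, this is a minimal sketch of the single-node, one-process-per-GPU DDP pattern under discussion (illustrative only; the run() helper, address, and port are assumptions, not code from the fork):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; NCCL backend for GPU training
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)  # stand-in for the YOLOv5 model
    model = DDP(model, device_ids=[rank])       # gradients are all-reduced across processes

    # ... training loop; each process reads its own data shard via DistributedSampler ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)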

I have been trying to implement it on the ddp branch of my fork (apologies that it's messy). However, I've run into many issues.

Since it's still in testing, I haven't accounted for the device being cpu yet.

What I did so far

Things to fix

Problems

Since I am still learning, it is very likely I messed up the training. The information learned in each epoch may not be shared among the processes, because when I tested training, the mAP was 0. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it (see the sketch below).
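
For reference, PyTorch's torch.nn.SyncBatchNorm can convert an existing model's BatchNorm layers in place; a minimal sketch, assuming the process group is already initialized and the toy Sequential model stands in for YOLOv5:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()  # this process's index in the group
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).cuda(rank)

# Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
# synchronized across all processes; do this before wrapping the model in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[rank])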

Saving checkpoints is done only by the first process, since concurrent saving from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct approach (a sketch follows).
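
A sketch of the usual rank-0-only save pattern (illustrative; 'last.pt' and the DDP-wrapped model from the snippet above are assumed):

import torch
import torch.distributed as dist

# Only rank 0 writes the checkpoint; concurrent writes from several
# processes can corrupt the file and break strip_optimizer later.
if dist.get_rank() == 0:
    # .module unwraps DDP so the saved keys carry no 'module.' prefix
    torch.save({'model': model.module.state_dict()}, 'last.pt')
dist.barrier()  # the other ranks wait here until rank 0 has finished saving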

In my tests it is currently much slower than using just one GPU, but I figure that once training is fixed, it could be a good speed boost for multi-GPU.

I also understand this isn't high on your priority list, but some guidance would be appreciated. Thank you.

glenn-jocher commented 4 years ago

@NanoCode012 ah yes, you are correct! This is the latest. The typical use case is to run it as a single command, i.e. put this in unit_tests.sh, run bash unit_tests.sh, and an exit code of 0 means everything passed.

I'm trying to automate all of this as part of the CI pipeline, using GitHub Actions for example, but for now it's a bit of a manual nightmare; we just run this in Colab as often as possible to make sure recent commits haven't broken anything on single-GPU or CPU.

git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../

export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python train.py --weights $x.pt --cfg $x.yaml --epochs 3 --img 320 --device 0,1  # train
  for di in 0,1 0 cpu # inference devices
  do
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done

NanoCode012 commented 4 years ago

While running the default code on 4 GPUs, I am quite confused why only the first GPU is using a large amount of memory. [screenshot of GPU memory usage]

Furthermore, the speed with 4 GPUs is about the same as with 1 GPU. I started both runs at the same time, and they are on the same epoch (34). It could be a CPU bottleneck on my machine, though (see the sketch below).
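
If the CPU is the bottleneck, one common check is to raise the DataLoader's worker count; a sketch with a hypothetical stand-in dataset, not the repo's dataloader:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 320, 320))  # stand-in for coco128

# More worker processes keep the GPUs fed when CPU-side decoding and
# augmentation are the bottleneck; pin_memory speeds up host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)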

MagicFrogSJTU commented 4 years ago

> While running the default code on 4 GPUs, I am quite confused why only the first GPU is using a large amount of memory. [screenshot of GPU memory usage]
>
> Furthermore, the speed with 4 GPUs is about the same as with 1 GPU. I started both runs at the same time, and they are on the same epoch (34). It could be a CPU bottleneck on my machine, though.

The 4-GPU DP run should be faster. GPU 0 using more memory in DP is normal: in DataParallel, GPU 0 scatters the inputs, gathers every replica's outputs, and computes the loss, so it holds extra tensors.
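
To see the imbalance directly, per-GPU allocation can be printed mid-training; a minimal sketch (not from this thread):

import torch

# In nn.DataParallel, GPU 0 scatters the inputs, gathers every replica's
# outputs and computes the loss, so it holds extra tensors.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 1024 ** 2
    print(f'cuda:{i}: {mib:.0f} MiB allocated')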

NanoCode012 commented 4 years ago

Fixed in #401