ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Changing to Multi-process DistributedDataParallel #264

Closed: NanoCode012 closed this issue 4 years ago

NanoCode012 commented 4 years ago

Hello, @glenn-jocher

From your advice, multi-process DistributedDataParallel should perform better than the single-process setup we have now.
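
For context, this is a minimal sketch of the single-node, one-process-per-GPU DDP pattern under discussion (illustrative only; the run() helper, address, and port are assumptions, not code from the fork):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; NCCL backend for GPU training
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 10).cuda(rank)  # stand-in for the YOLOv5 model
    model = DDP(model, device_ids=[rank])       # gradients are all-reduced across processes

    # ... training loop; each process reads its own data shard via DistributedSampler ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)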

I have been trying to implement it on the ddp branch of my fork (apologies that it's messy). However, I've run into many issues.

Since it's still in testing, I haven't accounted for the device being cpu yet.

What I did so far

Things to fix

Problems

Since I am still learning, it is very likely I messed up the training. The information learned in each epoch may not be shared among the processes, because when I tested training, the mAP was 0. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it (see the sketch below).
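
For reference, PyTorch's torch.nn.SyncBatchNorm can convert an existing model's BatchNorm layers in place; a minimal sketch, assuming the process group is already initialized and the toy Sequential model stands in for YOLOv5:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()  # this process's index in the group
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).cuda(rank)

# Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
# synchronized across all processes; do this before wrapping the model in DDP.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[rank])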

Saving checkpoints is done only by the first process, since concurrent saving from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct approach (a sketch follows).
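
A sketch of the usual rank-0-only save pattern (illustrative; 'last.pt' and the DDP-wrapped model from the snippet above are assumed):

import torch
import torch.distributed as dist

# Only rank 0 writes the checkpoint; concurrent writes from several
# processes can corrupt the file and break strip_optimizer later.
if dist.get_rank() == 0:
    # .module unwraps DDP so the saved keys carry no 'module.' prefix
    torch.save({'model': model.module.state_dict()}, 'last.pt')
dist.barrier()  # the other ranks wait here until rank 0 has finished saving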

In my tests it is currently much slower than using just one GPU, but I figure that once training is fixed, it could be a good speed boost for multi-GPU.

I also understand this isn't high on your priority list, but some guidance would be appreciated. Thank you.

glenn-jocher commented 4 years ago

@NanoCode012 ah yes, you are correct! This is the latest. The typical use case is to run it as a single command, i.e. put this in unit_tests.sh, run bash unit_tests.sh, and an exit code of 0 means everything passed.

I'm trying to automate all of this as part of the CI pipeline, using GitHub Actions for example, but for now it's a bit of a manual nightmare; we just run this in Colab as often as possible to make sure recent commits haven't broken anything on single-GPU or CPU.

git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../

export PYTHONPATH="$PWD" # to run *.py files in subdirectories
for x in yolov5s #yolov5m yolov5l yolov5x # models
do
  python train.py --weights $x.pt --cfg $x.yaml --epochs 3 --img 320 --device 0,1  # train
  for di in 0,1 0 cpu # inference devices
  do
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di # test custom
  done
  python models/yolo.py --cfg $x.yaml # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1 # export
done

NanoCode012 commented 4 years ago

While running the default code on 4 GPUs, I am quite confused why only the first GPU is using a large amount of memory. [screenshot of GPU memory usage]

Furthermore, the speed with 4 GPUs is about the same as with 1 GPU. I started both runs at the same time, and they are on the same epoch (34). It could be a CPU bottleneck on my machine, though (see the sketch below).
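
If the CPU is the bottleneck, one common check is to raise the DataLoader's worker count; a sketch with a hypothetical stand-in dataset, not the repo's dataloader:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 320, 320))  # stand-in for coco128

# More worker processes keep the GPUs fed when CPU-side decoding and
# augmentation are the bottleneck; pin_memory speeds up host-to-GPU copies.
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)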

MagicFrogSJTU commented 4 years ago

> While running the default code on 4 GPUs, I am quite confused why only the first GPU is using a large amount of memory. [screenshot of GPU memory usage]
>
> Furthermore, the speed with 4 GPUs is about the same as with 1 GPU. I started both runs at the same time, and they are on the same epoch (34). It could be a CPU bottleneck on my machine, though.

The 4-GPU DP run should be faster. GPU 0 using more memory in DP is normal: in DataParallel, GPU 0 scatters the inputs, gathers every replica's outputs, and computes the loss, so it holds extra tensors.
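
To see the imbalance directly, per-GPU allocation can be printed mid-training; a minimal sketch (not from this thread):

import torch

# In nn.DataParallel, GPU 0 scatters the inputs, gathers every replica's
# outputs and computes the loss, so it holds extra tensors.
for i in range(torch.cuda.device_count()):
    mib = torch.cuda.memory_allocated(i) / 1024 ** 2
    print(f'cuda:{i}: {mib:.0f} MiB allocated')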

NanoCode012 commented 4 years ago

Fixed in #401