@NanoCode012 ah yes, you are correct! This is the latest. The typical use case is to drop this into a single command, i.e. put it in unit_tests.sh and then run bash unit_tests.sh; an exit code of 0 means everything passed.
I'm trying to automate all of this as part of the CI pipeline, using GitHub Actions for example, but for now it's a bit of a manual nightmare; we just run this in Colab as often as possible to make sure recent commits haven't broken anything on single-GPU or CPU.
```bash
git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -qr requirements.txt onnx
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv ./coco128 ../
export PYTHONPATH="$PWD"  # to run *.py files in subdirectories

for x in yolov5s  # yolov5m yolov5l yolov5x # models
do
  python train.py --weights $x.pt --cfg $x.yaml --epochs 3 --img 320 --device 0,1  # train
  for di in 0,1 0 cpu  # inference devices
  do
    python detect.py --weights $x.pt --device $di  # detect official
    python detect.py --weights runs/exp0/weights/last.pt --device $di  # detect custom
    python test.py --weights $x.pt --device $di  # test official
    python test.py --weights runs/exp0/weights/last.pt --device $di  # test custom
  done
  python models/yolo.py --cfg $x.yaml  # inspect
  python models/export.py --weights $x.pt --img 640 --batch 1  # export
done
```
While running the default code with 4 GPUs, I am quite confused why only the first GPU is using a large amount of memory.
Furthermore, the speed with 4 GPUs vs. 1 GPU is about the same. I set both runs going at the same time and they are on the same epoch (34). It could be a CPU bottleneck on my end, though.
The 4-GPU DP should be faster. GPU 0 using more memory in DP is normal.
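For context, a minimal sketch of what nn.DataParallel does (placeholder model, not the actual train.py code): the model is replicated to each GPU on every forward pass, but inputs are scattered from and outputs gathered back to the first device, and the loss and optimizer state live there too, which is why GPU 0 shows the highest memory usage.

```python
import torch
import torch.nn as nn

# Hypothetical standalone example, assuming 4 visible GPUs.
model = nn.Linear(512, 10).to('cuda:0')                  # placeholder model
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])  # replicas on GPUs 0-3

x = torch.randn(64, 512, device='cuda:0')  # batch is scattered across the GPUs
y = model(x)                               # outputs are gathered back on cuda:0
loss = y.sum()                             # loss lives on cuda:0 only
loss.backward()
```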
Fixed in #401
Hello @glenn-jocher,
From your advice, multi-process DistributedDataParallel should be better than the single-process DataParallel we have now.
I have been trying to implement it on the ddp branch of my fork (apologies that it's messy), but I've run into many issues.
Since it's still in testing, I haven't accounted for the device being cpu yet.
What I did so far
- Added a setup method that calls init_process_group and sets the torch.cuda device
- Called torch.multiprocessing.spawn on the modified train function (see the sketch after this list)
- Created a new argument called world_size to be passed when running the script (we can change this to counting the # of devices later)
- Added condition checks so that only one process downloads the weights file, removes batch.jpg, and saves checkpoints
- ~Added dist.barrier() while waiting for the first process to do its job~
- ~Replaced all .to(device) with .to(rank) for each process~
- ~Changed map_location for loading weights~
- Added more parameters to the train function because the processes cannot see the global variables
- Added a DistributedSampler for multiple GPUs so each one gets a different sample of the dataset
- ~Turned off tensorboard, as I needed to pass tb_writer to train as an argument to be able to use it~
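For reference, here is a rough, self-contained sketch of the pattern described above (init_process_group + torch.multiprocessing.spawn + DistributedSampler + rank-0-only checkpointing). The setup/train names follow the description; the placeholder model, data, port and hyperparameters are assumptions, not the actual code on the branch.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def setup(rank, world_size):
    # One process per GPU; rendezvous through MASTER_ADDR/MASTER_PORT
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)


def train(rank, world_size):
    setup(rank, world_size)

    # Placeholder model/data; each process puts the model on its own GPU and
    # wraps it in DDP (map_location matters here when resuming a checkpoint).
    model = DDP(nn.Linear(10, 1).to(rank), device_ids=[rank])
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))

    # DistributedSampler hands every rank a different shard of the dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle differently every epoch
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x.to(rank)), y.to(rank))
            optimizer.zero_grad()
            loss.backward()       # DDP all-reduces gradients across ranks here
            optimizer.step()

        if rank == 0:             # only the first process saves checkpoints
            torch.save(model.module.state_dict(), 'last_sketch.pt')
        dist.barrier()            # keep the ranks in step

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    # mp.spawn prepends each process's rank as the first argument to train()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```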
Things to fix
- Do not divide the dataset for the validation set in create_dataloader
- Reduce the need to pass world_size as an argument to say that we want multi-process
- ~Cleaning up~
- Fixing the inconsistent output prints (all processes printing at once makes it hard to track; see the sketch after this list)
- ~Enable tensorboard again~
- Splitting batch_size / learning rate / epochs for multiple GPUs
- Figure out why global variables are always recalled (I disabled print(hyp) because of this)
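On the duplicated console output, one common pattern (a sketch of a possible fix, not necessarily what the branch does) is to guard prints, logging, and file writes so that only the main process emits them:

```python
# Sketch: rank-guarded output, assuming `rank` is this process's rank
# (-1 for single-GPU/CPU runs, 0..world_size-1 under DDP).
def log(msg, rank=-1):
    if rank in (-1, 0):  # only the main process prints
        print(msg)

# The same guard works for side effects such as saving batch.jpg or checkpoints:
# if rank in (-1, 0):
#     torch.save(ckpt, 'last.pt')
```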
Problems
- Since I am still learning, it is very likely I messed up the training: the information learnt each epoch may not actually be shared between the processes, because when I tested, training sat at 0 mAP. I read somewhere that a SyncBatch layer may be needed, but I am not sure how to add it (see the sketch after this list).
- Saving checkpoints is done only by the first process, as multiple processes saving concurrently cause problems for strip_optimizer later on. I am not sure if this is the correct way.
- I am testing it and it is much slower than using just one GPU, but I figure that if the training is fixed, it can be a good boost for multi-GPU.
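On the SyncBatch question: DDP itself already all-reduces gradients, so SyncBatchNorm is only about computing BatchNorm statistics across all processes instead of per GPU. A minimal sketch of how it is usually added (assuming the process group is already initialized and `rank` is this process's GPU index):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_with_syncbn(model: nn.Module, rank: int) -> nn.Module:
    # Convert ordinary BatchNorm layers to SyncBatchNorm, then wrap in DDP.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(rank)
    return DDP(model, device_ids=[rank])
```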
I also understand that this isn't as high on your priority list, but maybe some guidance would be nice. Thank you.