Closed — valentinitnelav closed this issue 1 year ago
I ran another training job for YOLOv7, this time without the --sync-bn
option (I thought it might be causing a bug, since it was the only difference between the v5 and v7 train jobs), but still got the same bizarre-looking graph for "val Objectness".
However, despite what I read about the --sync-bn
option here, https://github.com/ultralytics/yolov5/issues/475, the job took more time to run without it than with it activated. The train job was actually stopped by the scheduler after running out of the requested time of 50 hours (which was usually plenty for other runs). But this is outside the scope of the current issue for now.
"SyncBatchNorm could increase accuracy for multiple gpu training, however, it will slow down training by a significant factor. It is only available for Multiple GPU DistributedDataParallel training. It is best used when the batch-size on each GPU is small (<= 8)"
This might have been solved in the meantime. I did a recent git pull
before running YOLOv7 again, and the trend in the problematic graph now starts to look more like expected - see https://github.com/stark-t/PAI/issues/55#issuecomment-1280018312
Hi @stark-t , our first YOLOv7 job just finished with a run time of 1 day 11 h for a model with the weights
yolov7-w6.pt
, batch size 4, and the "SyncBatchNorm" option. This was part of the job script:
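The job script itself is not reproduced here, so as a rough sketch only (not the actual script): a multi-GPU DDP launch of YOLOv7 with these settings could look something like the command below. The GPU count, the script name (YOLOv7 uses train_aux.py for the w6 models), and the data path are illustrative assumptions; --weights, --batch-size, --device and --sync-bn are the usual train-script flags.

```shell
# Hypothetical DDP launch -- GPU count, script name and data path are
# assumptions, not copied from the actual job script.
python -m torch.distributed.run --nproc_per_node 2 train_aux.py \
    --weights yolov7-w6.pt \
    --batch-size 4 \
    --sync-bn \
    --device 0,1 \
    --data data/custom.yaml
```

Note that --sync-bn only takes effect in multi-GPU DistributedDataParallel mode, which matches the quoted guidance above.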
Here is my concern: the "val/obj_loss" figure in the results.png file is increasing instead of decreasing with the number of epochs, while the other figures look "normal":
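One way to confirm the upward trend numerically, rather than eyeballing results.png, is to parse the run's results file directly. This is a sketch under assumptions: the results.csv file name and the space-padded "val/obj_loss" column header follow YOLOv5's output format and may differ for a YOLOv7 run.

```python
import csv

def val_obj_loss_series(path):
    """Read a YOLOv5-style results.csv and return the per-epoch
    val/obj_loss values. Column headers are stripped because
    YOLOv5 pads them with spaces."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        reader.fieldnames = [name.strip() for name in reader.fieldnames]
        return [float(row["val/obj_loss"]) for row in reader]

def is_increasing(series):
    """True if the series ends higher than it starts."""
    return len(series) >= 2 and series[-1] > series[0]
```

Calling `is_increasing(val_obj_loss_series("runs/train/exp/results.csv"))` (path illustrative) would report whether the validation objectness loss really drifts upward over the run.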
For reference, here is a results.png file from a YOLOv5 run with the
s
weights on the same dataset. This ran in 15 hours (same GPU settings), but with no "SyncBatchNorm" option and a batch size of 8: