stark-t / PAI (Pollination_Artificial_Intelligence)

YOLOv7 - results.png - val/obj_loss trend line increases instead of decreasing, unlike in YOLOv5's case #37

Closed: valentinitnelav closed this issue 1 year ago

valentinitnelav commented 2 years ago

Hi @stark-t, our first YOLOv7 job just finished, with a run time of 1 day 11 h for a model using the yolov7-w6.pt weights, a batch size of 4 per GPU (32 across the 8 GPUs) and the "SyncBatchNorm" option.

This was part of the job script:

# Notes added after the run: --workers should be 3 (not 6) and the run name
# should say w6 (not n6); both corrections have since been applied. The
# hyperparameter file has also been replaced with a custom one.
python -m torch.distributed.launch --nproc_per_node 8 train.py \
--sync-bn \
--weights ~/PAI/detectors/yolov7/weights_v0_1/yolov7-w6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/detectors/yolov7/data/hyp.scratch.p5.yaml \
--epochs 300 \
--batch-size 32 \
--img-size 1280 1280 \
--workers 6 \
--name yolov7_n6_b8_e300_hyp_p5

Here is my concern: the "val/obj_loss" panel in the results.png file increases instead of decreasing with the number of epochs, while the other panels look normal:

[figures: YOLOv7 results.png and confusion matrix]
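
As a side note, here is a minimal sketch of how one could re-plot this curve directly from the run logs. It assumes the run directory contains a results.csv with a "val/obj_loss" column (newer YOLOv5 versions write one; YOLOv7 may only write an un-headered results.txt, in which case the column index has to be checked by hand), and the runs/train/ path prefix is also an assumption:

import pandas as pd
import matplotlib.pyplot as plt

# Assumed path: "runs/train/" prefix plus the run name from the command above.
# Adjust if the run only produced an un-headered results.txt.
results = pd.read_csv("runs/train/yolov7_n6_b8_e300_hyp_p5/results.csv")
results.columns = results.columns.str.strip()  # headers are padded with spaces

# Plot the validation objectness loss per epoch to inspect the trend.
plt.plot(results.index, results["val/obj_loss"])
plt.xlabel("epoch")
plt.ylabel("val/obj_loss")
plt.title("Validation objectness loss")
plt.show()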

For reference, here is the results.png from a YOLOv5 run with the yolov5s6 weights on the same dataset. This run took 15 hours (same GPU settings), but without the "SyncBatchNorm" option and with a batch size of 8 per GPU (64 across the 8 GPUs):

python -m torch.distributed.launch --nproc_per_node 8 train.py \
--weights ~/PAI/detectors/yolov5/weights_v6_1/yolov5s6.pt \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp hyp.scratch-med.yaml \
--epochs 300 \
--batch-size 64 \
--imgsz 1280 \
--workers 6 \
--name p1_w-s6_hyp-med_8b_300e

[figures: YOLOv5 results.png and confusion matrix]

valentinitnelav commented 2 years ago

I ran another training job for YOLOv7, this time without the --sync-bn option (I thought it might be causing the bug, since it was the only difference between the v5 and v7 training jobs), but I still got the same bizarre-looking graph for "val Objectness".

However, despite what I read about the --sync-bn option here, https://github.com/ultralytics/yolov5/issues/475, training took more time without this option than with it. The job was actually stopped by the scheduler after running out of the requested 50 hours (which was usually plenty for other runs). But this is outside the scope of the current issue for now.

"SyncBatchNorm could increase accuracy for multiple gpu training, however, it will slow down training by a significant factor. It is only available for Multiple GPU DistributedDataParallel training. It is best used when the batch-size on each GPU is small (<= 8)"

valentinitnelav commented 1 year ago

This might have been solved in the meantime. I did a recent git pull before the latest YOLOv7 run, and the trend in the problematic graph now looks more like expected; see https://github.com/stark-t/PAI/issues/55#issuecomment-1280018312