Is it possible that my model never converges

ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

https://docs.ultralytics.com

GNU Affero General Public License v3.0

Is it possible that my model never converges #9864

Closed arun-gautham closed 1 year ago

arun-gautham commented 2 years ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

I am training a model that never seems to finish; best.pt was generated about 200 epochs ago.

python train.py --img 1280 --batch 4 --epochs 4000 --hyp /content/yolov5/data/hyps/xxxxx.yaml --data {dataset.location}/data.yaml --resume /last.pt --name --exist-ok

However, note that I never got 100 iterations (epochs) in a single session, because of Colab session limits.

Do the 100 iterations needed to stop have to happen within one session, or can they be spread across multiple sessions?

This is the first time I am facing this issue; earlier we reached the best in about 150 epochs.

Additional

No response

MartinPedersenpp commented 2 years ago

https://github.com/ultralytics/yolov5/blob/6371de8879e7ad7ec5283e8b95cc6dd85d6a5e72/utils/torch_utils.py#L380-L401

I know that best_fitness gets stored in the last.pt file, so I assume that all of the EarlyStopping parameters get stored as well. Can you post the latest 100 lines of your results.csv and check that there has been no improvement in the last 100 epochs?
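The check requested above can be sketched as a small script. This is a hedged illustration, not part of the YOLOv5 codebase: the function name `epochs_since_best` is hypothetical, and the column names (`epoch`, `metrics/mAP_0.5`, `metrics/mAP_0.5:0.95`) and the 0.1/0.9 fitness weighting are assumptions about how results.csv and the fitness value are defined; adjust them to match your file.

```python
import csv
import io


def epochs_since_best(csv_text, w50=0.1, w5095=0.9):
    """Return (best_epoch, epochs_without_improvement) for results.csv content.

    Assumptions (verify against your own file): the header contains 'epoch',
    'metrics/mAP_0.5' and 'metrics/mAP_0.5:0.95', possibly padded with spaces,
    and fitness is a weighted sum of the two mAP columns.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [h.strip() for h in next(reader)]  # strip padding around column names
    i_epoch = header.index("epoch")
    i_50 = header.index("metrics/mAP_0.5")
    i_5095 = header.index("metrics/mAP_0.5:0.95")

    best_fitness, best_epoch, last_epoch = -1.0, None, None
    for row in reader:
        if not row:  # skip blank lines
            continue
        epoch = int(row[i_epoch])
        fitness = w50 * float(row[i_50]) + w5095 * float(row[i_5095])
        if fitness >= best_fitness:  # ties count as the newer best
            best_fitness, best_epoch = fitness, epoch
        last_epoch = epoch

    if best_epoch is None:
        raise ValueError("no data rows found")
    return best_epoch, last_epoch - best_epoch


# Tiny made-up sample, only to show the call; these are not real results.
sample = """epoch, metrics/mAP_0.5, metrics/mAP_0.5:0.95
0, 0.10, 0.05
1, 0.50, 0.30
2, 0.45, 0.28
3, 0.48, 0.29
"""
result = epochs_since_best(sample)
print(result)
```

For a real run, pass the file content instead, for example `epochs_since_best(open("results.csv").read())`. If the second value is at least the patience setting, early stopping would have been expected to trigger in an uninterrupted run.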

arun-gautham commented 2 years ago

results.csv

Attached is the csv file

MartinPedersenpp commented 2 years ago

Okay, so the best fitness value was reached around epoch 160, which suggests that resuming might not pass values to the early-stopping feature.

Edit: The EarlyStopping utility is not passed the best fitness when resuming, so if you run multiple "short" sessions, each shorter than the patience value, you will never reach the patience point. One could open a PR to fix this.
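The behaviour described above can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the YOLOv5 source: `EarlyStopper` is a simplified stand-in for a patience-based stopper, the constructor arguments `best_fitness` and `best_epoch` are a hypothetical way of seeding it from a checkpoint, and the fitness values and session lengths are made up for illustration.

```python
class EarlyStopper:
    """Simplified patience-based stopper (illustrative stand-in, not library code)."""

    def __init__(self, patience=100, best_fitness=0.0, best_epoch=0):
        self.patience = patience
        self.best_fitness = best_fitness  # hypothetical: could be restored on resume
        self.best_epoch = best_epoch      # hypothetical: could be restored on resume

    def __call__(self, epoch, fitness):
        if fitness >= self.best_fitness:
            self.best_fitness = fitness
            self.best_epoch = epoch
        # Stop once `patience` epochs have passed without a new best.
        return epoch - self.best_epoch >= self.patience


def plateau_fitness(epoch):
    """Made-up fitness curve: slowly declining, always below the earlier best of 0.80."""
    return 0.70 - 1e-4 * (epoch - 200)


def simulate(sessions, make_stopper):
    """Run consecutive sessions, each with a new stopper; return the stop epoch or None."""
    for start, length in sessions:
        stopper = make_stopper()  # a resumed session builds its stopper from scratch
        for epoch in range(start, start + length):
            if stopper(epoch, plateau_fitness(epoch)):
                return epoch
    return None


# Three 60-epoch sessions, each shorter than patience=100.
# Assumed history: best fitness 0.80 was reached at epoch 160.
sessions = [(200, 60), (260, 60), (320, 60)]

# Fresh stopper per session: the first resumed epoch becomes the "best" again.
fresh_stop = simulate(sessions, lambda: EarlyStopper(patience=100))

# Seeded stopper per session: the earlier best is carried across sessions.
seeded_stop = simulate(
    sessions,
    lambda: EarlyStopper(patience=100, best_fitness=0.80, best_epoch=160),
)

print(fresh_stop, seeded_stop)
```

With a fresh stopper the no-improvement counter restarts in every session, so training never stops as long as each session is shorter than the patience value. With the seeded stopper the count continues from epoch 160 and the stop triggers once 100 epochs have passed since then.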

ExtReMLapin commented 2 years ago

It does converge, it just does it really early in the training

github-actions[bot] commented 1 year ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

glenn-jocher commented 11 months ago

@ExtReMLapin thank you for the feedback. It seems that the model converges quite early in the training process. Regarding the EarlyStopping utility, you are correct: when resuming, the best fitness value is not passed to it.

If you are considering contributing, a pull request to address this issue would be greatly appreciated.

Your contribution would benefit the YOLO community and the Ultralytics team.