ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

inconsistent validation result #9319

Closed twangnh closed 2 years ago

twangnh commented 2 years ago

Search before asking

Question

Hi, I tried evaluating the same model multiple times. I did not modify any part of the code, but the results are not the same across different runs. Could anyone help give a hint?

Additional

No response

glenn-jocher commented 2 years ago

👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to start investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

For Ultralytics to provide assistance your code should also be:

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

ZoellaUber commented 2 years ago

Hi there! I had the same problem. It's normal for YOLOv5 validation results to vary slightly between runs, even with the same weights.

You might need to round off the decimal values before comparing them.
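Building on the rounding suggestion, a comparison between two runs can tolerate small numeric noise like this. Note this is only a sketch: the mAP values and the `metrics_match` helper are hypothetical, not part of YOLOv5.

```python
# Compare per-run validation metrics with a rounding tolerance rather than
# exact equality. The metric values below are hypothetical placeholders for
# two runs' mAP@0.5 on the same weights.
map50_run1 = 0.50412
map50_run2 = 0.50396

def metrics_match(a, b, decimals=3):
    """Return True if two metric values agree after rounding to `decimals` places."""
    return round(a, decimals) == round(b, decimals)

print(metrics_match(map50_run1, map50_run2))  # True at 3 decimals
```

Whether 3 decimals is an appropriate tolerance depends on your dataset size; larger validation sets usually show smaller run-to-run variance.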

glenn-jocher commented 2 years ago

👋 Hello! Thanks for asking about training reproducibility. YOLOv5 🚀 uses a single training seed which is set here using the init_seeds() function. CPU and Single-GPU trainings should be fully reproducible with torch>=1.12.0. Multi-GPU DDP trainings are still not reproducible unfortunately. This is an open issue for us and we could use any help in tracking down this problem. https://github.com/ultralytics/yolov5/blob/7215a0fb41a90d8a0bf259fa708dff608a1f0262/train.py#L104

This function sets python, numpy and torch seeds and updates cudnn settings: https://github.com/ultralytics/yolov5/blob/7215a0fb41a90d8a0bf259fa708dff608a1f0262/utils/general.py#L198-L214
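For reference, the seeding logic at the linked lines is roughly equivalent to the following sketch (names mirror the YOLOv5 source; exact behavior depends on your torch version, and the deterministic branch requires torch>=1.12):

```python
import os
import random

import numpy as np
import torch

def init_seeds(seed=0, deterministic=False):
    # Seed the Python, NumPy and PyTorch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # for Multi-GPU setups
    if deterministic:
        # Prefer deterministic kernels; warn instead of raising when an op
        # has no deterministic implementation (torch>=1.12)
        torch.use_deterministic_algorithms(True, warn_only=True)
        torch.backends.cudnn.deterministic = True
        os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
        os.environ['PYTHONHASHSEED'] = str(seed)
```

Calling `init_seeds(0)` before each run should make CPU and single-GPU results repeatable, while DDP runs may still diverge as noted above.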

To set a new training seed for example:

python train.py --seed 3  # default seed=0

Note that even when using the same seed trainings may produce different results, especially when using CUDA backends. See https://pytorch.org/docs/stable/notes/randomness.html for details on factors affecting PyTorch training reproducibility and sources of randomness.
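As those PyTorch randomness notes describe, you can also ask torch to prefer deterministic kernels globally; a minimal sketch of the relevant flags (API as of torch>=1.12):

```python
import torch

# Request deterministic implementations; warn_only=True logs a warning
# instead of raising when an op has no deterministic CUDA kernel.
torch.use_deterministic_algorithms(True, warn_only=True)

# The cudnn autotuner benchmarks kernels at runtime, which is itself a
# source of nondeterminism, so disable it for reproducible runs.
torch.backends.cudnn.benchmark = False

print(torch.are_deterministic_algorithms_enabled())  # True
```

Even with these flags set, results can still differ across hardware, driver and library versions, as the linked notes explain.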


Good luck 🍀 and let us know if you have any other questions!

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

vlesu commented 2 years ago

I had a similar problem. This command on my machine produced slightly DIFFERENT result lines on my NVIDIA GPU:

for i in {1..15}; do python val.py --weights yolov5x.pt --data coco.yaml --img 640 --verbose --save-txt > 1.txt 2>&1 && cat 1.txt | grep person ; done
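The same repeatability check can be scripted in Python. This is a sketch: the `class_lines` and `runs_identical` helpers are hypothetical, and in practice you would fill `outputs` by capturing `subprocess.run([...val.py...], capture_output=True, text=True).stdout` for each run, mirroring the shell loop above.

```python
def class_lines(stdout, name='person'):
    """Extract the per-class result lines mentioning `name` from val.py output."""
    return [line for line in stdout.splitlines() if name in line]

def runs_identical(outputs, name='person'):
    """True if the `name` class lines are byte-identical across all run outputs."""
    lines = [class_lines(o, name) for o in outputs]
    return all(l == lines[0] for l in lines)

# Hypothetical captured stdout snippets from two val.py runs:
run_a = "all 5000 36335 0.743\nperson 5000 10777 0.857"
run_b = "all 5000 36335 0.743\nperson 5000 10777 0.857"
print(runs_identical([run_a, run_b]))  # True on a reproducible setup
```

On a fully reproducible setup every run should print identical class lines; any difference points to a nondeterminism source such as the CUDA issues discussed below.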

After two weeks of trials I changed the following on my computer:

Now everything is working properly, yeah! The lines are exactly the same!

I am not sure, but I suspect there may be reproducibility problems in CUDA operations when the compiled kernel-module driver version mismatches the CUDA version, as a side effect of using the "runtime" CUDA installer (which I used before).

Hope this helps.

glenn-jocher commented 2 years ago

@vlesu got it, thanks for your feedback! Yes, some CUDA ops in torch may have reproducibility issues that CPU ops do not, but in general operations should be reproducible.