Ultralytics YOLO11 🚀
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Wrong Metrics Calculated During Hyperparameter Tuning #14693

Open mateuszwalo opened 2 months ago

mateuszwalo commented 2 months ago

Search before asking

YOLOv8 Component

Hyperparameter Tuning

Bug

I discovered an error during the hyperparameter tuning process. The metrics reported during this phase are calculated incorrectly. Here are several screenshots and code snippets that demonstrate this issue:

Tuner: 9/100 iterations complete ✅ (2757.64s)
Tuner: Results saved to runs/detect/tune5
Tuner: Best fitness=0.93986 observed at iteration 6
Tuner: Best fitness metrics are {'metrics/precision(B)': 0.98181, 'metrics/recall(B)': 0.96721, 'metrics/mAP50(B)': 0.98822, 'metrics/mAP50-95(B)': 0.93448, 'val/box_loss': 0.32963, 'val/cls_loss': 0.25273, 'val/dfl_loss': 1.00118, 'fitness': 0.93986}
Tuner: Best fitness model is runs/detect/train20
Tuner: Best fitness hyperparameters are printed below.
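As a side note, the `fitness` value in the log appears to be the weighted combination Ultralytics uses for detection metrics, 0.1 × mAP50 + 0.9 × mAP50-95; a quick check against the reported numbers:

```python
# Reported best-fitness metrics from the tuner log above.
metrics = {
    "metrics/mAP50(B)": 0.98822,
    "metrics/mAP50-95(B)": 0.93448,
}

# Assumed weighting (0.1 * mAP50 + 0.9 * mAP50-95), which reproduces
# the logged fitness of 0.93986 to within rounding.
fitness = 0.1 * metrics["metrics/mAP50(B)"] + 0.9 * metrics["metrics/mAP50-95(B)"]
```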

Printing 'runs/detect/tune5/best_hyperparameters.yaml'

lr0: 0.00712
lrf: 0.00788
momentum: 0.87901
weight_decay: 0.0004
warmup_epochs: 2.15061
warmup_momentum: 0.53286
box: 6.03385
cls: 0.42
dfl: 1.88721
hsv_h: 0.01581
hsv_s: 0.52275
hsv_v: 0.46993
degrees: 0.0
translate: 0.07884
scale: 0.39232
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.57013
mosaic: 0.85197
mixup: 0.0
copy_paste: 0.0
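For reference, each line of `best_hyperparameters.yaml` is a simple `key: value` pair, so the file parses into a plain mapping that can be reused for a fresh training run. A minimal dependency-free sketch with a few of the values inlined (a real workflow would call `yaml.safe_load` on the saved file):

```python
# A few entries from the file above, inlined for illustration.
text = """\
lr0: 0.00712
lrf: 0.00788
momentum: 0.87901
weight_decay: 0.0004
"""

# Every line here is a flat "key: value" pair, so a trivial split suffices;
# use yaml.safe_load for the real file.
hyp = {k: float(v) for k, v in (line.split(": ") for line in text.strip().splitlines())}

# The resulting dict can then be splatted into a new training run, e.g.:
# model.train(data="data.yaml", epochs=100, **hyp)  # hypothetical paths
```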

After opening runs/detect/train20 to view the confusion matrix for the model's predictions, I obtained the following:

(screenshot: confusion matrix from runs/detect/train20)

Based on our knowledge from data exploration, we want to calculate Precision and Recall, where

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

After substituting the confusion matrix results into these formulas, we obtain:

Precision = 0.983
Recall = 0.959
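The hand calculation can be sketched in Python. The TP/FP/FN counts below are illustrative stand-ins (the real counts are in the screenshot, not in this text), chosen only so that the ratios match the values reported above:

```python
def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of ground-truth objects that were detected.
    return tp / (tp + fn)

# Illustrative counts (NOT read from the screenshot), chosen so the
# ratios come out near the hand-computed 0.983 and 0.959.
tp, fp, fn = 940, 16, 40
p = precision(tp, fp)
r = recall(tp, fn)
```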

These results differ from the ones reported during tuning ('precision(B)': 0.98181, 'recall(B)': 0.96721), which can be misleading. This bug occurs every time I perform hyperparameter tuning and affects only the metrics calculated from the confusion matrix. The example above uses the best-fitness model; other models with different parameters show even larger gaps between the metrics YOLO reports and the actual results.

Environment

No response

Minimal Reproducible Example

from ultralytics import YOLO

model = YOLO(f"{HOME}/runs/detect/train/weights/best.pt")
data = f"{dataset.location}/data.yaml"
model.tune(data=data, epochs=10, iterations=100, optimizer="AdamW", plots=False, save=False, val=False)

Additional

No response

Are you willing to submit a PR?

github-actions[bot] commented 2 months ago

👋 Hello @mateuszwalo, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Y-T-G commented 2 months ago

The reported recall and precision are computed at the threshold that produces the best F1-score. They won't match the generated confusion matrix, because the confusion matrix uses a conf threshold of 0.25 and an IoU threshold of 0.45.
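This point can be illustrated with a self-contained toy example (synthetic confidences, not real YOLO outputs): the precision/recall pair at the threshold that maximizes F1 generally differs from the pair at a fixed threshold like 0.25, even for the same set of predictions.

```python
# Synthetic detections: (confidence, correct?) pairs, against 10 ground-truth objects.
preds = [(0.95, True), (0.90, True), (0.85, True), (0.80, False),
         (0.70, True), (0.60, True), (0.50, False), (0.40, True),
         (0.30, True), (0.20, False), (0.10, True)]
n_gt = 10

def pr_at(thresh):
    # Precision/recall when only detections with conf >= thresh are kept.
    kept = [ok for conf, ok in preds if conf >= thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    p = tp / (tp + fp) if kept else 1.0
    return p, tp / n_gt

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Metrics at a fixed threshold, as the confusion matrix uses (conf=0.25):
p_fixed, r_fixed = pr_at(0.25)

# Metrics at the confidence threshold that maximizes F1, as in the
# reported precision/recall:
best_t = max((conf for conf, _ in preds), key=lambda t: f1(*pr_at(t)))
p_best, r_best = pr_at(best_t)
```

With these numbers the fixed-threshold pair and the best-F1 pair disagree, which is the same shape of mismatch the issue describes between the tuner's metrics and the confusion matrix.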