
How to use Metrics for a comparison of yolov11 with models from other repos #19057

Open asusdisciple opened 6 days ago

asusdisciple commented 6 days ago

Search before asking

Question

First of all, thanks for providing this great library and making all of it open source. People like you really move the ML community forward!

My problem: I have trained a lot of YOLO11 models on different tasks and they work flawlessly so far. However, for benchmarking I want to compare my YOLO models with my old legacy models. Evaluation metrics (for example mAP) are well defined in general, but they can have subtle implementation differences between repositories.

That's why I would like to use the ultralytics metrics to evaluate both models, but it's hard to find the code where this is done. For example, the IoU implementation for bounding boxes is easy to find, but I can't find the logic for mAP@0.5, or how you decide which boxes are selected (for example taking the max confidence, the max IoU, or something else when two boxes are very close to the ground truth).

Since my legacy model just provides a list of bounding boxes, ideally I would like to call something like metrics.map50(list_of_boxes, gt_boxes), if that's possible.

I hope somebody can point me in the right direction.

Additional

No response

UltralyticsAssistant commented 6 days ago

👋 Hello @asusdisciple, thank you for your interest in Ultralytics 🚀! We appreciate your thoughtful question! The ability to compare models using standardized metrics is definitely important for robust evaluations.

We recommend checking out the Docs for details on evaluation metrics and examples for Python and CLI usage. This includes how mAP and other metrics are calculated to give you an understanding of the evaluation processes.

For your particular inquiry into metrics like mAP@0.5, the exact methodology, and how bounding boxes are matched, the core logic is integrated into the model's validation/evaluation pipelines. While the library does not currently expose a direct callable like metrics.map50(list_of_boxes, gt_boxes) out of the box, you can explore the code in the val.py script, where the evaluation logic resides.
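
For orientation, here is a minimal sketch of running that validation pipeline from Python, assuming a trained detection model and a dataset YAML; the coco8.yaml name and the printed attributes are illustrative, so check the Docs for your installed version:

from ultralytics import YOLO

# Load a trained YOLO11 detection model (use your own weights here)
model = YOLO("yolo11n.pt")

# Run the built-in validation pipeline; mAP and related metrics are computed here
metrics = model.val(data="coco8.yaml")

# Aggregated detection metrics are exposed on the returned object
print(metrics.box.map50)  # mAP@0.5
print(metrics.box.map)    # mAP@0.5:0.95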

If you need further guidance:

Upgrade

As always, ensure you are running the latest version of ultralytics in a compatible environment, so that issues which have already been fixed are ruled out:

pip install -U ultralytics

Environments

YOLO models can be tested in any of the verified environments listed in the Docs, such as free-GPU notebooks (Colab, Kaggle), cloud deep-learning VMs, or the official Docker image.

You might also want to explore ways to adapt and test different models or outputs to YOLO standards for accurate evaluations.

Community and Support

For real-time discussions, join our Discord community 🎧. Alternatively, engage on our Discourse forum or Subreddit for in-depth model comparisons and evaluations.

An Ultralytics engineer will review your issue and provide additional assistance soon 🙂

glenn-jocher commented 5 days ago

@asusdisciple thank you for your kind words and question! For metric implementation details, we recommend reviewing our validation code in the BaseValidator class (source) and metrics calculations in metrics.py. To compare models fairly:

  1. Use identical validation datasets and settings (imgsz, conf, iou thresholds)
  2. Convert legacy model outputs to YOLO format (xywh, normalized); see the conversion sketch after this list
  3. Process both through the same validation pipeline
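
A minimal sketch of step 2, assuming the legacy model emits pixel-space (x1, y1, x2, y2) boxes; the helper name and argument layout below are illustrative and not part of the ultralytics API (ultralytics.utils.ops also provides converters such as xyxy2xywhn that you may prefer):

import torch

def legacy_xyxy_to_yolo_xywhn(boxes_xyxy: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    # boxes_xyxy: [N, 4] pixel coordinates as (x1, y1, x2, y2)
    b = boxes_xyxy.float()
    xc = (b[:, 0] + b[:, 2]) / 2 / img_w  # normalized center x
    yc = (b[:, 1] + b[:, 3]) / 2 / img_h  # normalized center y
    w = (b[:, 2] - b[:, 0]) / img_w       # normalized width
    h = (b[:, 3] - b[:, 1]) / img_h       # normalized height
    return torch.stack([xc, yc, w, h], dim=1)  # [N, 4] in normalized YOLO xywh format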

Our metrics follow standard implementations but with an optimized PyTorch backend. For exact implementation details, please refer to the linked source code sections. For commercial-use comparisons, ensure compliance with our licensing terms at https://ultralytics.com/license.

Y-T-G commented 5 days ago

You can use this:

from ultralytics import YOLO, ASSETS
model = YOLO("yolo11n.pt")
results = model(ASSETS / "bus.jpg")

from ultralytics.utils.metrics import DetMetrics
from ultralytics.models.yolo.detect.val import DetectionValidator

metrics = DetMetrics()
val = DetectionValidator()
metrics.names = model.names  # should be the dictionary of classes

num_images = 1
for i in range(num_images):
  # Process each image's predictions and ground truth separately.
  # We use the model's predictions here, but these can be your saved predictions.
  # Shape [N, 6] as (x1, y1, x2, y2, conf, cls). Type: Tensor.
  boxes = results[0].boxes.data.cpu()
  pred_conf = boxes[:, 4]
  pred_cls = boxes[:, 5]
  # Should be your ground truth. The predictions are reused as ground truth here only as an example.
  gt_cls = pred_cls  # Shape [N]
  gt_boxes = boxes[:, :4]  # Shape [N, 4]
  tp = val._process_batch(boxes, gt_boxes, gt_cls).int()
  metrics.process(tp, pred_conf, pred_cls, gt_cls)

# This will print all the metrics
print(metrics.results_dict)
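
If you only need a single number after calling metrics.process, the DetMetrics object should also expose attributes like metrics.box.map50 and metrics.box.map; check metrics.py in your installed version, since attribute names can change between releases.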