roboflow / supervision

We write your reusable computer vision tools. 💜
https://supervision.roboflow.com
MIT License

Add a thresholding API. #632

Open GeorgePearse opened 11 months ago

GeorgePearse commented 11 months ago

Search before asking

Description

Create a simple API to find the best thresholds to maximise some metric (f1-score, precision, recall), given an annotated dataset and a model.

At the minute I use the repo below, because it's the only one I've found that calculates what I need in a reasonable time frame.

https://github.com/yhsmiley/fdet-api

Use case

Anyone wanting to deploy models without manual thresholding (or viewing graphs).

Additional

No response

Are you willing to submit a PR?

RigvedRocks commented 11 months ago

I'd like to help by submitting a PR

SkalskiP commented 10 months ago

Hi @GeorgePearse and @RigvedRocks 👋🏻! Thanks for your interest in supervision. I'm sorry I haven't been responsive recently. Before Christmas I was busy with duties unrelated to supervision, and I've been off for the last few days.

The idea looks interesting. @RigvedRocks could you share some initial ideas regarding implementation?

RigvedRocks commented 10 months ago

I was thinking of using basic ML techniques such as the ROC curve or Youden's J statistic, but the approach outlined above by @GeorgePearse also works for me. I can collaborate with @GeorgePearse on this issue if he'd like.
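
For context, Youden's J statistic picks the threshold that maximises J = sensitivity + specificity - 1 (equivalently TPR - FPR) along a ROC curve. A minimal sketch of that idea, assuming per-class binary match labels and confidence scores have already been extracted (the function name and the scikit-learn dependency are illustrative, not part of supervision):

import numpy as np
from sklearn.metrics import roc_curve

def youden_j_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # J = TPR - FPR; the optimal cut-off maximises J over all candidate thresholds
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return float(thresholds[np.argmax(tpr - fpr)])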

GeorgePearse commented 9 months ago

I'd really like to do what I can to keep this ticking over. @SkalskiP, do you also think it's valuable? I'm always surprised by the lack of open-source implementations for this, and assume that every company just has its own fix.

@RigvedRocks we could do something like this: I try to create a branch with a "workable" solution based on fdet-api, but starting from the supervision format, and you take it from there? Let me know if that interests you.

@josephofiowa, also curious to hear your thoughts. I used to do this with some voxel51 code (they have a method from which you can get all of the matching predictions for a given IoU), but it was painfully slow.

I keep assuming a "good" solution must exist, but I think the emphasis on threshold-agnostic metrics (mAP, etc.) in academia means it doesn't get much attention.

SkalskiP commented 9 months ago

Hi @GeorgePearse 👋🏻 I like the idea and I'd love to look at your initial implementation. If possible, I want the solution:

Such a solution requires a lot of steps, so I need to understand how we can combine it with what we have and how to design the next elements to be as reusable as possible. We will also need to come up with a better name for this task and a better name for the feature. Haha

GeorgePearse commented 9 months ago

Yeah, that all makes sense. Tbh, the reason I want it integrated into supervision is to solve those very problems. At the minute I'm dealing with a lot of opaque code, and I only trust its outputs because I've visually inspected the predictions from lots of model/threshold combos that have used it.

As for the API question:

Just something like

# Ideally target_metric could also be a callback so that a user could customise exactly what they want
# to optimize for
per_class_thresholds: dict = optimize_thresholds(
    predictions_dataset,
    annotations_dataset, 
    target_metric='f1_score', 
    per_class=True,
    minimum_iou=0.75,
) 
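To make the callback idea in the comment above concrete, here is a hedged sketch of what a custom target_metric callable might look like (the signature is purely illustrative, not an agreed design):

def recall_weighted_score(precision: float, recall: float) -> float:
    # example: an F-beta style score that weights recall twice as heavily as precision
    beta = 2.0
    return (1 + beta**2) * precision * recall / max(beta**2 * precision + recall, 1e-9)
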
SkalskiP commented 9 months ago

And what is stored inside per_class_thresholds? Dict[int, float] - class id to optimal IoU mapping?

What's inside optimize_thresholds? I'd appreciate any pseudocode.

GeorgePearse commented 9 months ago

Class id to optimal score. The minimum IoU to count a prediction/annotation pair as a match is set upfront by the user. Is that not by far the more common use case when shipping ML products? The minimum IoU is defined by business/product requirements and can easily enough be chosen visually on a handful of examples. Maybe I'm biased by having mostly trained models where localisation is of secondary importance to classification, and a much, much easier problem.
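
For illustration, the mapping described here is a plain dict from class id to an optimal confidence threshold, and applying it per class is straightforward. A minimal sketch, assuming a supervision Detections-like object that exposes class_id and confidence arrays and supports boolean-mask indexing (the helper name and example values are hypothetical):

import numpy as np
import supervision as sv

# hypothetical output of the proposed optimize_thresholds call
per_class_thresholds = {0: 0.62, 1: 0.41}  # class id -> optimal confidence threshold

def filter_by_class_threshold(
    detections: sv.Detections,
    thresholds_by_class: dict,
    default_threshold: float = 0.5,
) -> sv.Detections:
    # look up each prediction's class-specific confidence threshold
    per_prediction = np.array([
        thresholds_by_class.get(int(class_id), default_threshold)
        for class_id in detections.class_id
    ])
    # keep only predictions that clear the threshold for their class
    return detections[detections.confidence >= per_prediction]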

GeorgePearse commented 9 months ago

Complete pseudocode:

import numpy as np
import pandas as pd

metrics = []
for class_name in class_list:
    for threshold in np.linspace(0, 1, 100):
        # calculate_metric is a placeholder for whatever computes the metric
        # from the matched predictions at this class/threshold combination
        current_metric = calculate_metric(
            grid_of_matched_predictions_and_their_scores,
            class_name=class_name,
            score_threshold=threshold,
            metric='f1_score',
        )
        metrics.append({
            'threshold': threshold,
            'class_name': class_name,
            'metric': current_metric,
        })

metrics_df = pd.DataFrame(metrics)

# one row per class: for each class, keep the threshold with the best metric
best_metrics = metrics_df.loc[metrics_df.groupby('class_name')['metric'].idxmax()]

But everything probably needs to be calculated in numpy so it isn't painfully slow.
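
To illustrate the numpy point, here is a hedged sketch of a vectorised F1 sweep for a single class, assuming the IoU matching step has already produced a boolean true-positive flag per prediction (all names are illustrative, not existing supervision API):

import numpy as np

def best_f1_threshold(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    # sort predictions by descending confidence; each position is a candidate cut-off
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])     # cumulative true positives at each cut-off
    fp = np.cumsum(~is_tp[order])    # cumulative false positives at each cut-off
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / max(num_gt, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-9)
    best = int(np.argmax(f1))
    # confidence of the lowest-scoring prediction kept, and the F1 it achieves
    return float(scores[order][best]), float(f1[best])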

There's a decent chance that this is where most people currently get this data from: https://github.com/rafaelpadilla/Object-Detection-Metrics, but the repo is what you'd expect of something 5-6 years old, and doesn't have the usability/documentation of a modern open-core project.

GeorgePearse commented 9 months ago

This is what using the fdet-api looks like for me at the minute:

import pandas as pd

thresholds = []
thresholds_dict = {}
f1_score_dict = {}

# cocoEval is the fdet-api COCO-style evaluator, already run on the predictions
# and annotations; annotation_class_names is the ordered list of class names
for class_idx, _ in enumerate(annotation_class_names):
    (
        class_name,
        fscore,
        conf,
        precision,
        recall,
        support,
    ) = cocoEval.getBestFBeta(
        beta=1, iouThr=0.5, classIdx=class_idx, average="macro"
    )
    class_threshold_dict = {
        "class_name": class_name,
        "fscore": fscore,
        "conf": conf,
        "precision": precision,
        "recall": recall,
        "support": support,
    }
    f1_score_dict[class_name] = fscore
    thresholds.append(class_threshold_dict)
    thresholds_dict[class_name] = conf

thresholds_df = pd.DataFrame(thresholds)
print(thresholds_df)

So I end up with both the threshold that optimises the metric I care about and the metric values that the threshold achieves.

SkalskiP commented 9 months ago

Understood. This sounds interesting to me. I'm worried about scope, especially if we want to reimplement all metrics.

RigvedRocks commented 9 months ago

@GeorgePearse Fine by me. You can create a new branch and then I can refine your initial solution.

GeorgePearse commented 9 months ago

From a look through the metric functionality already implemented, it looks like it wouldn't be too painful to add. The object that comes out of it looks like it has already done most of the upfront maths needed.

Hard for me to tell just from a look: does the output structure contain the scores?

[screenshot attached]

SkalskiP commented 9 months ago

@GeorgePearse, could you be a bit more specific? What do you mean by scores?