Closed lostanlen closed 2 years ago
Not sure what you mean by "how you come up with decisions as to which event to keep or prune?". This function diagnoses a detection, and one thing it does is count the occurrences of split/merged positives. That said, you can use filter_detection.R to remove those ambiguous detections based on a criterion defined by the user. Split positives can make diagnose_detection.R count more TPs than there actually are.
My question is: how can I get the largest set of overlapping reference-detection pairs such that each reference is matched at most once and each detection is matched at most once?
OK, so the goal of this function is to diagnose a detection. Detections can come from ohun's own functions or from other packages/software. For optimizing detections in ohun, diagnosing is already incorporated into two functions: optimize_energy_detector and optimize_template_detector. You will see that they just iterate the corresponding detection functions (energy_detector and template_detector) over different combinations of tuning parameters and then call diagnose_detection on each iteration. For diagnosing external software detections the user has to run diagnose_detection for each detection (although this can be done independently over different detections using the argument 'by'). Decisions on how to optimize a detection are left to the users, as this depends on the goals of the detection (e.g. in some cases some FPs are OK but not in others). ohun only provides a set of diagnostics that can inform that decision.
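The shape of that optimization loop (iterate a detector over a grid of tuning parameters, diagnose each run, keep the best setting) can be sketched as below. The detect/diagnose functions here are toy stand-ins, not ohun's API:

```python
from itertools import product

# Toy stand-ins for a detector and its diagnostic (NOT ohun's API),
# just to show the shape of the optimization loop.
def detect(signal, threshold, offset):
    # "detect" every sample above the (adjusted) threshold
    return [x for x in signal if x > threshold + offset]

def diagnose(detection, reference):
    tp = len(set(detection) & set(reference))
    return {"true.positives": tp, "false.positives": len(detection) - tp}

signal = [1, 3, 5, 7]
reference = [5, 7]

# iterate the detector over all combinations of tuning parameters,
# calling the diagnostic on each iteration
results = []
for threshold, offset in product([2, 4, 6], [0.0, 0.1]):
    diag = diagnose(detect(signal, threshold, offset), reference)
    results.append({"threshold": threshold, "offset": offset, **diag})

# pick the setting with the best trade-off (the criterion is up to the user)
best = max(results, key=lambda r: r["true.positives"] - r["false.positives"])
print(best["threshold"])  # → 4
```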
About this issue:
https://github.com/maRce10/ohun/blob/master/R/diagnose_detection.R#L180
I realized forcing recall <= 1 is no longer necessary as it was dealt with earlier in the code here: https://github.com/maRce10/ohun/blob/master/R/diagnose_detection.R#L171
I am not asking about how to perform detection but about evaluating (diagnosing) it.
Let me give you an example. I have an event detector (blue segments) and a reference (black segments). Depending on how I match predictions to references, I might end up with 4 TP (top) or 3 TP (center, bottom).
If I were to pass the black and blue segments to ohun, could you guarantee that diagnose_detection will return the optimal match (4 TP) and not some suboptimal variant (3 TP or less)?
If not, it could be a problem, because it could underestimate both the recall and the precision of a detector.
OK, I see. It will tell you that you have 4 TP because 4 reference sounds are overlapped by detections:
library(ohun)

# reference
ref <- data.frame(
  sound.files = "1.wav",
  selec = 1:5,
  start = c(1, 2, 3, 4, 5),
  end = c(1.5, 2.5, 3.5, 4.5, 5.5)
)

# detection
det <- data.frame(
  sound.files = "1.wav",
  selec = 1:4,
  start = c(0.75, 1.4, 3.2, 4.25),
  end = c(1.25, 3.1, 4.1, 4.8)
)

# diagnose
diagnose_detection(reference = ref, detection = det)
But it will also tell you that you have some split and merged positives.
But this "4 reference sounds are overlapped by detections" is only an upper bound on TP, right?
What would happen in this other case? Is there a way I can get a TP = 4 in the first case and a TP = 1 in the latter?
In the latter you get TP = 4 as well, but only one detection for that sound file and 4 merged positives. But I see your point. However, I am not sure that just calling that TP = 1 is informative enough for the user. That's why I added these other metrics. Anyways, I am open to suggestions.
The first example should unambiguously be TP=4, FP=1, FN=0. And the second example should be TP=1, FP=0, FN=3. The number of true positives should never be higher than the number of positives.
My suggestion would be to frame this as a combinatorial optimization problem: specifically, bipartite graph matching.
This problem can be solved in two steps. (1) First, build the bipartite graph by listing all matching pairs. The naïve way to do this is to consider all pairs and check for overlap. If the number of events is very large, this procedure can become slow. A faster way to do this is sorted bisection search. This is what I did for the DCASE "few-shot bioacoustic event detection" task: https://github.com/c4dm/dcase-few-shot-bioacoustic/blob/main/evaluation_metrics/metrics.py#L6
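The sorted-bisection idea can be sketched as follows (a minimal Python sketch, assuming the reference events are non-overlapping and sorted by onset, so their offsets are sorted too and each detection's candidates form a contiguous run; an R port would be analogous):

```python
from bisect import bisect_left, bisect_right

def candidate_pairs(refs, dets):
    """List all (ref_index, det_index) pairs with nonzero temporal overlap.

    Assumes refs are non-overlapping and sorted by start time, so both
    their starts and their ends are sorted and bisection can replace an
    all-pairs scan.
    """
    starts = [s for s, _ in refs]
    ends = [e for _, e in refs]
    pairs = []
    for j, (ds, de) in enumerate(dets):
        lo = bisect_right(ends, ds)   # first ref ending after the detection starts
        hi = bisect_left(starts, de)  # first ref starting after the detection ends
        pairs.extend((i, j) for i in range(lo, hi))
    return pairs

# same intervals as the ref/det example earlier in the thread
refs = [(1, 1.5), (2, 2.5), (3, 3.5), (4, 4.5), (5, 5.5)]
dets = [(0.75, 1.25), (1.4, 3.1), (3.2, 4.1), (4.25, 4.8)]
print(candidate_pairs(refs, dets))
# → [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]
```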
(2) Then, run the Hopcroft-Karp algorithm to solve bipartite graph matching. mir_eval has an implementation of this in Python (_bipartite_match). There might be one already available in R, although I'm not familiar enough with R to comment.
Thanks. That seems to be useful for assigning positives to TPs, right?
First, list all candidate pairs between prediction and reference. In your case, the criterion for being a candidate pair is to have nonzero overlap. One might come up with a stricter criterion, such as having at least a 50% Intersection-over-Union (IoU) ratio (what we did at DCASE FSD), or putting an upper bound on the lag between predicted onset and reference onset, or between predicted offset and reference offset. All these choices are comprehensively covered by Annamaria Mesaros in her sed_eval package.
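Both criteria are one-liners over interval endpoints; a sketch (the 0.5 threshold mirrors the DCASE FSD choice):

```python
def overlap(a, b):
    """Nonzero temporal overlap between intervals a = (start, end) and b."""
    return a[0] < b[1] and b[0] < a[1]

def iou(a, b):
    """Intersection-over-Union ratio of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# a pair can overlap yet fail a 50% IoU criterion
a, b = (1.0, 2.0), (1.8, 2.8)
print(overlap(a, b))     # → True
print(iou(a, b) >= 0.5)  # → False (IoU = 0.2 / 1.8 ≈ 0.11)
```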
Then, among those candidates, run Hopcroft-Karp to find the maximum number of matching pairs. That number is TP.
FP = number of predicted events - TP
FN = number of reference events - TP
Precision, recall, F-measure, etc. follow accordingly.
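In code, this bookkeeping is just (a sketch; the counts below are the five-reference/four-detection example earlier in the thread, where the maximum matching has size 4):

```python
def detection_metrics(tp, n_detected, n_reference):
    """Precision/recall/F-measure from a maximum-matching TP count.

    FP and FN follow from the counts, so TP can never exceed either
    the number of detections or the number of references.
    """
    fp = n_detected - tp
    fn = n_reference - tp
    precision = tp / n_detected if n_detected else 0.0
    recall = tp / n_reference if n_reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return {"fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "f.measure": f}

# five references, four detections, maximum matching of size 4
print(detection_metrics(tp=4, n_detected=4, n_reference=5))
# → fp 0, fn 1, precision 1.0, recall 0.8, F ≈ 0.889
```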
OK, will take a look at that. Thanks!
Hello @maRce10
I am reading through label_detection.R and diagnose_detection.R and see you have defined "split positives" and "merged positives". I'm curious how you come up with decisions as to which event to keep or prune. Is your method optimal? I'm also curious as to why you need to do this:
https://github.com/maRce10/ohun/blob/master/R/diagnose_detection.R#L180
How come you have a recall above 1? Are you ever matching multiple predictions to the same reference?