mir-evaluation / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.
MIT License

Segmentation evaluation - boundary detection: to trim or not to trim / special segment labels #115

Closed grrrr closed 9 years ago

grrrr commented 9 years ago

Quite frequently, ground-truth annotations contain leading and/or trailing silence, typically (e.g. in SALAMI) annotated with special segment labels like 'Silence'. When such a special segment label appears at the front, the start time of that segment is not a real boundary that a boundary detection algorithm would find; the same holds, conversely, for the end times of trailing silent segments. Currently, it seems that the segment.detection function must be tricked into trimming consistently by introducing very short silent segments where they are not already present in the ground truth.

bmcfee commented 9 years ago

Quite frequently, ground-truth annotations contain leading and/or trailing silence, typically (e.g. in SALAMI) annotated using special segment labels like 'Silence'. When such a special segment label appears at the front, the start time of this segment does not contain a real boundary which boundary detection algorithms would find, conversely for ending times of silent end segments.

Do you mean the start or the end of the first segment?

In a well-formed annotation, the start time should be 0, which I've argued should never be counted as positive hits. Setting trim=True will discount the first and last segment boundaries: the only reason it's set to False is backwards compatibility against mirex by default.

If you mean the end time of the first segment, then you're talking about non-trivial manipulation of the annotation data, and I'd be very careful here.

Currently, it seems that the segment.detection function must be tricked by introducing very short silent segments, if those are not already present in the ground-truth, to consistently trim them, or not.

I'm not sure what you mean here: are you talking about modifying the reference annotations? Or the estimated annotations? I don't see why either should be necessary.

grrrr commented 9 years ago

Quite frequently, ground-truth annotations contain leading and/or trailing silence, typically (e.g. in SALAMI) annotated using special segment labels like 'Silence'. When such a special segment label appears at the front, the start time of this segment does not contain a real boundary which boundary detection algorithms would find, conversely for ending times of silent end segments.

Do you mean the start or the end of the first segment?

The start of the first segment, or conversely, the end of the last segment.

In a well-formed annotation, the start time should be 0, which I've argued should never be counted as positive hits. Setting trim=True will discount the first and last segment boundaries: the only reason it's set to False is backwards compatibility against mirex by default.

Exactly, but this is difficult terrain. In SALAMI, the following two variations appear equally often, here shown with (start, end, label) triplets:

(0, 0.3, 'Silence'), (0.3, 10, 'A'), ...., (120, 130, 'C'), (130, 131, 'Silence')

or

(0, 10, 'A'), ...., (120, 131, 'C')

In the first case, the annotator was very careful checking leading and trailing silence (the latter probably being a fade-out), in the second case the annotator was less pedantic.

Trimming both versions in segment.detection yields completely different sets of reference boundaries.

Assuming a typical good boundary detector detects boundary estimates at approximately 0 seconds and somewhere between 130 and 131 seconds, comparing against the two trimmed reference sets would yield very different evaluations. Right now, segment.detection is not aware that a leading or trailing part labeled 'Silence' must be handled differently from a non-silent part.
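To make the point concrete, here is a minimal sketch (with the middle segments elided; the helper names are hypothetical and this is not mir_eval's implementation) showing that trimming the two annotation variants yields different reference boundary sets:

```python
# Sketch: trimming the two SALAMI-style annotation variants yields
# different reference boundary sets.

def boundaries(intervals):
    """Collapse (start, end, label) triplets into a sorted list of boundary times."""
    times = set()
    for start, end, _ in intervals:
        times.add(start)
        times.add(end)
    return sorted(times)

def trim(bounds):
    """Drop the first and last boundary, as trim=True does."""
    return bounds[1:-1]

careful = [(0, 0.3, 'Silence'), (0.3, 10, 'A'), (120, 130, 'C'), (130, 131, 'Silence')]
casual  = [(0, 10, 'A'), (120, 131, 'C')]

print(trim(boundaries(careful)))  # [0.3, 10, 120, 130]
print(trim(boundaries(casual)))   # [10, 120]
```

After trimming, the careful annotation still contains the silence boundaries at 0.3 and 130, while the casual one retains only 10 and 120, so the same estimate is scored against quite different targets.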

Currently, it seems that the segment.detection function must be tricked by introducing very short silent segments, if those are not already present in the ground-truth, to consistently trim them, or not.

I'm not sure what you mean here: are you talking about modifying the reference annotations? Or the estimated annotations? I don't see why either should be necessary.

If you look at the two examples, the second one, which lacks leading and trailing silence segments, could be converted to something like (0, 0.001, 'Silence'), (0.001, 10, 'A'), ...., (120, 130.999, 'C'), (130.999, 131, 'Silence'), so that both variations would be approximately the same after trimming.
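That normalization could be sketched as follows, assuming a hypothetical `pad_silence` helper and an epsilon of 0.001 seconds (this is an illustration of the workaround, not anything mir_eval provides):

```python
# Sketch: ensure leading/trailing 'Silence' segments exist, inserting
# epsilon-length ones when missing, so that trimming behaves consistently
# across annotation styles.

EPS = 0.001

def pad_silence(intervals):
    padded = list(intervals)
    if padded[0][2] != 'Silence':
        start = padded[0][0]
        first = (start + EPS, padded[0][1], padded[0][2])
        padded = [(start, start + EPS, 'Silence'), first] + padded[1:]
    if padded[-1][2] != 'Silence':
        end = padded[-1][1]
        last = (padded[-1][0], end - EPS, padded[-1][2])
        padded = padded[:-1] + [last, (end - EPS, end, 'Silence')]
    return padded

casual = [(0, 10, 'A'), (120, 131, 'C')]
print(pad_silence(casual))
# [(0, 0.001, 'Silence'), (0.001, 10, 'A'), (120, 130.999, 'C'), (130.999, 131, 'Silence')]
```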

craffel commented 9 years ago

This sounds a lot to me like a specific problem with SALAMI due to the annotators not following the same annotation rules.

Right now, segment.detection is not aware of the fact that a leading or trailing part labeled 'Silence' must be handled differently than a non-silent part.

Unless this is a widely agreed-upon standard (across annotators, datasets, researchers, etc.), i.e. "When the first or last segment is labeled Silence, ignore it", I think it's outside of the scope of mir_eval. That is, it seems like a dataset/annotator-specific issue which should be resolved by some data cleaning before mir_eval is used at all.

grrrr commented 9 years ago

It is actually a very general problem with any kind of human annotation, but if this is the way mir_eval is meant to be used, I am fine with it.

bmcfee commented 9 years ago

Unless this is a widely-agreed upon (across annotators, datasets, researchers, etc) standard, i.e. "When the first or last segment is labeled Silence, ignore it", I think it's outside of the scope of mir_eval. That is, it seems like a dataset/annotator-specific issue which should be resolved by some data cleaning before mir_eval is used at all.

:+1:

This is a pretty nasty rabbit hole, especially if you apply the same logic to segment labels. For instance, the mirex implementation of structural annotation metrics has a bunch of special cases for segment labels to ignore (see here for one such example).

I don't think mir_eval can or should try to solve this problem in general, but rather punt back upstream to annotators/data collectors to be more precise about annotation schemes.

craffel commented 9 years ago

It is actually a very general problem with any kind of human annotation, but if this is the way mir_eval should be used, i am fine with it.

Yes, mir_eval as a philosophy tries to avoid dealing with annotation issues - i.e., it assumes that the annotations are clean/correct - because this is a realm where researchers can disagree a lot. In order to be as "standard" as possible, we need to leave issues like this up to the annotators/dataset.

I don't think mir_eval can or should try to solve this problem in general, but rather punt back upstream to annotators/data collectors to be more precise about annotation schemes.

OK, closing this. Thanks for bringing this up, Thomas.

f0k commented 9 years ago

In a well-formed annotation, the start time should be 0, which I've argued should never be counted as positive hits.

I wouldn't agree with that. As Thomas said, there may be two different situations. A file could start with silence (`0 1.2 Silence`, `1.2 10 A`), or a file could start right away with the song (`0 12.5 A`, `12.5 20 B`).

Do you say that getting the boundary at 1.2 for the first file should be counted as a hit, while getting the boundary at 0.0 for the second file should be discarded as trivial? And do you say that every boundary detector should output a boundary at 0.0 for the first song? Note that the same holds for the endings of songs. I would argue that a predicted boundary at 0.0 for the first song should be a false positive, a boundary at 1.2 for the first song should be a true positive, and a boundary at 0.0 for the second song should be a true positive. Alternatively, the boundaries at 1.2 for the first song and at 0.0 for the second song should be ignored as being trivial. Ignoring 0.0 for both the first and the second song is inconsistent.

Yes, mir_eval as a philosophy tries to avoid dealing with annotation issues

This is not an annotation issue. Some files do start or end with silence while others don't, and the evaluation should be able to take this into account.

A possible solution would be for us to preprocess the SALAMI ground truth to discard "Silence" segments at the beginning and end and set trim=False, but from what I've seen, then the evaluation goes wrong because it uses the minimum and maximum of the ground truth to trim or pad the predictions.
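The preprocessing described here could look something like the following (a hypothetical helper, not part of mir_eval); the caveat above still applies, since the evaluation then pads or trims estimates to the extent of the stripped reference:

```python
# Sketch: drop 'Silence' segments at the start and end of a reference
# annotation before evaluating with trim=False.

def strip_silence(intervals, silence_labels=('Silence',)):
    trimmed = list(intervals)
    while trimmed and trimmed[0][2] in silence_labels:
        trimmed.pop(0)
    while trimmed and trimmed[-1][2] in silence_labels:
        trimmed.pop()
    return trimmed

ref = [(0, 1.2, 'Silence'), (1.2, 10, 'A'), (10, 20, 'B')]
print(strip_silence(ref))  # [(1.2, 10, 'A'), (10, 20, 'B')]
```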

urinieto commented 9 years ago

Hi all,

The way I approached this problem, without having to modify any reference annotations or the mir_eval implementation, is the following:

As you can see, the first and final boundaries are actually trivial to detect, and therefore mir_eval provides a way to "remove" these boundaries in order to, in theory, have a less biased evaluation.

Does this make sense?

Regardless, very interesting discussion guys!


bmcfee commented 9 years ago

Do you say that getting the boundary at 1.2 for the first file should be counted as a hit, while getting the boundary at 0.0 for the second file should be discarded as trivial? And do you say that every boundary detector should output a boundary at 0.0 for the first song? Note that the same holds for the endings of songs.

Yes, and yes. There are a few arguments in favor of both.

  1. There's more to structure than just boundary event detection: structure annotation is also important. Most annotation metrics operate on a fixed set of sample times spanning [0, T_MAX], so all annotations for a given track have to cover that range in order to be comparable to the reference. If they don't, we have to pad or trim them before the metrics make any sense. It's therefore in the algorithm's interest to get the start/end time correct so as to avoid pollution from synthetic padding labels.
  2. In light of 1., you might suggest that boundary detection and structural annotation use different formats. I'd argue that segmentation should always be annotated as intervals, not just event times, since events can be ambiguous in isolation. (If you don't believe me, look at the crazy difficulty in parsing the original SALAMI functional annotations.) If you're going to annotate intervals, you may as well start at 0, since all tracks start there, so it's easy to detect and discard when evaluating.
  3. You can argue about whether "silence -> non-silence" is a meaningful transition, but I think it is. If you don't want to evaluate on those, that's your business, but it's not mir_eval's place to make that call.

I would argue that a predicted boundary at 0.0 for the first song should be a false positive, a boundary at 1.2 for the first song should be a true positive, and a boundary at 0.0 for the second song should be a true positive. Alternatively, the boundaries at 1.2 for the first song and at 0.0 for the second song should be ignored as being trivial. Ignoring 0.0 for both the first and the second song is inconsistent.

Wait, which one is the reference and which is the estimate? And would you still think 1.2 is trivial if the label was crowd noise instead of silence?

I'd say that ignoring 0 for both is entirely consistent, if you accept that silence->nonsilence is meaningful, i.e., if you take the first annotation as reference.

f0k commented 9 years ago

Wait, which one is the reference and which is the estimate?

I considered the two examples I gave to be references for two different songs: One that starts with 1.2 seconds of silence, and one that starts with music right away.

And would you still think 1.2 is trivial if the label was crowd noise instead of silence?

I don't necessarily think 1.2 is trivial. I thought the idea of trim=True was to treat silence->nonsilence and nonsilence->silence as not meaningful, because they're easy to detect and have a significant influence on results.

The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't. This is why I think these cases need to be distinguished in the evaluation. It's a separate question of whether detecting the beginning/end of the music is meaningful or not (I'd say it is, but it can be instructive to compare algorithms on their performance of detecting boundaries within the music only). "Detecting" the beginning/end of a file should not be relevant for anything.

bmcfee commented 9 years ago

I don't necessarily think 1.2 is trivial. I thought the idea of trim=True was to treat silence->nonsilence and nonsilence->silence as not meaningful, because they're easy to detect and have a significant influence on results.

No, the idea is really just to suppress the really obvious freebies that arise from the necessity of interval-based annotation. Silence->nonsilence may or may not be trivial, but trim doesn't make that call.
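As a rough illustration of what trim does (a simplified greedy sketch; mir_eval's segment.detection does proper window-based matching, so this is not its actual implementation):

```python
# Sketch: boundary-detection F-measure with and without trimming the
# first/last boundaries, in the spirit of segment.detection(..., trim=...).

def f_measure(ref, est, window=0.5, trim=False):
    if trim:
        # Discard the first and last boundary on both sides.
        ref, est = ref[1:-1], est[1:-1]
    if not ref or not est:
        return 0.0
    # Greedy one-to-one matching within the tolerance window.
    matched, used = 0, set()
    for r in ref:
        for i, e in enumerate(est):
            if i not in used and abs(r - e) <= window:
                matched += 1
                used.add(i)
                break
    precision = matched / len(est)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [0.0, 1.2, 10.0, 20.0]   # reference boundaries (file starts with silence)
est = [0.0, 10.1, 20.0]        # estimate that misses the silence->music boundary

print(f_measure(ref, est, trim=False))  # the freebies at 0 and 20 count
print(f_measure(ref, est, trim=True))   # only interior boundaries count
```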

The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't.

Quite correct. I'd say that in no case is beginning-of-file important. In the cases where they coincide, I'd say that beginning-of-music is also not meaningful, since there's no contrasting prior observation (eg, at negative time).

"Detecting" the beginning/end of a file should not be relevant for anything.

:+1:

craffel commented 9 years ago

I'm going to defer the actual segmentation discussion to you all because it's not my area, but in response to this

This is not an annotation issue. Some files do start or end with silence while others don't, and the evaluation should be able to take this into account.

As far as I can tell, the issue is that for some annotators in one dataset, a label 'Silence' is included at the beginning or end. If mir_eval handles this, do we also need to (as @bmcfee pointed out) handle Crowd Noise? What about Fade-in? Silent Intro? Etc. All of these could vary across annotators, datasets, etc. Unless it is truly standardized to include a label 'Silence' which should be ignored (not 'silence' or 'SILENCE', etc.), then this is an annotator/dataset-specific cleaning issue, as far as I can tell, and is not within the scope of mir_eval.

The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't. This is why I think these cases need to be distinguished in the evaluation.

I'm seeing this issue, but it's different from what I was referring to above, and I'm going to defer to you all to come up with an agreed-upon solution!

grrrr commented 9 years ago

Hi all, as I am dealing with this issue again for the MIREX submission, here is some further input:

bmcfee commented 9 years ago

stripping of leading or trailing known "silent" segments (denoting the absence of any signal) from ground-truth and predictions should be done in a preprocessing step. It seems we agree that this is not in the scope of mir_eval.

I agree that this isn't mir_eval's job; I'm not sure I agree that it should be done though. If, as you say below, the transition from silence to non-silence is meaningful, then shouldn't that information be retained?

the first and last resulting boundary should be included in the evaluation (not stripped). There are two reasons: 1) For the example of SALAMI audio, only some of the audio files are hard-clipped to the extents of the audio file (a trivial boundary). Most have some amount of silence at the beginning or end.

Do you mean the extents of the "song" (whatever that might mean)?

Detection of the respective first or last boundary is a truly non-trivial task (e.g., fade-in/out), so to my mind these boundaries should be included in the evaluation.

Yes, I absolutely agree. Maybe a compromise here is to run through the data and figure out which tracks have leading silence, and only for those do start/end trimming? It's methodologically ugly to mix results in this way, but it would work.

If you're worried about consistency, I'd recommend mangling the salami data by padding the audio (and annotations) with begin/end silence so that all tracks are on even footing, and the scores can be reported in a unified way.

Of course, if the temporal tolerance is high enough (e.g., 3 seconds) most of these transitions will practically coincide with file extents, but the choice of tolerance is another issue.

2) The number of boundaries in many pieces is really low. Stripping two of them voluntarily increases statistical variance considerably. For some pieces there might not even be any annotated boundaries other than these.

A couple of thoughts here:

1) If start and end are retained, everyone gets two hits for free. This does mean that variance is reduced, but it also increases bias. (And, as we note in the mir_eval paper, it reduces the power of comparisons by artificially narrowing the effective range of scores.) Since the contribution to the score depends on the number of total boundaries, which varies across tracks, correcting for this bias post-hoc is exceedingly tricky. I think it's better to just drop these trivial boundaries from the evaluation up-front.

1.5) Dropping the "trivial" boundaries has to be done within the metric, and not as a pre-processing step of the annotations. This is because mangling the annotations A) won't be consistent across different estimators, and B) structure annotation metrics (pairwise-f, nce) need the information encoded in those boundaries, and are not subject to the triviality bias.

2) If there are no non-trivial boundaries in the piece, why are we evaluating on it? I think even if a method does poorly in this use-case, no real-world user would care very much.
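The "two free hits" point above can be made concrete with a toy calculation (hypothetical numbers, naive precision/recall over matched boundaries): with trim=False, even a degenerate estimator that outputs nothing but the file extents scores well above zero.

```python
# Toy calculation of the bias from retaining trivial boundaries: an
# "estimator" emitting only the file extents [0, T] still gets two hits.

def prf(ref, est, window=0.5):
    # Naive counting for illustration only (no one-to-one matching).
    matched = sum(1 for r in ref if any(abs(r - e) <= window for e in est))
    p = matched / len(est)
    r = matched / len(ref)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ref = [0.0, 12.3, 45.6, 78.9, 100.0]   # hypothetical reference boundaries
degenerate = [0.0, 100.0]              # outputs the file extents only

p, r, f = prf(ref, degenerate)
print(p, r, f)  # precision 1.0, recall 0.4, F about 0.57
```

This is the artificial narrowing of the score range mentioned above: every estimator's score floor is lifted by the shared trivial boundaries.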

craffel commented 9 years ago

Any further comments/action items on this, or can I close?