Closed: grrrr closed this issue 9 years ago
> Quite frequently, ground-truth annotations contain leading and/or trailing silence, typically (e.g. in SALAMI) annotated using special segment labels like 'Silence'. When such a special segment label appears at the front, the start time of this segment is not a real boundary that boundary detection algorithms would find; the same holds, conversely, for the end times of trailing silent segments.
Do you mean the start or the end of the first segment?
In a well-formed annotation, the start time should be 0, which I've argued should never be counted as a positive hit. Setting `trim=True` will discount the first and last segment boundaries; the only reason it's set to `False` by default is backwards compatibility with mirex.
If you mean the end time of the first segment, then you're talking about non-trivial manipulation of the annotation data, and I'd be very careful here.
> Currently, it seems that the segment.detection function must be tricked by introducing very short silent segments, if those are not already present in the ground-truth, to consistently trim them, or not.
I'm not sure what you mean here: are you talking about modifying the reference annotations? Or the estimated annotations? I don't see why either should be necessary.
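For concreteness, here is a rough sketch of what the `trim` flag does during boundary evaluation. This is not mir_eval's actual implementation (the real `mir_eval.segment.detection` uses an optimal matching between boundary sets); the greedy matching and all helper names below are simplifications for illustration only.

```python
import numpy as np

def boundaries(intervals):
    """Collapse (start, end) intervals into sorted, unique boundary times."""
    return np.unique(np.asarray(intervals, dtype=float).ravel())

def detection_f(ref_intervals, est_intervals, window=0.5, trim=False):
    """Boundary-detection F-measure; with trim=True the first and last
    boundary of both the reference and the estimate are discarded."""
    ref, est = boundaries(ref_intervals), boundaries(est_intervals)
    if trim:
        ref, est = ref[1:-1], est[1:-1]
    if len(ref) == 0 or len(est) == 0:
        return 0.0
    hits, used = 0, set()
    for r in ref:  # greedy one-to-one matching within the tolerance window
        for j, e in enumerate(est):
            if j not in used and abs(r - e) <= window:
                hits += 1
                used.add(j)
                break
    if hits == 0:
        return 0.0
    precision, recall = hits / len(est), hits / len(ref)
    return 2 * precision * recall / (precision + recall)

# An estimate that only gets the "free" boundaries at 0 and 131 right:
ref = [(0, 10), (10, 131)]
est = [(0, 50), (50, 131)]
print(detection_f(ref, est, trim=False))  # ~0.667: the two freebies count
print(detection_f(ref, est, trim=True))   # 0.0: only 10 vs. 50 remain
```

With `trim=False` the trivially detectable file-start and file-end boundaries inflate the score; with `trim=True` they are discounted.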
> Quite frequently, ground-truth annotations contain leading and/or trailing silence, typically (e.g. in SALAMI) annotated using special segment labels like 'Silence'. When such a special segment label appears at the front, the start time of this segment is not a real boundary that boundary detection algorithms would find; the same holds, conversely, for the end times of trailing silent segments.
> Do you mean the start or the end of the first segment?
The start of the first segment, or conversely, the end of the last segment.
> In a well-formed annotation, the start time should be 0, which I've argued should never be counted as a positive hit. Setting `trim=True` will discount the first and last segment boundaries; the only reason it's set to `False` by default is backwards compatibility with mirex.
Exactly, but this is difficult terrain. In SALAMI, the two following variations appear equally often, here shown with (start,end,label) triplets:
`(0, 0.3, 'Silence'), (0.3, 10, 'A'), ..., (120, 130, 'C'), (130, 131, 'Silence')` or `(0, 10, 'A'), ..., (120, 131, 'C')`
In the first case, the annotator was very careful about marking leading and trailing silence (the latter probably being a fade-out); in the second case, the annotator was less pedantic.
Trimming both versions in segment.detection yields completely different sets of reference boundaries.
If we assume that a typical good boundary detector would produce boundary estimates at approximately 0 seconds and somewhere between 130 and 131 seconds, then comparing against the two trimmed reference sets results in very different evaluations. Right now, segment.detection is not aware that a leading or trailing part labeled 'Silence' must be handled differently from a non-silent part.
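To make the discrepancy concrete, here is a small sketch using the interval values from the example above (a single `'B'` segment stands in for the elided middle; `trimmed_boundaries` is an illustrative helper, not a mir_eval function):

```python
import numpy as np

# Two SALAMI-style annotations of the same piece, as (start, end, label):
careful = [(0, 0.3, 'Silence'), (0.3, 10, 'A'), (10, 120, 'B'),
           (120, 130, 'C'), (130, 131, 'Silence')]
lenient = [(0, 10, 'A'), (10, 120, 'B'), (120, 131, 'C')]

def trimmed_boundaries(annotation):
    """Unique boundary times with the first and last removed (trim=True)."""
    times = np.unique([t for (start, end, _) in annotation
                       for t in (start, end)])
    return times[1:-1]

print(trimmed_boundaries(careful))  # 0.3, 10, 120, 130
print(trimmed_boundaries(lenient))  # 10, 120
```

The careful annotation keeps the silence-to-music transitions (0.3 and 130) as reference boundaries after trimming; the lenient one loses them entirely.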
> Currently, it seems that the segment.detection function must be tricked by introducing very short silent segments, if those are not already present in the ground-truth, to consistently trim them, or not.
> I'm not sure what you mean here: are you talking about modifying the reference annotations? Or the estimated annotations? I don't see why either should be necessary.
If you look at the two examples, the second one, which is missing leading and trailing silences, could be converted to something like `(0, 0.001, 'Silence'), (0.001, 10, 'A'), ..., (120, 130.999, 'C'), (130.999, 131, 'Silence')` so that both variations would be approximately the same after trimming.
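Sketched as code, this workaround might look as follows (the function name, epsilon, and label are my own choices, not part of mir_eval):

```python
def ensure_edge_silence(annotation, eps=0.001, label='Silence'):
    """Insert tiny edge 'Silence' segments where they are missing, so that
    trimming removes comparable boundaries from every annotation.
    `annotation` is a list of (start, end, label) triplets."""
    out = list(annotation)
    if out[0][2] != label:  # no leading silence annotated
        start, end, lab = out[0]
        out[0] = (start + eps, end, lab)
        out.insert(0, (start, start + eps, label))
    if out[-1][2] != label:  # no trailing silence annotated
        start, end, lab = out[-1]
        out[-1] = (start, end - eps, lab)
        out.append((end - eps, end, label))
    return out

lenient = [(0, 10, 'A'), (10, 120, 'B'), (120, 131, 'C')]
print(ensure_edge_silence(lenient))
# adds a leading (0, 0.001, 'Silence') and a trailing (~130.999, 131, 'Silence')
```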
This sounds a lot to me like a specific problem with SALAMI due to the annotators not following the same annotation rules.
> Right now, segment.detection is not aware of the fact that a leading or trailing part labeled 'Silence' must be handled differently than a non-silent part.
Unless this is a widely agreed-upon (across annotators, datasets, researchers, etc.) standard, i.e. "When the first or last segment is labeled `Silence`, ignore it", I think it's outside of the scope of mir_eval. That is, it seems like a dataset/annotator-specific issue which should be resolved by some data cleaning before mir_eval is used at all.
It is actually a very general problem with any kind of human annotation, but if this is the way mir_eval should be used, I am fine with it.
> Unless this is a widely agreed-upon (across annotators, datasets, researchers, etc.) standard, i.e. "When the first or last segment is labeled `Silence`, ignore it", I think it's outside of the scope of mir_eval. That is, it seems like a dataset/annotator-specific issue which should be resolved by some data cleaning before mir_eval is used at all.
:+1:
This is a pretty nasty rabbit hole, especially if you apply the same logic to segment labels. For instance, the mirex implementation of structural annotation metrics has a bunch of special cases for segment labels to ignore (see here for one such example).
I don't think mir_eval can or should try to solve this problem in general, but rather punt back upstream to annotators/data collectors to be more precise about annotation schemes.
> It is actually a very general problem with any kind of human annotation, but if this is the way mir_eval should be used, I am fine with it.
Yes, mir_eval as a philosophy tries to avoid dealing with annotation issues - i.e., it assumes that the annotations are clean/correct - because this is a realm where researchers can disagree a lot. In order to be as "standard" as possible, we need to leave issues like this up to the annotators/dataset.
> I don't think mir_eval can or should try to solve this problem in general, but rather punt back upstream to annotators/data collectors to be more precise about annotation schemes.
OK, closing this. Thanks for bringing this up, Thomas.
> In a well-formed annotation, the start time should be 0, which I've argued should never be counted as positive hits.
I wouldn't agree with that. As Thomas said, there may be two different situations. A file could start with silence (`0 1.2 Silence`, `1.2 10 A`), or a file could start right away with the song (`0 12.5 A`, `12.5 20 B`).
Do you say that getting the boundary at 1.2 for the first file should be counted as a hit, while getting the boundary at 0.0 for the second file should be discarded as trivial? And do you say that every boundary detector should output a boundary at 0.0 for the first song? Note that the same holds for the endings of songs. I would argue that a predicted boundary at 0.0 for the first song should be a false positive, a boundary at 1.2 for the first song should be a true positive, and a boundary at 0.0 for the second song should be a true positive. Alternatively, the boundaries at 1.2 for the first song and at 0.0 for the second song should be ignored as being trivial. Ignoring 0.0 for both the first and the second song is inconsistent.
> Yes, mir_eval as a philosophy tries to avoid dealing with annotation issues
This is not an annotation issue. Some files do start or end with silence while others don't, and the evaluation should be able to take this into account.
A possible solution would be for us to preprocess the SALAMI ground truth to discard "Silence" segments at the beginning and end and set `trim=False`, but from what I've seen, the evaluation then goes wrong because it uses the minimum and maximum of the ground truth to trim or pad the predictions.
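The preprocessing proposed here could be sketched roughly like this (the function and the set of silence labels are assumptions, not mir_eval API; it does nothing about the trim/pad caveat just mentioned):

```python
def strip_edge_silence(annotation, silence_labels=('Silence',)):
    """Drop a leading and/or trailing segment whose label marks silence.
    `annotation` is a list of (start, end, label) triplets."""
    out = list(annotation)
    if out and out[0][2] in silence_labels:
        out = out[1:]
    if out and out[-1][2] in silence_labels:
        out = out[:-1]
    return out

careful = [(0, 0.3, 'Silence'), (0.3, 10, 'A'),
           (120, 130, 'C'), (130, 131, 'Silence')]
print(strip_edge_silence(careful))
# [(0.3, 10, 'A'), (120, 130, 'C')]
```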
Hi all,
The way I approached this problem, without having to modify any reference annotations or the mir_eval implementation, is the following:
As you can see, the first and final boundaries are actually trivial to detect, and therefore mir_eval provides a way to "remove" these boundaries in order to, in theory, have a less biased evaluation.
Does this make sense?
Regardless, very interesting discussion guys!
> Do you say that getting the boundary at 1.2 for the first file should be counted as a hit, while getting the boundary at 0.0 for the second file should be discarded as trivial? And do you say that every boundary detector should output a boundary at 0.0 for the first song? Note that the same holds for the endings of songs.
Yes, and yes. There are a few arguments in favor of both.
`[0, T_MAX]`, so all annotations for a given track have to cover that range in order to be comparable to the reference. If they don't, we have to pad or trim them before the metrics make any sense. It's therefore in the algorithm's interest to get the start/end time correct so as to avoid pollution from synthetic padding labels.
> I would argue that a predicted boundary at 0.0 for the first song should be a false positive, a boundary at 1.2 for the first song should be a true positive, and a boundary at 0.0 for the second song should be a true positive. Alternatively, the boundaries at 1.2 for the first song and at 0.0 for the second song should be ignored as being trivial. Ignoring 0.0 for both the first and the second song is inconsistent.
Wait, which one is the reference and which is the estimate? And would you still think 1.2 is trivial if the label was `crowd noise` instead of `silence`?
I'd say that ignoring 0 for both is entirely consistent, if you accept that silence->nonsilence is meaningful, i.e., if you take the first annotation as reference.
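The padding/trimming that forces every annotation onto a common `[0, T_MAX]` range can be sketched as follows. This mirrors the behavior of `mir_eval.util.adjust_intervals` only loosely; the function name and the padding label are illustrative, not the library's actual choices.

```python
def pad_or_trim(intervals, labels, t_min, t_max, pad_label='__silence'):
    """Force labeled (start, end) intervals to cover exactly [t_min, t_max]:
    overhang is trimmed, and gaps at the edges are filled with a synthetic
    label (which is how "pollution from padding labels" can arise)."""
    out, out_labels = [], []
    for (start, end), lab in zip(intervals, labels):
        start, end = max(start, t_min), min(end, t_max)
        if start < end:  # drop intervals falling entirely outside the range
            out.append((start, end))
            out_labels.append(lab)
    if not out:
        return [(t_min, t_max)], [pad_label]
    if out[0][0] > t_min:  # pad the front
        out.insert(0, (t_min, out[0][0]))
        out_labels.insert(0, pad_label)
    if out[-1][1] < t_max:  # pad the back
        out.append((out[-1][1], t_max))
        out_labels.append(pad_label)
    return out, out_labels

# An estimate that only covers 1.2-10 s of a 12 s track gets padded:
ivs, labs = pad_or_trim([(1.2, 10)], ['A'], t_min=0, t_max=12)
print(ivs)   # [(0, 1.2), (1.2, 10), (10, 12)]
print(labs)  # ['__silence', 'A', '__silence']
```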
> Wait, which one is the reference and which is the estimate?
I considered the two examples I gave to be references for two different songs: One that starts with 1.2 seconds of silence, and one that starts with music right away.
> And would you still think 1.2 is trivial if the label was `crowd noise` instead of `silence`?
I don't necessarily think 1.2 is trivial. I thought the idea of `trim=True` was to treat silence->nonsilence and nonsilence->silence as not meaningful, because they're easy to detect and have a significant influence on results.
The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't. This is why I think these cases need to be distinguished in the evaluation. It's a separate question of whether detecting the beginning/end of the music is meaningful or not (I'd say it is, but it can be instructive to compare algorithms on their performance of detecting boundaries within the music only). "Detecting" the beginning/end of a file should not be relevant for anything.
> I don't necessarily think 1.2 is trivial. I thought the idea of `trim=True` was to treat silence->nonsilence and nonsilence->silence as not meaningful, because they're easy to detect and have a significant influence on results.
No, the idea is really just to suppress the really obvious freebies that arise from the necessity of interval-based annotation. Silence->nonsilence may or may not be trivial, but trim doesn't make that call.
> The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't.
Quite correct. I'd say that in no case is beginning-of-file important. In the cases where they coincide, I'd say that beginning-of-music is also not meaningful, since there's no contrasting prior observation (eg, at negative time).
> "Detecting" the beginning/end of a file should not be relevant for anything.
:+1:
I'm going to defer the actual segmentation discussion to you all because it's not my area, but in response to this
> This is not an annotation issue. Some files do start or end with silence while others don't, and the evaluation should be able to take this into account.
As far as I can tell, the issue is that for some annotators in one dataset, a label `Silence` is included at the beginning or end. If mir_eval handles this, do we also need to (as @bmcfee pointed out) handle `Crowd Noise`? What about `Fade-in`? `Silent Intro`? Etc. All of these could vary across annotators, datasets, etc. Unless it is truly standardized to include a label `Silence` which should be ignored (not `silence` or `SILENCE` etc.), then this is an annotator/dataset-specific cleaning issue, as far as I can tell, and is not within the scope of mir_eval.
> The problem I see comes from confounding two things: The beginning/end of a file, and the beginning/end of the music. In some cases, they coincide, and in some cases they don't. This is why I think these cases need to be distinguished in the evaluation.
I'm seeing this issue - but it's different from what I was referring to above, and I'm going to defer to you all to come up with an agreed-upon solution!
> Hi all, as I am dealing with this issue for the MIREX submission again, some further input:
> Stripping of leading or trailing known "silent" segments (denoting the absence of any signal) from ground-truth and predictions should be done in a preprocessing step. It seems we agree that this is not in the scope of mir_eval.
I agree that this isn't mir_eval's job; I'm not sure I agree that it should be done though. If, as you say below, the transition from silence to non-silence is meaningful, then shouldn't that information be retained?
> The first and last resulting boundary should be included in the evaluation (not stripped). There are two reasons: 1) For the example of SALAMI audio, only some of the audio files are hard-clipped to the extents of the audio file (trivial boundary). Most have some amount of silence at the beginning or end.
Do you mean the extents of the "song" (whatever that might mean)?
> Detection of the respective first or last boundary is a truly non-trivial task (e.g., fade-in/out), so to my mind these boundaries should be included in the evaluation.
Yes, I absolutely agree. Maybe a compromise here is to run through the data and figure out which tracks have leading silence, and only for those do start/end trimming? It's methodologically ugly to mix results in this way, but it would work.
If you're worried about consistency, I'd recommend mangling the salami data by padding the audio (and annotations) with begin/end silence so that all tracks are on even footing, and the scores can be reported in a unified way.
> Of course, if the temporal tolerance is high enough (e.g., 3 seconds), most of these transitions will practically coincide with file extents, but the choice of tolerance is another issue.
> 2) The number of boundaries for many pieces is really low. Stripping two of them voluntarily increases statistical variance considerably. For some pieces there might not even be any annotated boundaries other than these.
A couple of thoughts here:
1) If start and end are retained, everyone gets two hits for free. This does mean that variance is reduced, but it also increases bias. (And, as we note in the mir_eval paper, it reduces the power of comparisons by artificially narrowing the effective range of scores.) Since the contribution to the score depends on the number of total boundaries, which varies across tracks, correcting for this bias post-hoc is exceedingly tricky. I think it's better to just drop these trivial boundaries from the evaluation up-front.
1.5) Dropping the "trivial" boundaries has to be done within the metric, and not as a pre-processing step of the annotations. This is because mangling the annotations A) won't be consistent across different estimators, and B) structure annotation metrics (pairwise-f, nce) need the information encoded in those boundaries, and are not subject to the triviality bias.
2) If there are no non-trivial boundaries in the piece, why are we evaluating on it? I think even if a method does poorly in this use-case, no real-world user would care very much.
Any further comments/action items on this, or can I close?