mir-evaluation / mir_eval

Evaluation functions for music/audio information retrieval/signal processing algorithms.

Implement more separation metrics #68

Open · craffel opened this issue 10 years ago

craffel commented 10 years ago

Although bss_eval is pretty standard, there are more: http://bass-db.gforge.inria.fr/peass/ http://labrosa.ee.columbia.edu/projects/snreval/

faroit commented 8 years ago

What about also supporting the multichannel measures?

craffel commented 8 years ago

Good idea!

carlthome commented 8 years ago

PEASS would be a welcome addition, I think.

faroit commented 8 years ago

Coming back to the state of bss_eval, I want to trigger some discussion about what we could contribute to mir_eval and what the possible use cases would be. For me, that would mainly be the SiSEC evaluation organised by @aliutkus this year.

Current implementation

Issues/Missing features

Multichannel support

Since many separation methods now evaluate at least stereo mixes, supporting the Image to Spatial distortion Ratio (ISR) would increase the usability of mir_eval a lot: the current SiSEC dataset is stereo and is evaluated using bss_eval_images.

My student @ErikJohnsonCU and I have a working pure-Python implementation of bss_eval_images which we could make ready for a PR into mir_eval. However, one of the issues that would need to be solved first is:

Frame-wise evaluation

The SiSEC MATLAB-based evaluation evaluates the audio signal frame-wise instead of globally over the whole signal. We could add this to both the existing sources metrics and the new images metrics. This would solve #192.
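
A minimal sketch of what frame-wise evaluation could look like, built on the existing mono bss_eval_sources (the helper name framewise_bss_eval and the window/hop values are placeholders, not the official SiSEC settings):

```python
import numpy as np
import mir_eval

def framewise_bss_eval(reference_sources, estimated_sources,
                       window=30 * 44100, hop=15 * 44100):
    """Frame-wise bss_eval_sources sketch.

    reference_sources, estimated_sources: arrays of shape (nsrc, nsamples).
    window, hop: frame length and hop size in samples (placeholder values).
    Returns sdr, sir, sar arrays of shape (nsrc, nframes).
    """
    nsrc, nsamples = reference_sources.shape
    sdr, sir, sar = [], [], []
    for start in range(0, nsamples - window + 1, hop):
        ref = reference_sources[:, start:start + window]
        est = estimated_sources[:, start:start + window]
        # Note: frames where a reference source is silent would need special
        # handling (bss_eval_sources rejects all-zero reference sources).
        frame_sdr, frame_sir, frame_sar, _ = \
            mir_eval.separation.bss_eval_sources(ref, est)
        sdr.append(frame_sdr)
        sir.append(frame_sir)
        sar.append(frame_sar)
    return np.array(sdr).T, np.array(sir).T, np.array(sar).T
```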

Speed

bss_eval_sources.m is slow. The mir_eval version is even slower. Would you be interested in a cython, numba... optimised version or would you prefer to have as few external dependencies as possible, hence simpler to install?

DSD100

This year we will also have some Python tools for DSD100 which parse the dataset and make it easy for Python users to participate in SiSEC. However, we cannot use mir_eval for the official evaluation since multichannel support is missing. Therefore it would be really cool if we can get the images support in within the next few weeks ;-)

carlthome commented 8 years ago

Frame-wise evaluation

Yes, please! :+1:

bss_eval_sources.m is slow. The mir_eval version is even slower. Would you be interested in a cython, numba... optimised version or would you prefer to have as few external dependencies as possible, hence simpler to install?

This would be awesome. It's really dreadfully slow at the moment.

dawenl commented 8 years ago

I remember back then there was a discussion about whether to use the slower scipy.signal.fftconvolve or a faster customized implementation. But Numba seems a little too much (in terms of dependency/compatibility); maybe Cython is a good middle ground?
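
For reference, the API difference between the two convolution routes is tiny; a toy comparison (sizes are arbitrary, the 512-tap kernel just mirrors the distortion-filter length used by bss_eval, if I recall correctly):

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
x = rng.standard_normal(44100)   # one second of audio-like data
h = rng.standard_normal(512)     # distortion-filter-sized kernel

y_direct = np.convolve(x, h, mode='full')  # direct convolution, O(N*M)
y_fft = fftconvolve(x, h, mode='full')     # FFT-based convolution, O(N log N)

assert np.allclose(y_direct, y_fft)        # same result up to numerical error
```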

craffel commented 8 years ago

Thanks for this discussion!

My student @ErikJohnsonCU and I have a working bss_eval_images pure python implementation which we could make ready for a PR into mir_eval.

Great! A PR is welcome.

How to do the unit tests regarding the test vectors, which are mono? I would propose not to add more test vectors to the repository, but instead to generate fake stereo files from the existing mono files.

That sounds fine to me.

We can also add new JSON files for MATLAB compatibility comparisons

Usually we do the MATLAB (or other pre-existing implementation) comparison in the PR, and don't include it directly in the repository. All of the JSON data is actually for regression testing.

@craffel how would you add bss_eval_images? As a separate function, or as a wrapper so that users can decide whether they want to evaluate the separation on sources or images?

I'm not 100% clear on what the API should look like. How would calling bss_eval_images be different from calling bss_eval_sources? Could we just add a flag to bss_eval_sources, or are they very different functions? Sorry, SS eval is not an area I know a lot about! I am happy to defer to community judgement here.

The SiSEC MATLAB-based evaluation evaluates the audio signal frame-wise instead of globally over the whole signal. We could add this to both the existing sources metrics and the new images metrics. This would solve #192.

Cool.

bss_eval_sources.m is slow. The mir_eval version is even slower. Would you be interested in a cython, numba... optimised version or would you prefer to have as few external dependencies as possible, hence simpler to install?

My impression was that most of the slowness was due to the fact that a large FFT was computed. Would simply evaluating frame-wise help quite a lot? What would you envision being Cythonized/Numba-ized? I personally am a huge advocate for Numba over Cython, but in the case of mir_eval we usually try to make the dependencies as common and simple to install as possible - unfortunately, Numba doesn't satisfy that.

Therefore it would be really cool if we can get the images support in within the next few weeks ;-)

That would be great! Let's work towards that.

aliutkus commented 8 years ago

Dear all,

My impression is that we should just drop bss_eval_sources. There is a lot of controversy about that one, which actually computes scores based on a filtered version of the true sources, hence modifying the reference signal depending on its estimate.

My feeling is that the source separation field is going in the direction of stereo signals anyway, and that evaluating image separation is sufficient there. Indeed, the problem of evaluating a stereo-to-mono method that would go from the image back to the original mono source is, in my opinion, a somewhat different problem.

Hence, I would advise just sticking to bss_eval_images there.

carlthome commented 8 years ago

@aliutkus, I really don't think stereo signals should be expected. In particular, for digital speech enhancement/denoising tasks a single-channel setting is still common. There's a strong use case for mono recordings in real-world applications in terms of data bandwidth and so on, particularly for cellphones. Even the iPhone defaults to recording mono audio files, in fact.

For example (source):

[...] we provide a systematic comparison of single-channel and two-channel methods, showing that RNN-based single-channel enhancement can yield a recognition performance that is on par with the previous best two-channel system, and at the same time is complementary to two-channel pre-processing.

However, in my experience bss_eval_sources is a bit easy to fool and you could get pretty decent SDR by just bandpassing the mixture, for example. This is mostly a problem due to global SDR being used, I believe. If there were local measures in mir_eval for estimating a lower bound (e.g. what's the worst source-to-distortion ratio we'd expect to get in a separation?) it would not be as much of a problem. So, if mir_eval could support doing boxplots like the SiSEC 2015 paper for showing SDR, SIR, SAR variance, then I'd be very happy. :tada:

craffel commented 8 years ago

From a broader perspective, in mir_eval we try to include all metrics that are commonly used in the community, even if they are suboptimal. For example, the "goto" metric for beat evaluation is informally accepted as being not terribly useful, but we still include it since it's still used in MIREX and in miscellaneous papers. Even if bss_eval_sources is eventually completely abandoned by the community, it's still useful as a point of comparison to prior work. All that being said, I totally encourage documenting a discussion of a metric's shortcomings in the function's docstring.

aliutkus commented 8 years ago

Yo, you're of course right when you mention that there's a lot of work on single-channel separation. Actually, I was oversimplifying here. The point is rather whether we are evaluating against the "sources" or the "images" as references. This can happen in both single-channel and multichannel separation. Let I denote the number of channels of the mixture (I=1 for mono, I=2 for stereo). I'll just recall the difference:

In bss_eval_images, you directly compare the reference objects (I channels each) to the estimated objects (I channels each). In bss_eval_sources, you want to compare, in some way, the unknown mono source references with the mono estimated sources. In the single-channel case, you might think these yield the same estimates, but they do not: bss_eval_sources assumes that the sound you provide as ground truth has been filtered before yielding the image against which you should be comparing. You are then arbitrarily assuming that this filter is identical to the best linear filter that maps the provided reference to the estimate. This is the thing that is arguable. It means that the reference sound against which you are comparing is modified according to the test sound each time bss_eval_sources is called. For one given example, two methods are hence not compared against the same ground truth!
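
To make the distinction concrete, here is a sketch of the two decompositions (my notation, roughly following the standard bss_eval formulation; P_{s_j} is the projection onto delayed copies of the true source, i.e. the "best filter" discussed above):

```latex
% bss_eval_sources: the target is a *filtered* version of the reference,
% obtained by projecting the estimate onto delayed copies of the true source
\hat{s}_j = \underbrace{P_{s_j}\,\hat{s}_j}_{s_{\mathrm{target}}} + e_{\mathrm{interf}} + e_{\mathrm{artif}},
\qquad
\mathrm{SDR} = 10\log_{10}\frac{\lVert s_{\mathrm{target}}\rVert^2}{\lVert e_{\mathrm{interf}} + e_{\mathrm{artif}}\rVert^2}

% bss_eval_images: the reference image is used as-is (no filter is estimated),
% and a spatial-distortion term appears, measured by the ISR
\hat{s}_j^{\mathrm{img}} = s_j^{\mathrm{img}} + e_{\mathrm{spat}} + e_{\mathrm{interf}} + e_{\mathrm{artif}},
\qquad
\mathrm{ISR} = 10\log_{10}\frac{\lVert s_j^{\mathrm{img}}\rVert^2}{\lVert e_{\mathrm{spat}}\rVert^2}
```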

In practice, you can get infinite SDR with bss_eval_sources by just zeroing frequencies. The strategy would be: put 0 in all bins at the frequencies you're unsure of, and put the mix in some others. The reference will be annihilated at your zeroed frequencies and you should get a pretty good score for no reason... This is not really a matter of using a global or local metric in my opinion, but rather of modifying the ground truth before comparison. You could get time-dependent metrics for both bss_eval_images and bss_eval_sources.

What I wanted to say was hence not really a matter of dropping mono evaluations, but rather of dropping this estimation of the best filter that is in bss_eval_sources. Anyway, most corpora I am aware of actually provide the images of the objects: the mix is simply their sum. This makes estimating the "mixing filter" unnecessary and pretty dangerous.

faroit commented 8 years ago

@craffel

Usually we do the MATLAB (or other pre-existing implementation) comparison in the PR, and don't include it directly in the repository. All of the JSON data is actually for regression testing.

Alright.

I'm not 100% clear on what the API should look like. How would calling bss_eval_images be different from calling bss_eval_sources? Could we just add a flag to bss_eval_sources, or are they very different functions? Sorry, SS eval is not an area I know a lot about! I am happy to defer to community judgement here.

Since @aliutkus made clear that bss_eval_sources and bss_eval_images have different outputs even for mono signals, they should be separate functions. The question is what the evaluate wrapper function should look like. Adding a method argument like method='images' / method='sources'?

Also, about the frame-wise evaluation metric for both functions: shall we just add something like scores['Source to Distortion (framewise)'], which would then be a nested list of shape (nsrc, nframes)?
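
Concretely, something like this (all key names and shapes are just for discussion, not an existing API):

```python
import numpy as np

nsrc, nframes = 2, 4  # toy sizes, purely illustrative

# Hypothetical output of an extended evaluate() wrapper:
scores = {
    # global (whole-signal) metrics: one value per source
    'Source to Distortion': [5.1, 7.3],
    'Source to Interference': [12.0, 14.2],
    'Source to Artifact': [6.8, 8.0],
    # proposed frame-wise variants: nested lists of shape (nsrc, nframes)
    'Source to Distortion (framewise)': np.zeros((nsrc, nframes)).tolist(),
    'Source to Interference (framewise)': np.zeros((nsrc, nframes)).tolist(),
    'Source to Artifact (framewise)': np.zeros((nsrc, nframes)).tolist(),
}
```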

craffel commented 8 years ago

What I wanted to say was hence not really a matter of dropping mono evaluations, but rather of dropping this estimation of the best filter that is in bss_eval_sources.

Ok, I think I see what you're getting at. From a high-level perspective though, mir_eval's goal is to implement metrics as they are commonly used, unless there is a fundamental "bug" which we can agree on. If this is actually an issue with evaluation, we can discuss removing it (maybe in a separate issue!) or we can add a flag to make it optional. Thanks for the discussion.

The question is what the evaluate wrapper function should look like. Adding a method argument like method='images' / method='sources'?

Generally, evaluate should call all metric functions no matter what - if a user only wants to compute a few metrics, they can just write a little Python code snippet to do exactly that.
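
For example, something along these lines, using the current mir_eval.separation API (random data here just so the snippet runs):

```python
import numpy as np
import mir_eval

# reference_sources and estimated_sources: arrays of shape (nsrc, nsamples)
rng = np.random.default_rng(0)
reference_sources = rng.standard_normal((2, 44100))
estimated_sources = reference_sources + 0.1 * rng.standard_normal((2, 44100))

# Compute only the sources metrics, nothing else.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
print(sdr)  # one SDR value per source
```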

Also, about the frame-wise evaluation metric for both functions: shall we just add something like scores['Source to Distortion (framewise)'], which would then be a nested list of shape (nsrc, nframes)?

That seems reasonable at a first pass. One issue is that usually metrics in mir_eval report a single aggregated score, rather than a bunch of local scores. I doubt we could agree on a few statistics which would be sufficient to characterize this information though (e.g. median and percentiles). We can hash this out a little later in the PR.

carlthome commented 8 years ago

Thanks @aliutkus for that great clarification. :heart: With the distinction between images and sources, I understand that evaluating images is much more sensible, especially philosophically, what with stereo-miked pianos, guitars, reverberation and so on, and drum kits with their ambiguous polyphonicity, like you mentioned.

carlthome commented 8 years ago

@craffel:

That seems reasonable at a first pass. One issue is that usually metrics in mir_eval report a single aggregated score, rather than a bunch of local scores. I doubt we could agree on a few statistics which would be sufficient to characterize this information though (e.g. median and percentiles). We can hash this out a little later in the PR.

For the simplest aggregate with the most useful information, I'd really like to push for the minimum and an average, as in https://github.com/craffel/mir_eval/issues/192 (median would be fine, central limit theorem be damned). Hopefully, in the future, an analytical lower bound could be expected as well (hard), but an estimated minimum on a test set is still very useful.

...and maybe max just for symmetry, but it's not interesting.

Whenever a mean is reported I expect the standard deviation as well, but I know I'm in the minority.

Js-Mim commented 8 years ago

@faroit , @ErikJohnsonCU

I would definitely be interested in that pure-Python implementation of bss_eval_images. Any updates on that?

Thanks, S.

faroit commented 8 years ago

@Js-Mim yes, actually it is already finished, but we are waiting for #199 to be ready so that we get the interface right. Also @ErikJohnsonCU is still looking into whether we can speed up the pure-Python version... But it should just be a matter of 1-2 weeks... hold on ;-)

Js-Mim commented 8 years ago

@faroit thanks for your fast reply. Let me know if you need a hand with the optimization. Cheers, S.

faroit commented 8 years ago

@Js-Mim thanks. Yeah, maybe we'll add a pure-Python version here and a highly optimised version in dsdtools, where we already know that processing these 5-minute tracks takes ages to compute...

justinsalamon commented 5 years ago

"SDR - half-baked or well done?" proposes a scale-invariant version of SDR (SI-SDR) that's more perceptually robust. Probably worth adding at some point?
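
For reference, SI-SDR just rescales the reference by the optimal scalar projection of the estimate onto it before computing the ratio; a minimal sketch for mono 1-D signals (the helper name si_sdr is mine):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR, minimal sketch for zero-mean mono signals."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Optimal scaling of the reference towards the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```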

faroit commented 5 years ago

@justinsalamon we discussed this among some people in the audio separation community and we decided that it makes more sense to develop future bsseval modules outside of mir_eval since the communities do not overlap too much. Also @craffel could probably verify that the separation module here is subject to a lot of issues on the tracker ;-)

For now, bsseval v4 was developed inside of museval, but I just forked it a couple of days ago with the intention to strip it down to just the actual metrics - see discussion here and ... to also add SI-SDR... yay

@craffel once things are settled we can of course add these things back in here.

justinsalamon commented 5 years ago

we decided that it makes more sense to develop future bsseval modules outside of mir_eval since the communities do not overlap too much.

Hmm, I'm not sure I agree with that statement. I've read several papers (recently) using mir_eval to evaluate BSS systems (rather than the MATLAB-based bss_eval), and I know everyone on the thread you linked to with the discussion, suggesting there's significant overlap between the communities :) Without being facetious though, the point is that even if these are separate communities, there's no inherent reason why bss_eval couldn't continue to be part of mir_eval, with the BSS community contributing to it.

Perhaps more importantly, there's the question of scope and community effort. At some point in the past a decision was made that BSS is within the scope of mir_eval, and should thus be supported. I don't know that there's been an explicit decision that this is no longer the case? As long as BSS is considered within the scope of mir_eval, it should be supported (and updated/maintained). All this to say, forking the BSS metrics out of museval doesn't automatically make a decision for mir_eval, but it definitely merits a discussion (here).

The other question is community effort - mir_eval already has a relatively large user base and community of developers contributing back to it, which the bss_eval metrics also benefit from. Carving bss_eval out would split development efforts. Basically, I don't think it makes much sense for there to be two separate implementations of BSS metrics in python (mir_eval and a separate bss_eval), as these will diverge over time and cause the very problems mir_eval was meant to overcome.

...all this to say, I think the BSS eval metrics should live in one place (as far as a python implementation is concerned). Up to now this place was mir_eval, but if there's motivation and agreement on migrating it outside of mir_eval then I believe the effort should be coordinated between mir_eval and bss_eval rather than decision-by-fork-making, to avoid duplicate efforts and competing implementations.

As for the question itself, I don't have a strong opinion in either direction. @craffel ?

craffel commented 5 years ago

I agree that it would make sense for Python bss-eval-ish metrics to all live in one place. I'm happy to have that be within mir_eval, and I don't see it as a huge burden to have conversations and code review around it happen in mir_eval, but if it makes more sense for it to live elsewhere I am fine with that too.

faroit commented 5 years ago

Hmm, I'm not sure I agree with that statement. I've read several papers (recently) using mir_eval to evaluate BSS systems (rather than the MATLAB-based bss_eval), and I know everyone on the thread you linked to with the discussion, suggesting there's significant overlap between the communities :) Without being facetious though, the point is that even if these are separate communities, there's no inherent reason why bss_eval couldn't continue to be part of mir_eval, with the BSS community contributing to it.

Yes, sorry, you are right, I should not have used such an argument for or against having bsseval in mir_eval, even though I still believe that the overlap is quite small once you also include the speech separation community, for which bsseval is also one of the main metrics used. In any case, this was not even the reason why we created bsseval v4 outside of mir_eval. In fact, we proposed this here first (#272), but then we found it awkward to include this metric without any reference or user feedback. We therefore decided to use v4 directly in the SiSEC 2018 challenge. Since then we have received plenty of feedback, and some of it was the wish to have a separate bsseval package.

Carving bss_eval out would split development efforts. Basically, I don't think it makes much sense for there to be two separate implementations of BSS metrics in python (mir_eval and a separate bss_eval), as these will diverge over time and cause the very problems mir_eval was meant to overcome. Up to now this place was mir_eval, but if there's motivation and agreement on migrating it outside of mir_eval then I believe the effort should be coordinated between mir_eval and bss_eval rather than decision-by-fork-making, to avoid duplicate efforts and competing implementations.

v4 was not a fork but a major rewrite that addressed some of the issues with v3 and also with the old MATLAB implementation. We tried to make it clear that this is different from the MATLAB/mir_eval implementation, and people could deliberately choose one or the other (we could see in recent papers that this seemed to work out).

In any case, I agree that there should not be competing implementations, and it should be clear to the community where to find Python source separation evaluation metrics. I would be totally fine with putting bsseval v4 and future development into mir_eval if this is what the (source separation) community would like and where they would like to contribute – oh, and if #269 were addressed ;-)

Ping @Jonathan-LeRoux @pseeth @hagenw @aliutkus @TE-StefanUhlich

Jonathan-LeRoux commented 5 years ago

I certainly don't want to step on anyone's toes, and I'm thankful for everyone's efforts to create a robust and complete suite of evaluation metrics, but the main advantage I see in having a separate module for separation metrics is that source separation people like me may not think of mir_eval as an obvious place to look for such metrics, if only for its name. Would it make sense to have mir_eval depend on an external bss_eval-like module for separation metrics? You guys are much more knowledgeable than I am on managing large projects, so I'd be interested in hearing your opinion on pros and cons.

bmcfee commented 5 years ago

My $2e-2: I think it makes sense to have a self-contained bsseval package and either deprecate the implementation in mir_eval, or refactor it to depend on the bsseval implementation. A few arguments for this:

  1. Google-ability, as mentioned above.
  2. It's the only functionality in mir_eval that depends on audio signal processing (not counting sonification), and has always been a bit of an odd one out.
  3. Potentially wider adoption / more maintainers. The bss_eval metrics in mir_eval are somewhat baroque (to maintain compatibility with the MATLAB package), and AFAIK we don't really have a dedicated maintainer for that at this point. I've spent enough time in the guts of this module to know what a pain it is to get right (BLAS dependencies, anyone?), so anything we can do to increase participation / maintainership is a good thing IMO.

craffel commented 5 years ago

I have similar thoughts as Brian on this.

justinsalamon commented 5 years ago

Looks like there's agreement about making bsseval self-contained? The next question would be, then, whether the mir_eval implementation should be deprecated or refactored to wrap around bsseval.

I think the latter would be nice (this way MIR folk can continue to use their existing eval pipelines via mir_eval), but I'm curious to hear whether there are any counter-arguments to this option?

faroit commented 5 years ago

Thanks for your feedback. I think wrapping around bsseval makes sense. The current museval is actually quite close to the eval pipeline of mir_eval with respect to the metric artifacts. We would then include a mir_eval API compatibility layer so that using bsseval could be a drop-in replacement for mir_eval.separation.evaluate.
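
A very rough sketch of what such a compatibility layer might look like (the function name and dict keys are hypothetical; I'm assuming museval.evaluate returns frame-wise SDR/ISR/SIR/SAR arrays of shape (nsrc, nframes) – please double-check against the museval docs):

```python
import numpy as np
import museval

def evaluate_compat(reference_sources, estimated_sources):
    """Hypothetical mir_eval-style wrapper backed by bsseval v4 (museval).

    reference_sources, estimated_sources: arrays of shape
    (nsrc, nsamples, nchan). Returns a dict of per-source scores,
    aggregated over frames by the median (as in SiSEC 2018).
    """
    sdr, isr, sir, sar = museval.evaluate(reference_sources, estimated_sources)
    return {
        'Source to Distortion': np.nanmedian(sdr, axis=1).tolist(),
        'Image to Spatial distortion': np.nanmedian(isr, axis=1).tolist(),
        'Source to Interference': np.nanmedian(sir, axis=1).tolist(),
        'Source to Artifact': np.nanmedian(sar, axis=1).tolist(),
    }
```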

Now we need to work on bsseval to match the rigor of mir_eval with respect to testing and code coverage – especially regression tests. I will get back to you here when we are done.