So we have plenty of great ideas for implementing data analysis tools and using our features.
However, we have also run into some interesting limitations in certain situations, e.g.:
Detecting electrode disconnection, see #16 , #23 (or other artifacts)
Detecting bad electrodes
There is an amazing number of methods that could deal with these problems. However, instead of applying a method and hoping it works, what we are interested in is continuous validated learning, i.e. being able to quickly know whether our method works and to quantify its success.
my proposal:
each described issue should have a quantified effect on our existing pipeline (e.g. "we observe that more than 50% of electrode disconnections are classified as bursts").
each issue should have a list of hypotheses (e.g. "electrode disconnections increase instantaneous power to over 100 times the baseline, which is higher than the burst threshold of 2").
we give a short list of solutions with expected results (e.g. "the amplitude-thresholding classifier should also have a higher threshold and will reject any amplitude higher than 20 times the baseline").
we should associate a simple benchmark with each issue that returns a concrete result (e.g. "instead of 50% of electrode disconnections, only 5% are now misclassified as bursts, without modifying the burst detection").
all methods should be automatically benchmarked as they are developed. We should be able to tell at any given time how they are doing against competing methods, should those need to be developed.
one simple way to do that would be to apply very simple data simulation and use a pytest-like scheme to benchmark the related methods (see the sketch just after this list). MOABB is a very good example of the kind of thing that would be great (although in our case it is a bit overkill).
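To make the pytest-like scheme concrete, here is a minimal sketch of what such a benchmark could look like, assuming a toy disconnection simulator and a toy amplitude-threshold burst classifier. The function names (`simulate_disconnection`, `detect_bursts`), the thresholds, and the saturation value are illustrative assumptions, not our actual pipeline.

```python
import numpy as np


def simulate_disconnection(n_samples=2000, disconnect_at=1000, rail=500.0, seed=0):
    """EEG-like unit-variance noise that saturates near an (assumed) amplifier
    rail after the simulated electrode disconnection."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(0.0, 1.0, n_samples)
    signal[disconnect_at:] = rail + rng.normal(0.0, 1.0, n_samples - disconnect_at)
    return signal, disconnect_at


def detect_bursts(signal, baseline_std=1.0, burst_thr=2.0, reject_thr=20.0):
    """Toy amplitude classifier: a sample counts as a burst if its amplitude
    lies between the burst threshold and the proposed rejection threshold."""
    amp = np.abs(signal)
    return (amp > burst_thr * baseline_std) & (amp <= reject_thr * baseline_std)


def test_disconnection_rarely_classified_as_burst():
    signal, disconnect_at = simulate_disconnection()
    bursts = detect_bursts(signal)
    # Concrete, quantified pass criterion from the proposal: at most 5% of the
    # disconnected samples may end up labelled as bursts.
    misclassified = bursts[disconnect_at:].mean()
    assert misclassified <= 0.05
```

Running `pytest` on a file like this returns exactly the kind of concrete number described above ("only X% of disconnected samples are misclassified as bursts"), and it can be re-run automatically every time a method changes.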
why make this so complicated?
I understand that it seems completely overkill to develop such a complicated framework to develop ONE feature, but let me share my experience:
I have developed dozens or even hundreds of artifact cleaning/rejecting/cancelling methods in my career. While most looked very nice and fancy while I was working on them, none stayed around for long. Why? Was I always developing better and better methods? Probably not... The sad reason is that I had no way to compare them easily (without spending hours on data comparison). So it was easier for me to just say "oh well, it works nicely. It looks better than the previous method".
we need to end that cycle of developing tons of fancy methods with no clear way to benchmark them. We need to know exactly why and how our methods influence our pipeline.
I believe that if we have this architecture of continuous benchmarking, each method we develop will be truly useful for a long time and won't be rotting somewhere after we publish.
dealing with the increasing number of methods