Hi @treder Matthias! I increasingly encounter user scenarios here at the Donders where people want to train (and test) classifiers with missing data, represented by NaNs. I think that part of this challenge can be handled relatively generically by means of an appropriate preprocess function. The performance metric computation also needs to be adjusted, but that may require some more thought; for now I only adjusted the accuracy computation. Here's a first attempt, roughly along the lines of the sketch below.
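(A minimal sketch of the idea, not the actual code from the PR; the `[pparam, X, clabel]` in/out signature and the function name `mv_preprocess_replacenan` are assumptions based on MVPA-Light's preprocess conventions.)

```matlab
% Minimal sketch (not the actual attempt): replace NaNs in each feature
% by random draws from the observed values of that same feature. The
% [pparam, X, clabel] in/out signature follows MVPA-Light's other
% mv_preprocess_* functions (assumed); the function name is hypothetical.
function [pparam, X, clabel] = mv_preprocess_replacenan(pparam, X, clabel)

sz = size(X);
X  = reshape(X, sz(1), []);   % flatten to [samples x features]

for f = 1:size(X, 2)
    bad  = isnan(X(:, f));
    good = find(~bad);
    if any(bad) && ~isempty(good)
        % draw replacement values at random from the observed samples
        X(bad, f) = X(good(randi(numel(good), nnz(bad), 1)), f);
    end
end

X = reshape(X, sz);           % restore original dimensionality
end
```

Drawing replacements per feature keeps features unmixed; whether this should also respect the class labels is discussed further down.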
Hi @schoffelen that's a good point - I don't have time during the week but will have a look into the code over the weekend and get back asap!
I've looked through your code, looks like a great start. Just a couple of comments:

`replacenan` -> `replace_nan` for readability, or alternatively `impute_nan` (see below).

What do you think about the following more generic approach:
- `pparam.imputation_dimension`: dimension(s) along which imputation is performed. E.g. imagine [samples x electrodes x time] data. If `imputation_dimension = 3`, imputation is performed across time, that is, other time points (within a given trial and electrode) would be used to replace the NaNs.
- `pparam.method` can be `forward` (earlier elements in the array are used, e.g. [4, 6, NaN, NaN, 2, 1] becomes [4, 6, 6, 6, 2, 1]), `backward` ([4, 6, NaN, NaN, 2, 1] becomes [4, 6, 2, 2, 2, 1]), or `nearest` ([4, 6, NaN, NaN, 2, 1] becomes [4, 6, 6, 2, 2, 1]). The method in your code could be called `random` or such.
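To pin down the semantics, here is a minimal sketch using MATLAB's built-in `fillmissing` (available since R2016b) as a stand-in; the toolbox implementation doesn't have to use it:

```matlab
% Minimal sketch of the proposed fill methods via MATLAB's built-in
% fillmissing (R2016b+); illustrative only.
x = [4, 6, NaN, NaN, 2, 1];

fillmissing(x, 'previous')   % forward:  [4 6 6 6 2 1]
fillmissing(x, 'next')       % backward: [4 6 2 2 2 1]
fillmissing(x, 'nearest')    % nearest:  [4 6 6 2 2 1]

% For [samples x electrodes x time] data, imputation_dimension = 3
% would correspond to filling along the third dimension:
% X = fillmissing(X, 'previous', 3);
```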
Thanks for the directions Matthias. Glad that it is not a stupid idea altogether :). I will familiarize myself with imputation and update for a next round of comments... Indeed, currently most use cases contain NaNs for all features in a given sample, but that does not need to be the case; I'll make it more flexible.
I am pinging @kdong96 on this PR, as an important stakeholder for this functionality.
Thanks @schoffelen and @KDong96. Very happy to write part of the code, btw. So if you want, you could focus on the cases relevant to Donders and I could add in some generic stuff (like forward/backward imputation) to cover the general case.
I implemented forward/backward imputation per the definition above, plus some corresponding unit tests (sketched below). It doesn't address your specific use case yet, though; I wanted to sync before looking into that.
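E.g. along these lines (plain-MATLAB sketch, with `fillmissing` standing in for the new preprocess function; the actual unit tests may look different):

```matlab
% Sketch of a forward/backward imputation test; fillmissing stands in
% for the toolbox function here, the real unit tests may differ.
x = [4, 6, NaN, NaN, 2, 1];
assert(isequal(fillmissing(x, 'previous'), [4, 6, 6, 6, 2, 1]))  % forward
assert(isequal(fillmissing(x, 'next'),     [4, 6, 2, 2, 2, 1]))  % backward
```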
Do you have an idea how to address it more generically (and how to name the method)?
Thanks for working on this @treder, much appreciated! I tried to search a bit for common approaches to imputation (beyond the reference you sent around earlier). The most common term I encountered is Hot Deck. Although it seems as if backward and forward are also special cases of hot deck, generically it may refer to random replacement. Might that be a good name?
With respect to code specifics: looking more and more at your code, I think I am starting to understand the details of your generic coding strategy. Yet, I am not sure whether I am already capable of suggesting details that will get the Treder stamp of approval. My two cents are that a generic implementation would probably draw at random across the sample dimension, for each feature (i.e. of course not mixing across features), which is more or less what my suggested code was doing, no?
Since hot deck is the overarching approach, perhaps stick with the methods `forward`, `backward` and `random`? A few changes:

- `isfinite` instead of `isnan` like you did in your first draft. Good to catch the nasty `inf` too.
- For `method=random`: `param.use_clabel` is an optional parameter specifying whether you want to impute class-aware or not (e.g. for regression tasks you'd want to use `param.use_clabel=0`); see the sketch below.
- Only the `nan` part of the data is replaced, not all of it (unless everything is nan, of course).
- In cross-validation, preprocessing is applied separately to the train and test data (each with its own `clabel`), so the train and test set are independent anyway.

Let me know what you think. Btw, sorry for being difficult; I try to keep the code as generic as possible, partly to keep it simple and not lose track of the codebase - which I did some time ago anyway ;)
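The class-aware variant could look roughly like this (sketch only, names hypothetical):

```matlab
% Rough sketch of class-aware random imputation (names hypothetical):
% within each class, NaNs in a feature are replaced by random draws from
% the observed values of that same feature in the same class.
function X = impute_random_classaware(X, clabel)

for c = unique(clabel(:))'    % loop over classes
    idx = (clabel(:) == c);
    Xc  = X(idx, :);
    for f = 1:size(Xc, 2)
        bad  = isnan(Xc(:, f));
        good = find(~bad);
        if any(bad) && ~isempty(good)
            Xc(bad, f) = Xc(good(randi(numel(good), nnz(bad), 1)), f);
        end
    end
    X(idx, :) = Xc;
end
end
```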
Btw, I think the code is ready for merge. The question is whether it addresses your original problem sufficiently.
Lastly, the question regarding nan-robust metrics is slightly different from that of imputation, perhaps best addressed in a separate PR at some point?
Great, thanks a lot @treder. I suggest merging it, so that we can start playing around with it locally. I think that our initial use case is covered, btw. And indeed, I agree that nan-robust metrics are related, but a different story. Thanks for your help!
Perfect, I've just merged the PR - thanks again for raising this!