treder / MVPA-Light

Matlab toolbox for classification and regression of multi-dimensional data
MIT License
70 stars 35 forks source link

ENH - first (WIP) commit to support NaNs in input data #36

Closed schoffelen closed 1 year ago

schoffelen commented 1 year ago

Hi @treder Matthias! I increasingly encounter user scenarios here at the Donders, where people want to train (and test) classifiers with missing data, represented by NaNs. I think that part of this challenge can be handled relatively generically by means of an appropriate preprocess function. Here's a first attempt. The performance metric computation also needs to be adjusted, but that may require some more thought, for now I only adjusted the accuracy computation.

treder commented 1 year ago

Hi @schoffelen that's a good point - I don't have time during the week but will have a look into the code in the weekend and get back asap!

treder commented 1 year ago

I've looked through your code, looks like a great start. Just couple of comments:

What do you think about the following more generic approach:

schoffelen commented 1 year ago

Thanks for the directions Matthias. Glad that it is not a stupid idea altogether :). I will familiarize myself with imputation, and update for a next round of comments... Indeed, currently most use cases contain NaNs for all features in a given sample, but that does not need to be the case, I'll make it more flexible.

schoffelen commented 1 year ago

I am pinging @kdong96 on this PR, as an important stakeholder for the functionality

treder commented 1 year ago

Thanks @schoffelen and @KDong96. Very happy to write part of the code btw. So if you want you could focus on the cases relevant to Donders and I could add in some generic stuff (like forward/backward imputation) to cover the general case.

treder commented 1 year ago

I implemented forward/backward impute per definition above and some corresponding unittests. It doesn't address your specific usecase though I wanted to sync before looking into that.

Do you have an idea how to address it more generically (and how to name the method)?

schoffelen commented 1 year ago

Thanks for working on this @treder, much appreciated! I tried to search a bit for common ways for imputation (beyond the reference you sent around earlier). The most common term I encountered is Hot Deck. Although it seems as if backward and forward are also special cases of hot deck, generically, it may refer to random replacement. Might that be a good name?

With respect to code specifics: In looking more and more at your code, I think I start to understand details of this your generic coding strategy more and more. Yet, I am not sure whether I am already capable of suggesting details that will get the Treder stamp of approval. My two cents are that a generic implemention would probably draw at random across the sample dimension, for each feature (i.e. of course not mixing across features), which is more or less what my suggested code was doing, no?

treder commented 1 year ago

Since hot deck is the overarching approach, perhaps stick with the methods forward, backward and random? Few changes:

For method=random:

Let me know what you think. Btw sorry for being difficult, I try to keep the code as generic as possible partly to keep it simple and not lose track of the codebase - which I did some time ago anyway ;)

treder commented 1 year ago

Btw, I think the code is ready for merge. The question whether it addresses your original problem sufficiently.

Lastly, the question regarding nan-robust metrics is slightly different from that of imputation, perhaps best addressed in a separate PR at some point?

schoffelen commented 1 year ago

Great, thanks a lot @treder. I suggest to merge it, so that we can start playing around with it locally. I think that our initial use case is covered BTW. And indeed I agree that nan-robust metrics is related, but a different story. Thanks for your help!

treder commented 1 year ago

Perfect, I've just merged the PR - thanks again for raising this!