Hi @treder Matthias! I increasingly encounter user scenarios here at the Donders where people want to train (and test) classifiers with missing data, represented by NaNs. I think that part of this challenge can be handled relatively generically by means of an appropriate preprocess function. The performance metric computation also needs to be adjusted, but that may require some more thought; for now I only adjusted the accuracy computation. Here's a first attempt, roughly along the lines of the sketch below.
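(A minimal sketch of the idea, not the actual code from the PR; the `[pparam, X, clabel]` in/out signature and the function name `mv_preprocess_replacenan` are assumptions based on MVPA-Light's preprocess conventions.)

```matlab
% Minimal sketch (not the actual attempt): replace NaNs in each feature
% by random draws from the observed values of that same feature. The
% [pparam, X, clabel] in/out signature follows MVPA-Light's other
% mv_preprocess_* functions (assumed); the function name is hypothetical.
function [pparam, X, clabel] = mv_preprocess_replacenan(pparam, X, clabel)

sz = size(X);
X  = reshape(X, sz(1), []);   % flatten to [samples x features]

for f = 1:size(X, 2)
    bad  = isnan(X(:, f));
    good = find(~bad);
    if any(bad) && ~isempty(good)
        % draw replacement values at random from the observed samples
        X(bad, f) = X(good(randi(numel(good), nnz(bad), 1)), f);
    end
end

X = reshape(X, sz);           % restore original dimensionality
end
```

Drawing replacements per feature keeps features unmixed; whether this should also respect the class labels is discussed further down.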
Hi @schoffelen that's a good point - I don't have time during the week but will have a look into the code over the weekend and get back asap!
I've looked through your code, looks like a great start. Just a couple of comments:

`replacenan` -> `replace_nan` for readability, or alternatively `impute_nan` (see below).

What do you think about the following more generic approach:
- `pparam.imputation_dimension`: dimension(s) along which imputation is performed. E.g. imagine [samples x electrodes x time] data. If `imputation_dimension = 3`, imputation is performed across time, that is, other time points (within a given trial and electrode) would be used to replace the NaNs.
- `pparam.method` can be `forward` (earlier elements in the array are used, e.g. [4, 6, NaN, NaN, 2, 1] becomes [4, 6, 6, 6, 2, 1]), `backward` ([4, 6, NaN, NaN, 2, 1] becomes [4, 6, 2, 2, 2, 1]), or `nearest` ([4, 6, NaN, NaN, 2, 1] becomes [4, 6, 6, 2, 2, 1]). The method in your code could be called `random` or such.
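To pin down the semantics, here is a minimal sketch using MATLAB's built-in `fillmissing` (available since R2016b) as a stand-in; the toolbox implementation doesn't have to use it:

```matlab
% Minimal sketch of the proposed fill methods via MATLAB's built-in
% fillmissing (R2016b+); illustrative only.
x = [4, 6, NaN, NaN, 2, 1];

fillmissing(x, 'previous')   % forward:  [4 6 6 6 2 1]
fillmissing(x, 'next')       % backward: [4 6 2 2 2 1]
fillmissing(x, 'nearest')    % nearest:  [4 6 6 2 2 1]

% For [samples x electrodes x time] data, imputation_dimension = 3
% would correspond to filling along the third dimension:
% X = fillmissing(X, 'previous', 3);
```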
Thanks for the directions Matthias. Glad that it is not a stupid idea altogether :). I will familiarize myself with imputation and update for a next round of comments... Indeed, currently most use cases contain NaNs for all features in a given sample, but that does not need to be the case; I'll make it more flexible.
I am pinging @kdong96 on this PR, as an important stakeholder for this functionality.
Thanks @schoffelen and @KDong96. Very happy to write part of the code, btw. So if you want, you could focus on the cases relevant to Donders and I could add in some generic stuff (like forward/backward imputation) to cover the general case.
I implemented forward/backward imputation per the definition above, plus some corresponding unit tests (sketched below). It doesn't address your specific use case yet, though; I wanted to sync before looking into that.
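E.g. along these lines (plain-MATLAB sketch, with `fillmissing` standing in for the new preprocess function; the actual unit tests may look different):

```matlab
% Sketch of a forward/backward imputation test; fillmissing stands in
% for the toolbox function here, the real unit tests may differ.
x = [4, 6, NaN, NaN, 2, 1];
assert(isequal(fillmissing(x, 'previous'), [4, 6, 6, 6, 2, 1]))  % forward
assert(isequal(fillmissing(x, 'next'),     [4, 6, 2, 2, 2, 1]))  % backward
```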
Do you have an idea how to address it more generically (and how to name the method)?
Thanks for working on this @treder, much appreciated! I tried to search a bit for common approaches to imputation (beyond the reference you sent around earlier). The most common term I encountered is Hot Deck. Although it seems as if backward and forward are also special cases of hot deck, generically it may refer to random replacement. Might that be a good name?
With respect to code specifics: looking more and more at your code, I think I am starting to understand the details of your generic coding strategy. Yet, I am not sure whether I am already capable of suggesting details that will get the Treder stamp of approval. My two cents are that a generic implementation would probably draw at random across the sample dimension, for each feature (i.e. of course not mixing across features), which is more or less what my suggested code was doing, no?
Since hot deck is the overarching approach, perhaps stick with the methods `forward`, `backward` and `random`? A few changes:

- `isfinite` instead of `isnan` like you did in your first draft. Good to catch the nasty `inf` too.
- For `method=random`: `param.use_clabel` is an optional parameter specifying whether you want to impute class-aware or not (e.g. for regression tasks you'd want to use `param.use_clabel=0`); see the sketch below.
- Only the `nan` part of the data is replaced, not all of it (unless everything is nan, of course).
- In cross-validation, preprocessing is applied separately to the train and test data (each with its own `clabel`), so the train and test set are independent anyway.

Let me know what you think. Btw, sorry for being difficult; I try to keep the code as generic as possible, partly to keep it simple and not lose track of the codebase - which I did some time ago anyway ;)
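The class-aware variant could look roughly like this (sketch only, names hypothetical):

```matlab
% Rough sketch of class-aware random imputation (names hypothetical):
% within each class, NaNs in a feature are replaced by random draws from
% the observed values of that same feature in the same class.
function X = impute_random_classaware(X, clabel)

for c = unique(clabel(:))'    % loop over classes
    idx = (clabel(:) == c);
    Xc  = X(idx, :);
    for f = 1:size(Xc, 2)
        bad  = isnan(Xc(:, f));
        good = find(~bad);
        if any(bad) && ~isempty(good)
            Xc(bad, f) = Xc(good(randi(numel(good), nnz(bad), 1)), f);
        end
    end
    X(idx, :) = Xc;
end
end
```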
Btw, I think the code is ready for merge. The question is whether it addresses your original problem sufficiently.
Lastly, the question regarding nan-robust metrics is slightly different from that of imputation, perhaps best addressed in a separate PR at some point?
Great, thanks a lot @treder. I suggest merging it, so that we can start playing around with it locally. I think that our initial use case is covered, btw. And indeed, I agree that nan-robust metrics are related, but a different story. Thanks for your help!
Perfect, I've just merged the PR - thanks again for raising this!