rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Add option to drop NaNs in SequentialFeatureSelector #134

Open jlopezpena opened 7 years ago

jlopezpena commented 7 years ago

I am trying to use SequentialFeatureSelector on a dataset where the number of available features is on the same order of magnitude as the number of samples. The dataset has lots of missing values (NaNs) that cannot be imputed from other samples: they simply don't make sense in some cases.

I can obviously drop NaNs before feeding anything to the feature selector, but this needlessly reduces the number of available data points: the missing values don't always occur in the same rows, and I'd never want to fit my model using all columns at the same time anyway.

A way around this problem would be to allow the NA dropping to (optionally) happen within ColumnSelector.transform. That way, I'd only be dropping rows that have NAs in the specific columns needed for a given test. This, however, breaks the sklearn API, as it would require the transform method to also modify the target vector y, so it seems I cannot just add a custom NA-dropping transformer in front of the base estimator.

An alternative solution could be to hard-code the dropna within the SequentialFeatureSelector._calc_score method, calling it before using cross-validation (find the row indices that contain NAs for the selected columns, then slice X and y by those rows before calling the scoring function). Would this be an acceptable/desirable change? I can put together a quick implementation if you think it is worth it.
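The proposed change can be sketched as a standalone helper (the function name is made up for illustration; this is not the actual `_calc_score` implementation): restrict X to the candidate columns, mask out only the rows that are incomplete for those columns, and cross-validate on what remains.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def score_subset_dropping_nans(estimator, X, y, columns, cv=3):
    """Hypothetical helper illustrating the proposal: keep only the
    rows that are complete for the *selected* columns, then score
    the estimator by cross-validation on that row subset."""
    X_sub = np.asarray(X, dtype=float)[:, columns]
    mask = ~np.isnan(X_sub).any(axis=1)  # rows complete for this subset
    return cross_val_score(estimator, X_sub[mask],
                           np.asarray(y)[mask], cv=cv).mean()
```

Different column subsets would then be scored on different (overlapping) row subsets, which is exactly the behavior being requested, though it also means the scores are no longer computed on identical samples.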

rasbt commented 7 years ago

Thanks for sharing your ideas on this! So, just to make sure I understand correctly: you have a dataset where samples have NaNs in certain features, but you don't want to remove those rows upfront (before feeding the data to the pipeline) because you don't yet know whether the features containing the NaNs matter or not? I am not sure, but I am a bit hesitant to add more complexity to the feature selection algorithm itself; I'd much rather have a more flexible pipeline wrapper. And in a practical sense, could feature imputation maybe be a solution to that problem, i.e., assigning some sort of "average" value for a given feature before feature selection so that it doesn't carry much weight?

jlopezpena commented 7 years ago

Apologies for the very late reply.

It seems my problem is quite unique. I have a huge number of features (compared to the dataset size), almost all of them have missing values, the locations of the missing values are not distributed homogeneously, and feature imputation is not straightforward as there are features with very particular meaning that cannot simply be averaged.

Now, some of the features are only present in a small subset of the dataset, but happen to be very informative for that subset. I want my feature selection to figure that out, and then perhaps train an "expert subsystem" specializing in that subset. In other cases, I will want to single out features that are worth imputing. As I mentioned, imputation is not straightforward and would need to be done on a feature-by-feature basis, but I'd like to know there is value in a feature before any time is spent filling in its missing values.

I had my own pipeline for sequential selection built with these things in mind, and was wondering if I could use your implementation instead. The root of my problem seems to be this limitation with scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/4143

I can work my way around it for the time being; once that sklearn issue gets sorted out, the solution will be as simple as implementing a DropNA transformer.

rasbt commented 7 years ago

No worries!

I must say that your problem sounds like a very interesting one. I don't think it's unique; I'd even say that it's a very common one! However, the challenge is, I guess, that it can't be so easily automated, because it needs a unique treatment (like the column imputation you mentioned).

The part that could be automated, though, is something like sampling different training subsets whose columns don't contain missing values, and dropping all the others. One approach would be to train models on these different subsets, analyze the results where a minimum number of samples is present, and from there look further into which feature columns may be worth imputing at all (vs. throwing out potential noise features to shrink the combinatorial search space).
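A first step toward that subset-sampling idea could look like the sketch below (the function and its parameters are made up for illustration): enumerate small column subsets, count how many rows are complete for each, and keep only the subsets that retain a minimum number of samples.

```python
import numpy as np
from itertools import combinations

def complete_row_counts(X, subset_size=2, min_rows=10):
    """Hypothetical sketch: for every column subset of the given size,
    count the rows with no NaN in those columns, keeping only subsets
    that retain at least `min_rows` samples."""
    X = np.asarray(X, dtype=float)
    counts = {}
    for cols in combinations(range(X.shape[1]), subset_size):
        n_complete = int((~np.isnan(X[:, list(cols)]).any(axis=1)).sum())
        if n_complete >= min_rows:
            counts[cols] = n_complete
    return counts
```

The surviving subsets could then each be cross-validated separately, shrinking the combinatorial search space before any imputation effort is spent.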

However, I would say that dropping NaNs shouldn't be done in the feature selection algorithm or the learning algorithm itself; it would be better to write (yet another) wrapper around it. The advantage is that such a wrapper could work with different kinds of algorithms, so this "feature" wouldn't have to be injected into each individual selection or learning algorithm. Maybe this would make a fun weekend project :)
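A minimal sketch of such a wrapper, assuming a class name and structure invented here (it is not part of mlxtend or sklearn): because the row filtering happens inside fit(), X and y shrink together without violating the transformer API, and a feature selector that slices columns before fitting would drop only the rows missing values in the currently selected columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone

class NaNDroppingEstimator(BaseEstimator):
    """Hypothetical wrapper: drop NaN-containing rows inside fit()
    so that X and y are filtered together."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mask = ~np.isnan(X).any(axis=1)  # rows complete for these columns
        self.estimator_ = clone(self.estimator).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # Note: NaNs at predict time would still reach the inner model.
        return self.estimator_.predict(np.asarray(X, dtype=float))

    def score(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mask = ~np.isnan(X).any(axis=1)
        return self.estimator_.score(X[mask], y[mask])
```

One caveat with this design: each candidate feature subset ends up being scored on a different set of rows, so the cross-validation scores are not computed on identical samples.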

jlopezpena commented 7 years ago

Totally agree. The separate wrapper is kind of what I am doing in my own code; my biggest issue with it is that I cannot hook it up with the rest of the pipelines.

What would be useful, not only for this problem but also for other stuff, is a more flexible version of the Transformer class that operates on both X and y (and possibly further arrays; I have some R-based models that use a different signature!) at the same time. Possible applications of this would be:

The biggest challenge is making this work with the different operations in sklearn. If you decide to put some work into this, let me know; I'll be happy to help!

metin-akyol commented 5 years ago

I have the exact same problem as the OP described: missing values randomly distributed across features, where dropping missing values from all columns simultaneously before getting to the pipeline deletes observations unnecessarily. An extreme example to illustrate this would be having one feature that is sparsely filled. Dropping all rows with NaNs ahead of time would mean losing a lot of fully filled rows, even in cases where that one particular feature isn't even used. Did you guys ever implement a solution to this?
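The sparse-feature scenario above is easy to reproduce on toy data (column names invented for illustration): one column filled in only 10% of rows wipes out 90% of the dataset under a global dropna, even though the other columns are fully complete.

```python
import numpy as np
import pandas as pd

# Toy version of the "one sparse feature" case described above.
rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(100, 4), columns=["f1", "f2", "f3", "sparse"])
df.loc[df.index % 10 != 0, "sparse"] = np.nan  # 'sparse' filled in 10% of rows

print(len(df.dropna()))                      # global dropna: 10 rows survive
print(len(df[["f1", "f2", "f3"]].dropna()))  # without 'sparse': all 100 rows
```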

rasbt commented 5 years ago

As far as I am aware, there's no recommended solution to that problem yet. I just reopened the issue in case someone wants to take a crack at it; this is an interesting problem and could be useful in practice.

metin-akyol commented 5 years ago

Thank you so much!! I agree, I would think this would happen a lot.