scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

correlation feature selection? #13405

Open amueller opened 5 years ago

amueller commented 5 years ago

I wonder whether we should add unsupervised feature selection using correlation. It's a strategy that's very common in statistics, and we might add it just for completeness. It's certainly more useful than the variance-based one, which I think is useless for any threshold != 0.
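To illustrate that last point, here is a small sketch (not part of any proposal) of why a nonzero VarianceThreshold is scale-dependent, whereas a correlation-based criterion would not be, since correlation is scale-invariant:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X[:, 1] *= 0.01  # same shape of distribution as column 0, just on a smaller scale

selector = VarianceThreshold(threshold=0.5)
print(selector.fit_transform(X).shape)        # (100, 1): the small-scale column is dropped
print(selector.fit_transform(X * 100).shape)  # (100, 2): rescaling changes the outcome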

AbhishekBabuji commented 5 years ago

Can I take this? My knee-jerk thinking on how to approach this would be to scan the numerical features and filter out the features that fall below a certain threshold. If you feel this is not something a beginner can do, then I'll sit it out. I've looked at the code base and I understand how you check for the correct datatype (array-like etc.), and I know I would add this code alongside the existing feature selection section.

hermidalc commented 5 years ago

I already do this using the cor() function in R, exposed to sklearn via rpy2. The feature selection method should let you choose between, for example, Pearson (default), Spearman, or Kendall correlation, and should also require a cutoff threshold for correlation (or anti-correlation). The same could be done with scipy.
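For illustration, a minimal sketch of that idea on the Python side; pandas.DataFrame.corr already supports 'pearson', 'spearman' and 'kendall', and the function name correlated_columns is made up here:

import numpy as np
import pandas as pd

def correlated_columns(X, threshold=0.9, method="pearson"):
    """Indices of columns whose absolute correlation with an earlier column
    exceeds the threshold; method is 'pearson', 'spearman' or 'kendall'."""
    corr = pd.DataFrame(X).corr(method=method).abs().to_numpy()
    upper = np.triu(corr, k=1)  # consider each pair of features only once
    return [j for j in range(upper.shape[1]) if np.any(upper[:, j] > threshold)]

X = np.array([[3.0, 2.0, 9.0], [2.0, 1.0, 2.0], [2.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
print(correlated_columns(X, method="spearman"))  # [1]: column 1 is perfectly rank-correlated with column 0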

thomasjpfan commented 5 years ago

If two features are correlated (above a certain threshold), what would be a good deterministic way to decide which one to keep? The simplest solution would be to keep the feature with the smaller column index.

hermidalc commented 5 years ago

If two features are correlated (above a certain threshold), what would be a good deterministic way to decide which one to keep? The simplest solution would be to keep the feature with the smaller column index.

I think this should always be left to the user and their particular problem. For some problems you do not want to remove correlated features and for some you want to select a representative one to remove redundancy.

I think this new functionality should calculate the pairwise correlation values between features, as well as their associated p-values, and then allow the user to select/filter features based on thresholds on these. Later we might add further options that allow filtering for different use cases.
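For reference, scipy already returns both the coefficient and the p-value for a pair of features (scipy.stats.spearmanr and kendalltau have the same (statistic, p-value) return shape), so a sketch of the pairwise computation could look like this:

from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

X = np.array([[3.0, 2.0, 9.0], [2.0, 1.0, 2.0], [2.0, 1.0, 1.0], [1.0, 0.0, 3.0]])

# pairwise correlation coefficients and their p-values between feature columns;
# a selector could then threshold on |r|, on p, or on both
for i, j in combinations(range(X.shape[1]), 2):
    r, p = pearsonr(X[:, i], X[:, j])
    print(f"features ({i}, {j}): r={r:+.3f}, p={p:.3f}")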

amueller commented 5 years ago

@thomasjpfan I think the standard is to drop the feature that's most correlated with the remaining set of features.
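As a rough sketch of that rule (the name drop_most_correlated and breaking ties by mean absolute correlation are assumptions, not an agreed design): repeatedly drop the feature that is, on average, the most correlated with the remaining ones until no pair exceeds the threshold.

import numpy as np

def drop_most_correlated(X, threshold=0.9):
    """While some pair of remaining features exceeds the threshold, drop the
    feature with the highest mean absolute correlation to the remaining set."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        if corr.max() < threshold:
            break
        keep.pop(int(corr.mean(axis=1).argmax()))  # most correlated with the rest
    return X[:, keep], keep

X = np.array([[3.0, 2.0, 9.0], [2.0, 1.0, 2.0], [2.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
X_reduced, kept = drop_most_correlated(X)  # drops one of the two perfectly correlated columns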

TimothyWillard commented 5 years ago

For some problems you do not want to remove correlated features and for some you want to select a representative one to remove redundancy.

Perhaps it might be worth considering an option that could be passed to the correlation feature selection class that would tell it to exclude certain columns from consideration.

Also, would this include the ability to pick how the correlation coefficient is calculated, e.g. use a nonparametric method like Kendall tau instead of Pearson's correlation?

rick2047 commented 5 years ago

If nobody is working on this, I would like to take this up. @AbhishekBabuji have you made any progress?

SSaishruthi commented 5 years ago

Is anyone working on this? I can take a look if no one is working on it.

jnothman commented 5 years ago

It doesn't seem to have active work, @SSaishruthi

SSaishruthi commented 5 years ago

@jnothman I would like to take this up.

Do you have any pointers to start with?

I read through the issue. As per my understanding, the function should compute the correlation between features followed by the selection process.

How do you think the selection can happen? Would removing the feature that is, on average, the most correlated with the other features work?

hermidalc commented 5 years ago

@SSaishruthi please read my previous comments on this. I think it would be a big mistake for sklearn to force filtering out features that are correlated with others; there are problems/use cases where you want to keep the most correlated features and filter out those which are not.

The functionality should be left to the user, or sklearn should provide both: filtering out the most redundant correlated features (keeping representative features that do not correlate with each other) AND filtering out features that do not correlate with others (keeping the features that do correlate with each other).

SSaishruthi commented 5 years ago

@hermidalc Thanks for the inputs. I think we can have two functionalities.

jnothman commented 5 years ago

There's an approach called mRMR - "maximum relevance minimum redundancy"
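For context, a rough sketch of that greedy idea (the mutual-information-difference variant), using scikit-learn's mutual information estimators; mrmr_select and n_features are illustrative names, and a real implementation would need to be much faster:

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, n_features=5, random_state=0):
    """Greedily pick features maximizing relevance (MI with y) minus mean
    redundancy (MI with the features already selected)."""
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]
    candidates = [j for j in range(X.shape[1]) if j != selected[0]]
    while candidates and len(selected) < n_features:
        # mean MI of every candidate with the already-selected features
        redundancy = np.mean(
            [mutual_info_regression(X[:, candidates], X[:, s], random_state=random_state)
             for s in selected],
            axis=0,
        )
        best = candidates[int(np.argmax(relevance[candidates] - redundancy))]
        selected.append(best)
        candidates.remove(best)
    return selected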

amueller commented 5 years ago

mRMR is supervised, right? I think having both supervised and unsupervised methods would be helpful.

@hermidalc the point is to provide a transformer so this could be done in a pipeline. It's very easy to compute pairwise covariance and Pearson correlation. It's not sklearn forcing anything; it's sklearn providing an estimator that does a certain transformation. Whether it's appropriate for a given dataset is up to the user. But the point would be to have it in a form that could be used as a transformation.

hermidalc commented 5 years ago

@hermidalc the point is to provide a transformer so this could be done in a pipeline. It's very easy to compute pairwise covariance and Pearson correlation. It's not sklearn forcing anything; it's sklearn providing an estimator that does a certain transformation. Whether it's appropriate for a given dataset is up to the user. But the point would be to have it in a form that could be used as a transformation.

Sorry, maybe I've misunderstood this thread. I thought the goal was to build a correlation feature selection method that, on fit, computes the correlation between features (using one of a few possible methods); my input was that, on transform, we should provide more than one way to filter features using those correlations.

From my experience in compbio research, many times you don't want to remove all but one or a few representative features from each group of correlated features. Here's an example: suppose you have a gene expression dataset in which none of the genes are strongly correlated, though there are weaker correlations. Only by selecting the entire group of weakly correlated genes together will you discover that they have good predictive value; if you select only one or a few representatives, you will find nothing.

If you want to go down the route of adding multivariate, or conditionally univariate, correlation-based feature selection methods such as CFS, FCBF, mRMR, or ReliefF, then I understand this is more than what I thought was the goal. Multivariate methods are terribly slow, so it's important to have bindings, or the compute-intensive parts written in C, C++, Cython, etc. There is an existing Python 3 binding for mRMR that I've used: https://github.com/fbrundu/pymrmr.

amueller commented 5 years ago

@hermidalc Sure, we can provide different ways to do selection; my point was that we want these to be phrased as ready-made transformations. If you have suggestions on what methods to offer for selecting among the correlated features, I'm sure we'd be happy to include them (if there are relevant references).

scikit-learn algorithms are mostly written in Cython and we try to avoid including 3rd party C or C++ code.

SSaishruthi commented 5 years ago

@amueller What is the suggested approach for me to start with? For now, I am thinking of computing the correlation between features, taking the average correlation of each feature, and filtering accordingly.

Will that work?

hermidalc commented 5 years ago

@hermidalc Sure, we can provide different ways to do selection; my point was that we want these to be phrased as ready-made transformations. If you have suggestions on what methods to offer for selecting among the correlated features, I'm sure we'd be happy to include them (if there are relevant references).

scikit-learn algorithms are mostly written in Cython and we try to avoid including 3rd party C or C++ code.

OK, makes sense, thanks for the clarification.

I forgot to mention that many of these existing multivariate feature selection methods require discretization of continuous input data before use. I know sklearn has KBinsDiscretizer(), though I found that, at least in the R libraries which implement various multivariate feature selection methods, they use the MDL method referenced below via RWeka::Discretize() or FSelectorRcpp::discretize().

U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pages 1022-1029, 1993.
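As a point of comparison (not a substitute for the MDL method above, which is supervised), scikit-learn's existing unsupervised discretizer would be used roughly like this before a mutual-information-based selector:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))

# unsupervised equal-frequency binning; the Fayyad-Irani MDL method instead
# picks cut points using the class labels, which KBinsDiscretizer cannot do
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)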

divyaprabha123 commented 4 years ago

As a starting point, can we remove exactly collinear variables just by using the rank of the matrix? Can I raise a pull request for this issue? @rth

rth commented 4 years ago

As a starting point, can we remove exactly collinear variables just by using the rank of the matrix? Can I raise a pull request for this issue? @rth

https://github.com/scikit-learn/scikit-learn/pull/14698 is already in rather good shape and only requires review. Once it is merged, improvements would be welcome. Also if you have any comments about that implementation don't hesitate to comment there.

You mean that, for CorrelationThreshold(threshold=1.0), we could decide whether computing the correlation matrix is necessary by checking the rank of the matrix? That could work, but it sounds like a special-case optimization, and we would still need the general implementation in that PR to know which columns to drop, unless I'm missing something.

divyaprabha123 commented 4 years ago

As a starting point, can we remove exactly collinear variables just by using the rank of the matrix? Can I raise a pull request for this issue? @rth

#14698 is already in rather good shape and only requires review. Once it is merged, improvements would be welcome. Also if you have any comments about that implementation don't hesitate to comment there.

You mean that, for CorrelationThreshold(threshold=1.0), we could decide whether computing the correlation matrix is necessary by checking the rank of the matrix? That could work, but it sounds like a special-case optimization, and we would still need the general implementation in that PR to know which columns to drop, unless I'm missing something.

I just thought it doesn't matter which variable we remove if they are 100% correlated, so we don't need to check the correlations to decide which one to drop, right?
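For illustration, a minimal sketch of that rank idea (drop_exactly_collinear is just an illustrative name); centering first makes a perfect (anti-)correlation show up as exact linear dependence:

import numpy as np

def drop_exactly_collinear(X, tol=1e-10):
    """Keep a maximal set of columns, none of which is perfectly (anti-)correlated
    with a column kept earlier; earlier columns win ties."""
    Xc = X - X.mean(axis=0)  # center so affine relations become linear dependence
    keep = []
    for j in range(Xc.shape[1]):
        # keep column j only if it raises the rank of the kept, centered columns
        if np.linalg.matrix_rank(Xc[:, keep + [j]], tol=tol) == len(keep) + 1:
            keep.append(j)
    return X[:, keep], keep

X = np.array([[3.0, 2.0, 9.0], [2.0, 1.0, 2.0], [2.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
X_reduced, kept = drop_exactly_collinear(X)  # kept == [0, 2]; column 1 is column 0 minus 1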

tadorfer commented 4 years ago

Here is a proposed function for removing highly correlated features:

import numpy as np

# input data, 1st and 2nd column highly correlated
X = np.array([[3, 2, 9], [2, 1, 2], [2, 1, 1], [1, 0, 3]])

>>> X
array([[3, 2, 9],
       [2, 1, 2],
       [2, 1, 1],
       [1, 0, 3]])

def correlation_selection(X, threshold=.9, rowvar=False):
    """Remove highly correlated feature columns.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data.

    threshold : float, default=.9
        Absolute correlation above which a column is dropped.

    rowvar : bool, default=False
        Passed to np.corrcoef; False means columns are the variables.

    Returns
    -------
    X_reduced : ndarray of shape (n_samples, n_features_reduced)
        Data with the correlated columns removed.
    """
    # absolute pairwise correlations, upper triangle only (k=1 excludes the diagonal)
    corr = np.absolute(np.corrcoef(X, rowvar=rowvar))
    upper = np.triu(corr, k=1)

    # drop a column if it is highly correlated with any earlier column
    to_drop = [column for column in range(upper.shape[1])
               if np.any(upper[:, column] >= threshold)]
    X_reduced = np.delete(X, to_drop, axis=1)

    return X_reduced

X_reduced = correlation_selection(X)

# feature matrix after removing correlated columns
>>> X_reduced
array([[3, 9],
       [2, 2],
       [2, 1],
       [1, 3]])

What do you all think?

thomasjpfan commented 4 years ago

The current issue is not the implementation, but whether this feature should be included in scikit-learn at all. This is summarized here: https://github.com/scikit-learn/scikit-learn/pull/14698#issuecomment-590006773