amueller opened this issue 5 years ago
Can I take this? My knee-jerk approach would be to scan the numerical features and filter them based on a correlation threshold. If you feel this is not something a beginner can do, then I'll sit it out. I've looked at the code base, I understand how you check for the correct datatype (array-like etc.), and I know this code would go alongside the existing feature selection module.
I already do this using the cor() function in R, exposed to sklearn via rpy2. The feature selection method should let you choose between pearson (default), spearman, or kendall correlation, for example, and also require a cutoff threshold for correlation (or anti-correlation). The same could be done with scipy.
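Roughly what that could look like on the Python side without the rpy2 round-trip; this is only a sketch using pandas, which already supports all three methods (the 0.9 cutoff and the variable names are just illustrative):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
X[:, 1] = 2 * X[:, 0] + 0.01 * rng.rand(100)   # make two features highly correlated

# pairwise feature correlations; method can be "pearson", "spearman" or "kendall"
corr = pd.DataFrame(X).corr(method="spearman")
# boolean mask of feature pairs above the cutoff (the diagonal is trivially True)
high = corr.abs() >= 0.9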
If two features are correlated (above a certain threshold), what would be a good deterministic way to decide which one to keep? The simplest solution would be to keep the feature with the smaller column index.
I think this should always be left to the user and their particular problem. For some problems you do not want to remove correlated features and for some you want to select a representative one to remove redundancy.
I think this new functionality should calculate the pairwise correlation values between features, as well as their associated p-values, and then allow the user to select/filter features based on thresholds on these. Then we could add options allowing further filtering for different use cases.
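A minimal sketch of that first step with plain scipy (the function name is made up for illustration; scipy's pearsonr returns the coefficient together with its p-value):

import numpy as np
from scipy import stats

def pairwise_pearson(X):
    """Pairwise Pearson correlations between columns of X, with p-values."""
    n_features = X.shape[1]
    corr = np.eye(n_features)
    pvals = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            r, p = stats.pearsonr(X[:, i], X[:, j])
            corr[i, j] = corr[j, i] = r
            pvals[i, j] = pvals[j, i] = p
    return corr, pvals

# e.g. flag pairs that are both strongly correlated and significant
# corr, pvals = pairwise_pearson(X)
# mask = (np.abs(corr) >= 0.9) & (pvals <= 0.05)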
@thomasjpfan I think the standard is to drop the feature that's most correlated with the remaining set of features.
For some problems you do not want to remove correlated features and for some you want to select a representative one to remove redundancy.
Perhaps it might be worth considering an option that could be passed to the correlation feature selection class that would tell it to exclude certain columns from consideration.
Also, would this include the ability to pick how the correlation coefficient is calculated, e.g. use a nonparametric method like Kendall tau instead of Pearson's correlation?
If nobody is working on this, I would like to take this up. @AbhishekBabuji have you made any progress?
Is anyone working on this? I can take a look if no one is working
It doesn't seem to have active work, @SSaishruthi
@jnothman I would like to take this up.
Do you have any pointers to start with?
I read through the issue. As per my understanding, the function should compute the correlation between features followed by the selection process.
How do you think the selection should happen? Would removing the feature that is, on average, the most correlated with the other features work?
@SSaishruthi please read my previous comments on this. I think it would be a big mistake for sklearn to force filtering out features that are correlated with others; there are problems/use cases where you want to keep the most correlated features and filter out those which are not.
The choice should be left to the user, or sklearn should provide both functionalities: filtering out the most redundant correlated features (keeping representative features that do not correlate with each other), AND filtering out features that do not correlate with others (keeping features that do correlate with each other).
@hermidalc Thanks for the inputs. I think we can have two functionalities.
There's an approach called mRMR - "maximum relevance minimum redundancy"
mRMR is supervised, right? I think having both supervised and unsupervised methods would be helpful.
@hermidalc the point is to provide a transformer so this could be done in a pipeline. It's very easy to compute pairwise covariance and pearson correlation. It's not sklearn forcing anything, it's sklearn providing an estimator that does a certain transformation. Whether it's appropriate for the given dataset is up to the user. But the point would be to have it in a way that could be used as a transformation.
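To make the "transformer in a pipeline" point concrete, here is a bare-bones sketch of what such an estimator could look like; the class name, the threshold parameter and the keep-the-first-column rule are hypothetical, not the actual implementation:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelationSelector(BaseEstimator, TransformerMixin):
    """Drop features that are highly correlated with an earlier feature."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.abs(np.corrcoef(X, rowvar=False))
        upper = np.triu(corr, k=1)   # each pair counted once, diagonal excluded
        # keep a column only if no earlier column is correlated above the threshold
        self.support_ = ~(upper >= self.threshold).any(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

# usable like any other transformer, e.g.
# make_pipeline(CorrelationSelector(threshold=0.95), LogisticRegression())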
Sorry, maybe I've misunderstood this thread. I thought the goal was to build a correlation feature selection method that, on fit, computes the correlations between features (using one of a few possible methods); my input was that, on transform, we should provide more than one way to filter features using those correlations.
From my experience in compbio research, many times you don't want to remove all but one or a few representative features from each group of correlated features. Here's an example: suppose you have a gene expression dataset where none of the genes are strongly correlated, though you have weaker correlations. Only by selecting the entire group of weakly correlated genes together will you discover that they have good predictive value; by selecting only one or a few representatives you will find nothing.
If you want to go down the route of adding multivariate, or conditionally univariate, correlation-based feature selection methods such as CFS, FCBF, mRMR, or ReliefF, then I understand this is more than what I thought was the goal. Multivariate methods are terribly slow, so it's important to have bindings or the compute-intensive parts written in C, C++, Cython, etc. There is an existing Python 3 binding for mRMR that I've used: https://github.com/fbrundu/pymrmr.
@hermidalc Sure, we can provide different ways to do the selection; my point was that we want these to be phrased as ready-made transformations. If you have suggestions on what methods to offer for selecting among the correlated features, I'm sure we'd be happy to include them (if there are relevant references).
scikit-learn algorithms are mostly written in Cython and we try to avoid including 3rd party C or C++ code.
@amueller What is the suggested approach for me to start with? For now, I am thinking of computing the correlations between features, taking the average correlation of each feature with the others, and filtering accordingly.
Will that work?
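Concretely, something like this (only a sketch; the function name and the 0.7 default are placeholders):

import numpy as np

def filter_by_mean_correlation(X, threshold=0.7):
    """Drop features whose average absolute correlation with the others is high."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    # average absolute correlation of each feature with all other features
    mean_corr = corr.sum(axis=1) / (corr.shape[1] - 1)
    return X[:, mean_corr < threshold]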
Ok, makes sense, thanks for the clarification.
I forgot to mention that many of these existing multivariate feature selection methods require discretization of continuous input data before use. I know sklearn has KBinsDiscretizer(), though I found that the R libraries which implement various multivariate feature selection methods use the MDL method from the reference below, via RWeka::Discretize() or FSelectorCpp::discretize().
U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pages 1022-1029, 1993.
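For reference, this is what the unsupervised discretization sklearn already ships looks like (unlike the supervised Fayyad-Irani MDL method used in those R packages, which sklearn does not provide); the bin count and strategy here are just example choices:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.rand(100, 4)

disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)   # each column mapped to integer bin indices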
As a starting point, can we remove exactly collinear variables just by using the rank of the matrix? Can I raise a pull request for this issue? @rth
https://github.com/scikit-learn/scikit-learn/pull/14698 is already in rather good shape and only requires review. Once it is merged, improvements would be welcome. Also if you have any comments about that implementation don't hesitate to comment there.
You mean that for CorrelationThreshold(threshold=1.0) we could decide whether it's necessary to compute the correlation matrix by checking the rank of the matrix? That could work, but it sounds like a special-case optimization, and we would still need the general implementation in that PR to know which columns to drop, unless I'm missing something.
I just thought that it doesn't matter which variable we remove if they are 100% correlated, so we don't need to check the correlations to decide, right?
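To make the rank idea concrete, roughly like this (only a sketch for the exactly-collinear case; note that perfectly correlated columns only become linearly dependent after centering):

import numpy as np
from scipy.linalg import qr

X = np.array([[3., 2., 9.],
              [2., 1., 2.],
              [2., 1., 1.],
              [1., 0., 3.]])
# columns 0 and 1 are perfectly correlated but not scalar multiples,
# so center the columns first to turn |corr| == 1 into exact collinearity
Xc = X - X.mean(axis=0)

rank = np.linalg.matrix_rank(Xc)     # rank < n_features means some |corr| == 1
# a pivoted QR decomposition identifies which columns are linearly independent
_, _, pivots = qr(Xc, pivoting=True)
keep = np.sort(pivots[:rank])
X_reduced = X[:, keep]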
Here is a proposed function for removing highly correlated features:

import numpy as np

# input data, 1st and 2nd columns highly correlated
X = np.array([[3, 2, 9], [2, 1, 2], [2, 1, 1], [1, 0, 3]])
>>> X
array([[3, 2, 9],
       [2, 1, 2],
       [2, 1, 1],
       [1, 0, 3]])

def correlation_selection(X, threshold=.9, rowvar=False):
    """Remove highly correlated feature columns.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    threshold : float, default=.9
    rowvar : bool, default=False

    Returns
    -------
    X_reduced : ndarray of shape (n_samples, n_features_reduced)
    """
    # absolute pairwise correlations between features
    corr = np.absolute(np.corrcoef(X, rowvar=rowvar))
    # keep only the upper triangle (k=1 excludes the diagonal) so each pair
    # is considered once and a feature is never compared with itself
    upper = np.triu(corr, k=1)
    # drop every column that is highly correlated with an earlier column
    to_drop = [column for column in range(upper.shape[1])
               if np.any(upper[:, column] >= threshold)]
    X_reduced = np.delete(X, to_drop, axis=1)
    return X_reduced

X_reduced = correlation_selection(X)
# feature matrix after removing correlated columns
>>> X_reduced
array([[3, 9],
       [2, 2],
       [2, 1],
       [1, 3]])

What do you all think?
The current issue is not the implementation, but if this feature should be included in scikit-learn at all. This is summarized here: https://github.com/scikit-learn/scikit-learn/pull/14698#issuecomment-590006773
I wonder whether we should add unsupervised feature selection using correlation. It's a strategy that's very common in statistics, and we might include it just for completeness. It's certainly more useful than the variance-based one, which I think is useless for any threshold != 0.