smazzanti / mrmr

mRMR (minimum-Redundancy-Maximum-Relevance) for automatic feature selection at scale.

Does this method actually remove redundancy? #48

Open AllardJM opened 4 months ago

AllardJM commented 4 months ago

First, great library and related blog posts. I was beginning to code this procedure myself and then stumbled upon your work. Here is my question / concern. I am using data likely akin to Uber's for marketing purposes (a mix of continuous and dummy-coded features, some highly predictive, some irrelevant, and correlation between engineered features). If I look at the complete list of features and count how many have an absolute correlation over 0.6 with another feature, there are many. After the feature selection, the proportion of such correlated features is actually higher. The issue seems to be that the F-statistic can be very large for some correlated features, and it can't be dampened enough by the denominator.
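(For context, if I read the default scoring scheme correctly, at each greedy step a candidate feature $f$ is scored as its F-statistic divided by its mean absolute correlation with the already-selected set $S$:

$$\mathrm{score}(f) = \frac{F(f, y)}{\frac{1}{|S|}\sum_{s \in S} |\rho(f, s)|}$$

Since the denominator can never exceed 1, a sufficiently large numerator can outweigh any redundancy penalty.)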

Here is an example adapted from your quick start (with a few changes):

import pandas as pd
import matplotlib.pyplot as plt
from mrmr import mrmr_classif
from sklearn.datasets import make_classification

# create some data: 10 informative features plus 40 redundant (correlated) ones
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=40)
X = pd.DataFrame(X)
y = pd.Series(y)

# absolute pairwise correlations, floored at a tiny positive value
corr_X = X.corr().abs().clip(0.00001)

# for each feature, count how many OTHER features it is correlated with
# above the threshold (the -1 excludes the feature's self-correlation)
threshold_corr = 0.6
pdf_feature_cnt_corr = (corr_X > threshold_corr).sum(axis=1) - 1
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\nAbove {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[image: bar chart of the counts above, computed over all 100 features]

# run mRMR feature selection, then repeat the count on the selected subset only
selected_features = mrmr_classif(X, y, K=10)
pdf_feature_cnt_corr = (corr_X.loc[selected_features, selected_features] > threshold_corr).sum(axis=1) - 1
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\nAbove {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[image: bar chart of the same counts, restricted to the 10 selected features]

It seems to me that we end up with far fewer features, but the ones left still show a strong amount of correlation, in terms of the proportion of the selected candidate features that are correlated with one another.

erinMahoney commented 2 months ago

Hello, we ran into this issue as well. Our solution was to transform the denominator (leveraging the redundancy parameter) with something like $\frac{1}{(1-|\rho|)^4}$, so that highly correlated features (say, correlation > 0.95) are penalized severely. We also took the square root of the F-statistic.
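For anyone who wants to try this, here is a minimal, self-contained sketch of that scoring tweak. It is a plain greedy mRMR loop and does not go through mrmr's own customization hooks; the exponent, the clipping value, and the function name mrmr_penalized are our assumptions, not part of the library:

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_penalized(X, y, K, power=4):
    """Greedy mRMR with sqrt(F) relevance and a 1/(1-|corr|)^power redundancy penalty (a sketch, not the library's implementation)."""
    # relevance: square root of each feature's F-statistic against the target
    relevance = pd.Series(np.sqrt(f_classif(X, y)[0]), index=X.columns)
    # absolute pairwise correlations, capped below 1 to avoid division by zero
    corr = X.corr().abs().clip(upper=0.999)
    selected, candidates = [], list(X.columns)
    for _ in range(K):
        if selected:
            # penalty explodes as correlation with any selected feature approaches 1
            penalty = (1.0 / (1.0 - corr.loc[candidates, selected]) ** power).mean(axis=1)
        else:
            # first pick: no redundancy yet, so relevance alone decides
            penalty = pd.Series(1.0, index=candidates)
        score = relevance[candidates] / penalty
        best = score.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected

With the data from the snippet above, mrmr_penalized(X, y, K=10) drops in where mrmr_classif(X, y, K=10) was used, so the two correlation bar charts can be compared directly.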