smazzanti / mrmr

mRMR (minimum-Redundancy-Maximum-Relevance) for automatic feature selection at scale.

Does this method actually remove redundancy? #48

Open AllardJM opened 4 months ago

AllardJM commented 4 months ago

First, great library and related blog posts. I was beginning to code this procedure myself and then stumbled upon your work. Here is my question / concern. I am using data likely akin to Uber's for marketing purposes (a mix of continuous and dummy-coded features, some highly predictive, some irrelevant, and correlation between engineered features). If I look at the complete list of features and count how many have an absolute correlation over 0.6 with another feature, there are many. After the feature selection, the proportion of such correlated features is actually higher. The issue seems to be that the F-statistic can be very large for some correlated features, and it can't be dampened enough by the denominator.
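(For context, if I read the default scoring scheme correctly, at each greedy step a candidate feature $f$ is scored as its F-statistic divided by its mean absolute correlation with the already-selected set $S$:

$$\mathrm{score}(f) = \frac{F(f, y)}{\frac{1}{|S|}\sum_{s \in S} |\rho(f, s)|}$$

Since the denominator can never exceed 1, a sufficiently large numerator can outweigh any redundancy penalty.)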

Here is an example adapted from your quick start (with a few changes):

import pandas as pd
import matplotlib.pyplot as plt
from mrmr import mrmr_classif
from sklearn.datasets import make_classification

# create some data: 10 informative features plus 40 redundant (correlated) ones
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=40)
X = pd.DataFrame(X)
y = pd.Series(y)

# absolute pairwise correlations, floored at a tiny positive value
corr_X = X.corr().abs().clip(0.00001)

# for each feature, count how many OTHER features it is correlated with
# above the threshold (the -1 excludes the feature's self-correlation)
threshold_corr = 0.6
pdf_feature_cnt_corr = (corr_X > threshold_corr).sum(axis=1) - 1
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\nAbove {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[image: bar chart of the counts above, computed over all 100 features]

# run mRMR feature selection, then repeat the count on the selected subset only
selected_features = mrmr_classif(X, y, K=10)
pdf_feature_cnt_corr = (corr_X.loc[selected_features, selected_features] > threshold_corr).sum(axis=1) - 1
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\nAbove {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[image: bar chart of the same counts, restricted to the 10 selected features]

It seems to me that we end up with far fewer features, but the ones left still show a strong amount of correlation, in terms of the proportion of the selected candidate features that are correlated with one another.

erinMahoney commented 2 months ago

Hello, we ran into this issue as well. Our solution was to transform the denominator (leveraging the redundancy parameter) with something like $\frac{1}{(1-|\rho|)^4}$, so that highly correlated features (say, correlation > 0.95) are penalized severely. We also took the square root of the F-statistic.
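For anyone who wants to try this, here is a minimal, self-contained sketch of that scoring tweak. It is a plain greedy mRMR loop and does not go through mrmr's own customization hooks; the exponent, the clipping value, and the function name mrmr_penalized are our assumptions, not part of the library:

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_penalized(X, y, K, power=4):
    """Greedy mRMR with sqrt(F) relevance and a 1/(1-|corr|)^power redundancy penalty (a sketch, not the library's implementation)."""
    # relevance: square root of each feature's F-statistic against the target
    relevance = pd.Series(np.sqrt(f_classif(X, y)[0]), index=X.columns)
    # absolute pairwise correlations, capped below 1 to avoid division by zero
    corr = X.corr().abs().clip(upper=0.999)
    selected, candidates = [], list(X.columns)
    for _ in range(K):
        if selected:
            # penalty explodes as correlation with any selected feature approaches 1
            penalty = (1.0 / (1.0 - corr.loc[candidates, selected]) ** power).mean(axis=1)
        else:
            # first pick: no redundancy yet, so relevance alone decides
            penalty = pd.Series(1.0, index=candidates)
        score = relevance[candidates] / penalty
        best = score.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected

With the data from the snippet above, mrmr_penalized(X, y, K=10) drops in where mrmr_classif(X, y, K=10) was used, so the two correlation bar charts can be compared directly.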