ianozsvald / beyond_correlation

Exploratory code to see if we can learn about feature relationships in a DataFrame using machine learning
MIT License

Try mutual information, GreedyBayes? #10

Open DrAndiLowe opened 6 years ago

DrAndiLowe commented 6 years ago

Hi Ian,

Instead of using standard correlation tests (Pearson, Spearman and Kendall), have you considered using a normalised variant of mutual information to capture nonlinear relationships between pairs of features? I've used symmetric uncertainty for this in my own work, after first discretising the features using a Freedman-Diaconis rule to choose the number of bins. I used the Miller-Madow asymptotic bias-corrected estimator for MI. You might also want to check out the GreedyBayes algorithm, which uses MI to build a Bayesian network over the feature space; see the paper: Jun Zhang et al., "PrivBayes: Private Data Release via Bayesian Networks", SIGMOD.
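For illustration, here is a minimal sketch of symmetric uncertainty between two numeric columns, assuming Freedman-Diaconis binning via numpy's `bins='fd'` option and a simple plug-in entropy estimate (the Miller-Madow correction mentioned above is omitted to keep it short):

```python
import numpy as np
import pandas as pd

def symmetric_uncertainty(x: pd.Series, y: pd.Series) -> float:
    """SU = 2 * I(X;Y) / (H(X) + H(Y)); ranges from 0 to 1."""
    # Freedman-Diaconis bin edges for each feature
    x_bins = np.histogram_bin_edges(x, bins='fd')
    y_bins = np.histogram_bin_edges(y, bins='fd')
    # joint histogram -> joint and marginal probabilities
    joint, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
    mi = h_x + h_y - h_xy  # plug-in mutual information estimate
    return 2.0 * mi / (h_x + h_y) if (h_x + h_y) > 0 else 0.0
```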

DrAndiLowe commented 6 years ago

There is also distance correlation, but I never found it useful in my own work because it requires constructing a distance matrix, which maxed out RAM. Something to try if your data is small?
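A rough numpy sketch of distance correlation for two 1-D arrays is below; it builds the full n x n distance matrices, which is exactly the O(n^2) memory cost mentioned above, so it only suits small samples (the `dcor` package offers a tested implementation if this is worth pursuing):

```python
import numpy as np

def distance_correlation(x: np.ndarray, y: np.ndarray) -> float:
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)  # pairwise distance matrices (the RAM-hungry step)
    b = np.abs(y - y.T)
    # double-centre each distance matrix
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()          # squared distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```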

ianozsvald commented 6 years ago

Hey @andrewjohnlowe, thanks for the comments :-) I have no experience with MI, but I'd be happy to take a look at the paper down the line. Cheers.

DrAndiLowe commented 6 years ago

Look here: DataResponsibly/DataSynthesizer. Specifically, the lib directory, which contains a Python implementation of GreedyBayes and an MI calculation built on sklearn.
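As a hypothetical sketch of that idea applied to this repo, a pairwise mutual-information matrix over DataFrame columns could use sklearn's `mutual_info_score` on discretised columns; the function name, bin count, and the assumption of no missing values are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def pairwise_mutual_information(df: pd.DataFrame, bins: int = 20) -> pd.DataFrame:
    """Symmetric matrix of MI scores between all column pairs (assumes no NaNs)."""
    # discretise numeric columns into integer bin labels; leave categoricals as-is
    discretised = df.apply(
        lambda col: pd.cut(col, bins=bins, labels=False)
        if np.issubdtype(col.dtype, np.number) else col
    )
    cols = df.columns
    mi = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for i, c1 in enumerate(cols):
        for c2 in cols[i:]:
            score = mutual_info_score(discretised[c1], discretised[c2])
            mi.loc[c1, c2] = mi.loc[c2, c1] = score
    return mi
```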

PeteBleackley commented 6 years ago

I could contribute a method to calculate Mutual Information between pairs of columns in a DataFrame, if you like.