sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
2.06k stars, 206 forks

Support Categorical Correlation #482

Open jinglinpeng opened 3 years ago

jinglinpeng commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, plot_correlation only works for numerical variables. This issue extends plot_correlation to support categorical variables.

Describe the solution you'd like

  1. plot_correlation(df): Add a Cramér's V correlation matrix for all categorical columns. Time: 2021.01.20-2021.01.27
  2. plot_correlation(df, x = cat): Add Cramér's V correlation for categorical columns. Time: 2021.01.27-2021.02.03
  3. Add docs and tests. Time: 2021.02.03-2021.02.10
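
For reference, here is a minimal sketch of the statistic the plan above relies on. It is not dataprep's implementation: the names `cramers_v` and `cramers_v_matrix` are hypothetical, columns are assumed to be plain Python sequences of category labels, and no bias correction is applied. Cramér's V is computed as sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the Pearson chi-squared statistic of the r-by-c contingency table and n is the number of rows.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels.

    Hypothetical helper for illustration; not part of dataprep.
    """
    if len(xs) != len(ys):
        raise ValueError("both columns must have the same length")
    n = len(xs)
    row_counts = Counter(xs)          # marginal counts of the first column
    col_counts = Counter(ys)          # marginal counts of the second column
    joint = Counter(zip(xs, ys))      # contingency table as a sparse counter
    k = min(len(row_counts), len(col_counts))
    if n == 0 or k < 2:
        # V is undefined when a column has fewer than two categories.
        return float("nan")
    chi2 = 0.0
    for r, r_count in row_counts.items():
        for c, c_count in col_counts.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def cramers_v_matrix(columns):
    """Pairwise Cramér's V for a dict mapping column name -> list of labels."""
    names = list(columns)
    return {a: {b: cramers_v(columns[a], columns[b]) for b in names} for a in names}


if __name__ == "__main__":
    data = {
        "color": ["red", "blue", "red", "blue"],
        "shape": ["circle", "square", "circle", "square"],
        "size": ["S", "S", "L", "L"],
    }
    for name, row in cramers_v_matrix(data).items():
        print(name, row)
```

The matrix helper corresponds to step 1 (all categorical columns), and a single row of it corresponds to step 2 (one categorical column against the others). With a pandas DataFrame, each column could be passed as `df[col].tolist()`.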

Reference:

  1. https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34
  2. https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365

Describe alternatives you've considered

NA

Additional context

NA

Abdelgha-4 commented 3 years ago

Hi @jinglinpeng, I suggest that you add Phik correlation too.

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution.

Extensive documentation is available here: https://phik.readthedocs.io/en/latest/

dovahcrow commented 3 years ago

Thanks @Abdelgha-4 for the suggestion! Indeed, we once considered the PhiK correlation at https://github.com/sfu-db/dataprep/pull/145. However, PhiK is generally very slow compared to other correlations, so we decided to defer the implementation until someone thinks it is really needed.

Abdelgha-4 commented 3 years ago

I see! sorry then, I didn't notice it was already discussed.

dovahcrow commented 3 years ago

> I see! sorry then, I didn't notice it was already discussed.

No worries! If you think this is an important feature then we can certainly add it.