muellerzr / fastinference

A collection of inference modules for fastai2
https://muellerzr.github.io/fastinference
Apache License 2.0

Dendrogram correlates incorrectly continuous and categorical variables #22

Closed hal-314 closed 4 years ago

hal-314 commented 4 years ago

Hi

I noticed that the current dendrogram implementation doesn't differentiate between categorical and continuous variables. It computes correlation for all of them as if they were categorical, using the Cramér's V statistic. However, that statistic is designed for categorical variables (wikipedia).

For continuous variables, Spearman or Kendall correlation should be used instead. I would recommend Kendall, as Spearman supposes that the relationship between the variables is always positive or always negative (from wikipedia: "It assesses how well the relationship between two variables can be described using a monotonic function").

Here is an example of how this can mislead users. It shows the correlation between continuous variables as estimated by Cramér's V, Spearman and Kendall:

[Screenshot 2020-10-21 at 10 46 41]

The Cramér's V correlation is quite different from the Spearman and Kendall ones.
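The gap between the three statistics can be reproduced on synthetic data. The snippet below is only a minimal sketch: `cramers_v` is a hypothetical helper written for this illustration (plain, uncorrected Cramér's V applied to raw values without binning), not the function fastinference uses, and the data is random noise.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    """Plain (uncorrected) Cramér's V between two sequences of labels."""
    table = pd.crosstab(np.asarray(x), np.asarray(y)).to_numpy()
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)  # drawn independently of a

# Every continuous value is treated as its own category, so the
# contingency table has a single 1 in each row and column and V
# evaluates to 1 whatever the real relationship is.
v = cramers_v(a, b)

# Rank-based statistics stay close to 0 for independent samples.
rho, _ = stats.spearmanr(a, b)
tau, _ = stats.kendalltau(a, b)

print(f"Cramér's V: {v:.3f}  Spearman: {rho:.3f}  Kendall: {tau:.3f}")
```

Two independent continuous columns therefore look perfectly associated under Cramér's V, while Spearman and Kendall report almost no association.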

Ordinal variables can be treated either as categorical or, if Kendall or Spearman correlation is used, as continuous.

I don't know how to assess the relationship between a categorical and a continuous variable. I would use the Kruskal-Wallis test (the non-parametric version of one-way ANOVA) to test whether a relationship exists. However, I don't know how to quantify it :/ .
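As a rough illustration of the Kruskal-Wallis idea, the sketch below runs `scipy.stats.kruskal` on one continuous column split by the levels of a categorical one. The helper name `kruskal_effect` and the toy data are made up for this example. The last return value, epsilon-squared (H divided by n - 1), is one effect size sometimes reported alongside the test; using it here is an assumption of this sketch, not something the issue or the library settles on.

```python
import numpy as np
from scipy import stats

def kruskal_effect(values, groups):
    """Kruskal-Wallis H, p-value and epsilon-squared for values split by groups."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    # One sample of the continuous variable per category level.
    samples = [values[groups == g] for g in np.unique(groups)]
    h, p = stats.kruskal(*samples)
    # Assumption: epsilon-squared = H / (n - 1), which lies in [0, 1].
    eps2 = h / (len(values) - 1)
    return float(h), float(p), float(eps2)

# Toy data: three categories whose values do not overlap at all.
groups = np.repeat(["a", "b", "c"], 10)
values = np.concatenate([
    np.arange(10),
    np.arange(100, 110),
    np.arange(200, 210),
])

h, p, eps2 = kruskal_effect(values, groups)
print(f"H={h:.2f}  p={p:.2g}  epsilon^2={eps2:.2f}")
```

The p-value answers whether a relationship exists, while a normalised quantity such as epsilon-squared could, in principle, stand in for a correlation value when comparing variable pairs.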

Here is a nice introduction to the problem.

So, the current plot_dendrogram implementation can mislead users. My proposal is to change plot_dendrogram as follows:

  1. Draw two dendrograms, one for categorical variables and another for continuous variables.
  2. Optionally, allow the caller to specify which variables will be treated as categorical and which as continuous.
  3. Make passing a dataframe optional. By default, use the training dataframe.
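The first two points of the proposal could look roughly like the sketch below. It is not the fastinference implementation: `split_linkages`, `linkage_from_corr` and `cramers_v` are hypothetical names, the column lists are passed explicitly, and 1 - |correlation| with average linkage is just one plausible choice of distance and clustering method.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

def cramers_v(x, y):
    """Plain (uncorrected) Cramér's V between two categorical columns."""
    table = pd.crosstab(np.asarray(x), np.asarray(y)).to_numpy()
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (table.sum() * k))) if k > 0 else 0.0

def linkage_from_corr(corr):
    """Cluster variables using 1 - |correlation| as the distance."""
    dist = np.clip(1.0 - np.abs(corr.to_numpy(dtype=float)), 0.0, None)
    dist = (dist + dist.T) / 2      # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    return hierarchy.linkage(squareform(dist, checks=False), method="average")

def split_linkages(df, cat_names, cont_names):
    """Build one linkage per variable type instead of mixing them."""
    cat_corr = pd.DataFrame(
        [[cramers_v(df[a], df[b]) for b in cat_names] for a in cat_names],
        index=cat_names, columns=cat_names,
    )
    cont_corr = df[cont_names].corr(method="kendall")
    return linkage_from_corr(cat_corr), linkage_from_corr(cont_corr)

# Toy frame: c2 duplicates c1, and x2 is x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
c1 = rng.integers(0, 3, size=200)
df = pd.DataFrame({
    "c1": c1, "c2": c1, "c3": rng.integers(0, 3, size=200),
    "x": x, "x2": x + 0.01 * rng.normal(size=200), "z": rng.normal(size=200),
})
cat_names, cont_names = ["c1", "c2", "c3"], ["x", "x2", "z"]

cat_Z, cont_Z = split_linkages(df, cat_names, cont_names)

# no_plot=True only computes the leaf order; drop it to draw with matplotlib.
cat_order = hierarchy.dendrogram(cat_Z, labels=cat_names, no_plot=True)["ivl"]
cont_order = hierarchy.dendrogram(cont_Z, labels=cont_names, no_plot=True)["ivl"]
print(cat_order, cont_order)
```

Keeping the two linkages separate means each dendrogram only compares variables through a statistic that suits their type. The third point (falling back to the training dataframe) depends on the learner object and is left out here.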

I'll make a PR to fix it.

EDIT: the same applies to get_top_corr_dict