Hi
I noticed that the current dendrogram implementation doesn't differentiate between categorical and continuous variables. It computes the correlation for every pair of variables with Cramér's V, as if they were all categorical. However, this statistic is designed for categorical variables only (Wikipedia).
For continuous variables, Spearman or Kendall correlation would be more appropriate. I would lean towards Kendall, although note that both are rank-based and only capture monotonic relationships (from Wikipedia, about Spearman: "It assesses how well the relationship between two variables can be described using a monotonic function").
Here is an example of how this can mislead users. It shows the correlation between continuous variables estimated with Cramér's V, Spearman and Kendall:
The Cramér's V values are quite different from the Spearman and Kendall ones.
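To make the difference concrete, here is a minimal sketch that computes the three measures on two independent continuous variables. The `cramers_v` helper is my own illustration (it is not the library's code) and assumes every distinct value is treated as its own category. When all values are distinct, the contingency table has a single observation per row and column, so Cramér's V comes out as 1 even though the variables are unrelated, while Spearman and Kendall stay close to 0.

```python
import numpy as np
from scipy.stats import chi2_contingency, kendalltau, spearmanr


def cramers_v(x, y):
    """Cramér's V, treating every distinct value of x and y as a category.

    Hypothetical helper for illustration only.
    """
    _, x_codes = np.unique(x, return_inverse=True)
    _, y_codes = np.unique(y, return_inverse=True)
    # Build the contingency table of co-occurrence counts.
    table = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(table, (x_codes, y_codes), 1)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))


rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)  # independent of x by construction

v = cramers_v(x, y)
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)
print(f"Cramér's V: {v:.3f}  Spearman: {rho:.3f}  Kendall: {tau:.3f}")
```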
Ordinal variables can be treated either as categorical or, if Kendall or Spearman correlation is used, as continuous.
I don't know how to assess the relationship between a categorical and a continuous variable. I would use the Kruskal-Wallis test (the non-parametric version of one-way ANOVA) to test whether a relationship exists. However, I don't know how to quantify its strength :/ .
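One possible way to get a number out of the Kruskal-Wallis test, offered as a sketch rather than a settled answer: the H statistic divided by n - 1 (often called epsilon-squared) is the share of the rank variance explained by the groups, so it lies between 0 and 1. The `kruskal_association` helper and the toy data below are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_association(values, labels):
    """Kruskal-Wallis H, its p-value, and epsilon-squared = H / (n - 1).

    Hypothetical helper; epsilon-squared is only one candidate effect size.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # One array of continuous values per category.
    groups = [values[labels == g] for g in np.unique(labels)]
    h, p_value = kruskal(*groups)
    eps2 = h / (len(values) - 1)
    return h, p_value, eps2


rng = np.random.default_rng(0)
labels = np.repeat(["a", "b", "c"], 50)
shift = np.repeat([0.0, 1.0, 2.0], 50)
dependent = rng.normal(size=150) + shift  # mean depends on the category
unrelated = rng.normal(size=150)          # no link to the category

for name, values in [("dependent", dependent), ("unrelated", unrelated)]:
    h, p_value, eps2 = kruskal_association(values, labels)
    print(f"{name}: H={h:.2f}  p={p_value:.3g}  epsilon^2={eps2:.3f}")
```

Because the value is bounded like a squared correlation, its square root could in principle sit in the same matrix as the other measures, but whether that mix is sensible for clustering is an open question.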
Here is a nice introduction to the problem.
So, the current `plot_dendrogram` implementation can mislead users. My proposal is to change the current `plot_dendrogram` function so:
I'll make a PR to fix it.
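As a rough illustration of the direction discussed above for continuous columns (this is not the library's code, and `kendall_linkage` is a hypothetical name), a dendrogram could be built from Kendall correlations by using 1 - |tau| as the distance between two columns:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from scipy.stats import kendalltau


def kendall_linkage(columns):
    """Cluster continuous columns using the distance 1 - |Kendall tau|.

    `columns` maps a column name to a 1-D array. Hypothetical helper.
    """
    names = list(columns)
    k = len(names)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            tau, _ = kendalltau(columns[names[i]], columns[names[j]])
            dist[i, j] = dist[j, i] = 1.0 - abs(tau)
    # linkage expects the condensed (upper-triangle) form of the distances.
    return names, linkage(squareform(dist), method="average")


rng = np.random.default_rng(0)
a = rng.normal(size=300)
data = {
    "a": a,
    "a_noisy": a + 0.1 * rng.normal(size=300),
    "unrelated": rng.normal(size=300),
}
names, z = kendall_linkage(data)
# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(z, labels=names, no_plot=True)
print(tree["ivl"])
```

Categorical pairs would keep Cramér's V under this scheme; how to fill the mixed categorical/continuous cells is the part I'm unsure about.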
EDIT: the same applies to `get_top_corr_dict`.