better motivating examples ....

rmflight commented 3 years ago

I understand mtcars is convenient, and GenData is nice because it generates good example data. And the examples using them are nice. I understand that other statisticians might understand why they want to use these methods, but why should I, as a regular user, use them? What are the issues with doing Pearson and Kendall correlations with these types of data? How do the values change?

muellsen commented 3 years ago

Thanks for the comment. This is, indeed, a great point. We are, however, not sure how to answer the question succinctly. There is certainly nothing "wrong" per se with computing Pearson or Kendall correlation on the data. These estimators simply provide different measures of association. As is often advertised for Kendall and Spearman, they can pick up, to a certain extent, "non-linear" dependencies among variables. For instance, Spearman correlation can detect when two variables are monotonically related (not just linearly). Regarding the question "how do the values change": even for Spearman vs. Pearson, this is a purely data-dependent question and cannot be answered in general. The same is true for latent correlation vs. Pearson correlation (as shown, e.g., in the two examples). In the mtcars data, we see both increased and decreased correlation estimates, depending on the types and entries of the data.
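
To make the Spearman vs. Pearson point concrete, here is a minimal sketch in plain Python (not latentcor code). The data and the helper names `pearson`, `ranks`, and `spearman` are made up for illustration, and the rank helper assumes there are no ties.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..n of the values in v (illustrative helper; assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# A strictly monotone but strongly non-linear relationship: y = exp(x).
x = [i / 2 for i in range(1, 21)]
y = [math.exp(v) for v in x]

pearson_xy = pearson(x, y)
spearman_xy = spearman(x, y)
print(f"Pearson:  {pearson_xy:.3f}")
print(f"Spearman: {spearman_xy:.3f}")
```

Because the ranks of `x` and `y` coincide, the Spearman value is 1, while the Pearson value is noticeably lower. How large the gap is depends entirely on the data, which is the point made above.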

What we can say about latentcor, however, is the following:

Compared to Pearson or rank-based approaches, latentcor follows a slightly different philosophy by making use of a semi-parametric (latent) Gaussian copula. The seminal papers that introduce those concepts in (high-dimensional) statistics are relatively recent (Liu et al., 2009, and especially Fan et al., 2017, as referenced in our paper), and they considered only continuous and binary variables. The other papers mentioned in the table in this paper extend this framework to other variable types. In addition, this paper and vignette also introduce a novel (and important) case, namely the zero-inflated continuous/ternary case. The pivotal element of the approach is the bridge function, which allows one to (almost) analytically estimate a latent "true" correlation from something that can be estimated from data (Kendall's tau). These bridge functions are different for every variable type, as referred to in the paper, and are implemented here for the first time.
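
As a rough illustration of the bridge-function idea, the sketch below covers only the simplest case, two continuous variables under a Gaussian copula, where the bridge function is the classical relation tau = (2/pi) * arcsin(r), so r = sin(pi/2 * tau). It is plain Python and not latentcor's implementation; the bridge functions for binary, ternary, and zero-inflated variables differ and are not reproduced here. The simulated data, the seed, and the names `kendall_tau` and `bridge_inverse_continuous` are assumptions made for the example.

```python
import math
import random

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs. O(n^2)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1
            elif prod < 0:
                s -= 1
    return s / (n * (n - 1) / 2)

def bridge_inverse_continuous(tau):
    """Inverse bridge function for the continuous/continuous case only."""
    return math.sin(math.pi / 2 * tau)

rng = random.Random(0)   # fixed seed so the run is reproducible
rho = 0.7                # latent correlation used to simulate the data
n = 800

# Latent bivariate Gaussian with correlation rho.
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in z1]

# Observed data: strictly increasing, non-linear transforms of the latent variables.
x = [math.exp(a) for a in z1]
y = [b ** 3 for b in z2]

tau_hat = kendall_tau(x, y)
latent_hat = bridge_inverse_continuous(tau_hat)
print(f"Kendall's tau:               {tau_hat:.3f}")
print(f"Latent correlation estimate: {latent_hat:.3f} (simulated with rho = {rho})")
```

Kendall's tau is unchanged by the monotone transforms, so the estimate targets the correlation of the latent Gaussian variables rather than that of the transformed observations.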

If we were bold, we could argue that, if your data fall into one of the categories we outline here, you should always compute latent correlations in practice, since they capture, in some sense, a more statistically interpretable quantity (the latent correlation is then similar to Pearson correlation) that can conveniently be used for downstream analysis (say, learning graphical models, doing CCA, etc.).

That practitioners have not used latent correlations so far is perhaps due to two reasons: i) The idea of latent correlation is relatively new, and not many people are aware of it. ii) There was a considerable computational burden associated with it. In the original method, not only did one need to compute Kendall's tau (which is costly compared to Pearson), but one also needed to solve a univariate optimization problem for every entry of the correlation matrix.
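
The per-entry cost described above can be sketched as follows. This is plain Python under simplifying assumptions, not the original method's code: the univariate problem is solved by bisection on a monotone bridge function, and the continuous-case bridge tau = (2/pi) * arcsin(r) is used as a stand-in, since the mixed-type bridge functions are not reproduced here. The names `invert_bridge` and `latent_matrix` and the example tau matrix are hypothetical.

```python
import math

def bridge_continuous(r):
    """Stand-in bridge function (continuous/continuous case): r -> tau."""
    return 2 / math.pi * math.asin(r)

def invert_bridge(bridge, tau, tol=1e-10):
    """Solve bridge(r) = tau for r in [-1, 1] by bisection.

    Assumes bridge is increasing in r, so no closed-form inverse is needed.
    """
    lo, hi = -1.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bridge(mid) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def latent_matrix(tau_matrix, bridge):
    """Run one univariate solve per off-diagonal entry of the tau matrix."""
    p = len(tau_matrix)
    out = [[1.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1, p):
            r = invert_bridge(bridge, tau_matrix[j][k])
            out[j][k] = out[k][j] = r
    return out

# Made-up Kendall's tau matrix for three variables.
tau_matrix = [
    [1.0, 0.5, -0.2],
    [0.5, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]
latent = latent_matrix(tau_matrix, bridge_continuous)
for row in latent:
    print([round(v, 4) for v in row])
```

With p variables this requires p(p-1)/2 separate solves on top of the Kendall's tau computation, which is where the burden comes from as p grows.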

We hope that latentcor removes the second barrier and provides fast, easy-to-use software that practitioners can try on their own data. For early "success stories" of the use of latent correlations, we refer to the Yoon et al. papers, which showcase improved performance on real-world data for CCA and graphical models, respectively.

rmflight commented 3 years ago

Thank you for the explanation.

I also think the examples in the vignette and paper are improved.

mingzehuang / latentcor

better motivating examples .... #6