Open AndreaPi opened 9 months ago
Any news about this? I think it would be really useful for people doing Data Science with Polars.
Just looking into why they differ, it seems DataFrame.corr dispatches to np.corrcoef, whereas pl.corr was added later with native Rust implementations:
https://github.com/pola-rs/polars/blob/main/crates/polars-plan/src/dsl/function_expr/correlation.rs
Interesting! It would make sense then to drop the dispatch to np.corrcoef and rely instead on the Rust implementation of pl.corr to compute the correlation matrix. This not only solves my issue (i.e., allowing the computation of Spearman correlation coefficients) but also prevents possible discrepancies between the results of pl.corr and DataFrame.corr. Since a correlation matrix must be symmetric, it could be a good idea to compute only the elements above the diagonal: the diagonal elements can be set to 1, and each element below the diagonal can be set to the corresponding above-diagonal element. Also, I suggest that, as in Pandas, only pairwise complete observations are used to compute the correlation matrix:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html
Any hope of adding this to Polars 1.0.0?
Description
Currently, polars.DataFrame.corr only supports the computation of Pearson correlation coefficients between columns, unlike pandas.DataFrame.corr, which supports Pearson, Kendall Tau and Spearman correlation coefficients. The lack of Spearman is particularly puzzling given that polars.corr supports the computation of the Spearman rank correlation coefficient between two columns. It shouldn't be too hard, then, to extend this functionality to the whole dataframe rather than just two columns.