pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add the option to use Spearman's rank correlation in polars.DataFrame.corr #14457

Open AndreaPi opened 9 months ago

AndreaPi commented 9 months ago

Description

Currently, polars.DataFrame.corr only computes Pearson correlation coefficients between columns, unlike pandas.DataFrame.corr, which supports Pearson, Kendall's tau and Spearman correlation coefficients. The lack of Spearman is particularly puzzling given that polars.corr can already compute the Spearman rank correlation coefficient between two columns. It shouldn't be too hard, then, to extend this functionality from two columns to the whole dataframe.

AndreaPi commented 8 months ago

Any news on this? I think it would be really useful for people doing data science with Polars.

cmdlineluser commented 8 months ago

Just looking into why they differ: it seems DataFrame.corr dispatches to np.corrcoef:

https://github.com/pola-rs/polars/blob/20bf981f06c44bd4cfb1c36754ba5162db329270/py-polars/polars/dataframe/frame.py#L10270

Whereas pl.corr was added later, with native Rust implementations:

https://github.com/pola-rs/polars/blob/main/crates/polars-plan/src/dsl/function_expr/correlation.rs

AndreaPi commented 8 months ago

Interesting! It would make sense then to drop the dispatch to np.corrcoef and rely instead on the Rust implementation of pl.corr to compute the correlation matrix. This would not only solve my issue (i.e., allow the computation of Spearman correlation coefficients) but also prevent possible discrepancies between the results of pl.corr and DataFrame.corr. Since a correlation matrix must be symmetric, it might be a good idea to compute only the elements above the diagonal: the diagonal elements can be set to 1, and each element below the diagonal can be set to the corresponding above-diagonal element. Also, I suggest that, as in pandas, only pairwise complete observations are used to compute the correlation matrix:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html

AndreaPi commented 5 months ago

Any hope of adding this to Polars 1.0.0?