heal-research / HeuristicLab

HeuristicLab - An environment for heuristic and evolutionary optimization
https://dev.heuristiclab.com
GNU General Public License v3.0
36 stars 16 forks source link

Modern implementation for Pearson R correlation and covariance #3090

Open HeuristicLab-Trac-Bot opened 3 years ago

HeuristicLab-Trac-Bot commented 3 years ago

Issue migrated from trac ticket # 3090

milestone: HeuristicLab 3.3.x Backlog | component: Problems.DataAnalysis | priority: medium

2020-12-02 00:00:31: @foolnotion created the issue


Our OnlinePearsonsRCalculator calculates the Pearson's R correlation coefficient from variance and covariance calculated by the OnlineMeanAndVarianceCalculator and the OnlineCovarianceCalculator using algorithms by Knuth (variance) and Welford (covariance).

Implementation-wise, OnlinePearsonsRCalculator class actually encapsulates three other calculator objects. With every pair of values, each of the three calculators performs basically the same checks on the input, leading to some inefficiency and more complicated and error-prone code.

We could instead be using an extension of Welford's algorithm by Schubert et al. [1] which computes everything (means, variances, covariances) in one pass, thus providing a simpler, numerically-stable, more performant implementation.

[1] https://dl.acm.org/doi/10.1145/3221269.3223036

HeuristicLab-Trac-Bot commented 3 years ago

2020-12-02 00:16:05: @foolnotion changed status from new to accepted

HeuristicLab-Trac-Bot commented 3 years ago

2020-12-02 00:16:05: @foolnotion commented


r17787: Implement extension to Welford's algorithm by Schubert et al in the OnlinePearsonsRCalculator.

HeuristicLab-Trac-Bot commented 3 years ago

2020-12-02 15:46:18: @foolnotion commented


r17788: Revert r17787 due to numerical precision issues.

HeuristicLab-Trac-Bot commented 3 years ago

2021-02-02 18:34:35: @foolnotion uploaded file r2_calculator_speedup.png (107.0 KiB)

HeuristicLab-Trac-Bot commented 3 years ago

2021-02-02 18:40:01: @foolnotion commented


Another possibility here:

  • use SIMD datatypes provided System.Numerics.Vector<double> (availabel as Nuget package) to implement a vectorized version of the R2 calculator (1) brings significant speedup.

  • Between the two possible implementations (Welford algorithm vs Youngs-Cramer algorithm), use the latter as it is faster due to better pipelining (less heavy data dependency).

[[Image(r2_calculator_speedup.png)]]

The image shows the difference between the current HL implementation (Welford) and the prototype SIMD-enabled Youngs-Cramer implementation. The speed-up is very significant

(1) https://dl.acm.org/doi/10.1145/3221269.3223036