stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Speedup QR decomposition #1229

Open t4c1 opened 5 years ago

t4c1 commented 5 years ago

Description

Eigen's QR decomposition can be improved on with better parameter tuning. GPUs can be used for a further speedup.
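
For reference, a minimal sketch (not Stan Math code; the matrix sizes are illustrative) of the Eigen Householder QR baseline that such comparisons are measured against:

```cpp
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main() {
  const int rows = 2000, cols = 500;  // illustrative sizes only
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(rows, cols);

  auto start = std::chrono::steady_clock::now();
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
  // Thin Q (rows x cols) and the corresponding upper-triangular R (cols x cols).
  Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity(rows, cols);
  Eigen::MatrixXd R = qr.matrixQR().topRows(cols).triangularView<Eigen::Upper>();
  auto end = std::chrono::steady_clock::now();

  std::cout << "QR took " << std::chrono::duration<double>(end - start).count()
            << " s, reconstruction error " << (A - Q * R).norm() << "\n";
}
```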

Example

QR decomposition is faster.

Expected Output

QR decomposition is faster.

Current Version:

v2.19.1

t4c1 commented 5 years ago

I have implemented three versions of the algorithm. The first uses the CPU, the second the GPU, and the last is a hybrid version that uses the GPU only for larger matrix products.

Here are some graphs showing speedups relative to the implementation from Eigen. Even my CPU version is faster, despite implementing the same algorithm. I guess mine has its parameter (block size) better tuned for my CPU. Measurements were done on a Core i5 2500 and a GTX 1070.

[Graphs: qr_speedup_const_cols_mkl, qr_speedup_const_rows_mkl, qr_speedup_square_mkl]

The question is whether we want all three in Stan Math. The CPU and hybrid versions are relatively simple implementations, while the full GPU version needs four new kernels.
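
For concreteness, a minimal sketch of the hybrid dispatch idea, with a hypothetical `gpu_multiply` stub standing in for the real OpenCL kernels; the threshold value is illustrative only:

```cpp
#include <Eigen/Dense>

namespace sketch {

// Stand-in for an OpenCL-backed multiply; here it falls back to Eigen so the
// sketch compiles. A real implementation would copy the operands to the
// device, run a GEMM kernel, and copy the result back.
inline Eigen::MatrixXd gpu_multiply(const Eigen::MatrixXd& a,
                                    const Eigen::MatrixXd& b) {
  return a * b;
}

// Tuning parameter: below roughly this many scalar multiply-adds, transfer
// overhead outweighs the GPU's throughput advantage (value is illustrative).
constexpr long gpu_flop_threshold = 1000000;

// The blocked QR update is dominated by one large matrix product per panel,
// so the hybrid version routes only sufficiently large products to the GPU.
inline Eigen::MatrixXd hybrid_multiply(const Eigen::MatrixXd& a,
                                       const Eigen::MatrixXd& b) {
  const long work = static_cast<long>(a.rows()) * a.cols() * b.cols();
  if (work < gpu_flop_threshold) {
    return a * b;  // small product: keep it on the CPU
  }
  return gpu_multiply(a, b);  // large product: worth the device round trip
}

}  // namespace sketch

int main() {
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(64, 64);
  return sketch::hybrid_multiply(a, b).allFinite() ? 0 : 1;
}
```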

wds15 commented 5 years ago

In case you have an Intel CPU, could you compile things with the Intel MKL as a backend for Eigen? As far as I am aware, Eigen can use that, and it should make a difference for Intel CPUs at least.
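
For anyone who wants to reproduce this, a minimal sketch of enabling MKL as Eigen's backend (assumes MKL is installed):

```cpp
// Defining EIGEN_USE_MKL_ALL before any Eigen header makes Eigen dispatch the
// operations it supports (including Householder QR) to MKL's BLAS/LAPACK.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(1000, 200);
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);  // now backed by MKL's LAPACK
  return qr.matrixQR().allFinite() ? 0 : 1;
}
```

The binary also has to be linked against MKL (for example `-lmkl_intel_lp64 -lmkl_sequential -lmkl_core` for the sequential layer); the exact link line depends on the MKL version and the threading layer chosen.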

t4c1 commented 5 years ago

I will try.

wds15 commented 5 years ago

Thanks for trying. I know this is a lot of work, but maybe the MKL is a great option as well. As we head in the direction of these optimisations, it is good to have some overview.

t4c1 commented 5 years ago

I have used Eigen with MKL before, so I have everything already set up. Tests are already running.

t4c1 commented 5 years ago

I have updated the graphs with MKL results.

wds15 commented 5 years ago

So Eigen with parallel MKL already gives some speedup, but this seems like an unfair comparison in that your CPU version is single-core, right?

How many cores were used for the parallel runs?

In any case, it seems that your proposed variant speeds things up.

t4c1 commented 5 years ago

Actually, sequential MKL speeds things up by around 20%, but that is hard to see at the scale of the graphs.

I don't think the comparison is unfair, but speedups from running multiple chains in parallel are probably better.

Parallel MKL used all 4 cores.
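
For completeness, a minimal sketch of switching between the sequential and 4-core MKL configurations at runtime (assumes Eigen built with `EIGEN_USE_MKL_ALL` and MKL's threaded layer linked in):

```cpp
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <mkl.h>

int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(4000, 1000);

  mkl_set_num_threads(1);  // sequential MKL run
  Eigen::HouseholderQR<Eigen::MatrixXd> qr_seq(A);

  mkl_set_num_threads(4);  // parallel MKL on all four cores
  Eigen::HouseholderQR<Eigen::MatrixXd> qr_par(A);

  // Setting MKL_NUM_THREADS in the environment works as well.
  return qr_seq.matrixQR().isApprox(qr_par.matrixQR()) ? 0 : 1;
}
```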

spinkney commented 5 months ago

@t4c1 is there a branch with these versions?

t4c1 commented 5 months ago

Huh, that is old. I found the branch here: https://github.com/bstatcomp/math/tree/gpu_qr