Closed: adoa closed this issue 4 years ago.
Your code seems to take a long time at `x += rng.gen::<f64>();` for large `n`, since the matrix is too small. Are you saying that the thread spawn itself is affected by the backend selection? ndarray-linalg does not sync each API call, including `solve`, and for all backends the LAPACK implementations also do not.
My code takes a long time at `x += rng.gen::<f64>();` because it is repeated n = 1000000000 times. This addition of random numbers is entirely unrelated to the matrix operations; its only purpose is to create CPU load.
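For concreteness, here is a minimal, dependency-free sketch of that load loop. Two assumptions to keep it self-contained: a simple xorshift64 generator stands in for rand's `rng.gen::<f64>()`, and `n` is shrunk so the snippet finishes quickly.

```rust
// Pure CPU load: average n pseudo-random numbers in [0, 1).
// xorshift64 replaces rand::thread_rng() so this compiles with std alone.
fn next_f64(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let n: u64 = 1_000_000; // the issue uses n = 1_000_000_000
    let mut state: u64 = 0x853C49E6748FEA9B; // arbitrary nonzero seed
    let mut x = 0.0_f64;
    for _ in 0..n {
        x += next_f64(&mut state);
    }
    println!("average = {}", x / n as f64); // close to 0.5
}
```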
Moreover, I know that the matrices I am working with are small enough to be handled on a single CPU core. In fact, I do not need the internals of the `solve_into` method to work concurrently. I suspect that the fact that openblas does work concurrently (to speed up the matrix multiplication) is somehow interfering with the threads that I spawn. But then again, I am no expert on what openblas or the netlib version of BLAS actually do. Maybe the netlib BLAS is just as concurrent as openblas; honestly, I have no idea.
As I said, the given code is only supposed to illustrate my observation; it is not the actual simulation I am interested in. Once again: my observation is that with the backend openblas, the above code creates 4 threads that all remain on the same core of my CPU. In other words, when I compile the code via `cargo build --release` and check the execution time via `time cargo run --release`, I get the following output with openblas:

```
Hello, world!
Thread no 0 finished averaging 1000000000 random numbers. Result: 0.49999078911875855
Thread no 0 says: result = [-0.1497393696152669, 0.425437614485622, 0.02453177625954876]
Thread no 3 finished averaging 1000000000 random numbers. Result: 0.499991433853502
Thread no 3 says: result = [0.5986091128403466, 0.32172052365502973, 0.25444959751171564]
Thread no 2 finished averaging 1000000000 random numbers. Result: 0.4999872276691157
Thread no 2 says: result = [0.4335370142533838, 0.04960389023165551, 0.19594093773296817]
Thread no 1 finished averaging 1000000000 random numbers. Result: 0.5000047640149131
Thread no 1 says: result = [0.9207793764539549, 0.04266369418868077, 0.25757798072975974]
cargo run --release 18.29s user 0.19s system 101% cpu 18.248 total
```
And with netlib:

```
Hello, world!
Thread no 2 finished averaging 1000000000 random numbers. Result: 0.4999994641660759
Thread no 2 says: result = [-0.31555953303615353, 0.4693003130876617, 0.10824291622042403]
Thread no 3 finished averaging 1000000000 random numbers. Result: 0.4999958195165672
Thread no 3 says: result = [0.860459658939488, 0.1205206858302787, 0.2614362692329853]
Thread no 0 finished averaging 1000000000 random numbers. Result: 0.5000182872080169
Thread no 0 says: result = [-0.19424866527420626, 0.3792818648760721, 0.2260402182401063]
Thread no 1 finished averaging 1000000000 random numbers. Result: 0.5000026187754052
Thread no 1 says: result = [0.18686955237156083, 0.0939948264568205, 0.16733912942573215]
cargo run --release 19.15s user 0.00s system 395% cpu 4.845 total
```
You can see that with openblas, my code uses only one CPU core (101% cpu) and therefore takes much longer (~18 seconds). The process does, in fact, spawn four child threads, but each uses only about 25% of that one core. With the backend netlib, on the other hand, each thread runs on a different core, resulting in a total usage of four cores (395% cpu); consequently, everything is done in under five seconds. The only difference between the two builds is the backend I select in the Cargo.toml, while the main.rs is exactly the same.
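For reference, switching backends amounts to changing a single cargo feature. A sketch of the two Cargo.toml variants (the feature names follow the ndarray-linalg README; the version numbers are placeholders, not taken from the issue):

```toml
[dependencies]
ndarray = "0.11"  # placeholder version

# Variant A: OpenBLAS backend
ndarray-linalg = { version = "0.9", features = ["openblas"] }

# Variant B: Netlib backend -- the only line that differs
# ndarray-linalg = { version = "0.9", features = ["netlib"] }
```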
Can you reproduce this timing behavior on your machine as well? Maybe it is very specific to my computer: I use Rust 1.27.1 from rustup on an Ubuntu 18.04 desktop machine with Linux kernel 4.15.0-24-generic. The processor is an Intel Xeon E3-1240 V2 @ 3.40GHz, a quad-core with hyperthreading, making eight hardware threads available. It could be that the scheduler of my operating system decides not to distribute the different threads over different CPU cores. But there must be a reason for this decision, and since I can trigger it by choosing the backend, the two must be related somehow.
I do not understand the following sentence you wrote in your comment:

> ndarray-linalg does not sync each API call, including solve, and for all backends the LAPACK implementations also do not.
Stacked question issue. Closing.
I recently played around with Rust for a stochastic simulation (dynamical Monte Carlo via Gillespie's algorithm). In order to increase performance, I parallelized it via `std::thread` over different initial conditions. The state space I use is a lattice in several dimensions, represented by `Array1<u64>` from `ndarray`. So far nothing worth mentioning.

Now I had to adapt the stochastic simulation to do some additional linear algebra along the way (nothing too crazy, just some dozen-dimensional linear problems) and decided to go for `ndarray-linalg` with its recommended backend `openblas`, instead of re-implementing linear algebra routines or pulling in another linear algebra package. However, this caused the parallelization via `std::thread` to stop working: the threads were still visible in `htop`, but they all stayed on the same core, not speeding up the calculation as intended. The only way I could get it to work as expected was to switch to the backend `netlib`. Now my threads populate different cores again and, in this sense, I no longer have a problem that needs fixing.

However, it took me a long time to figure this out, and I am not an expert in the different backends. Is this known and expected behavior? Is it even intended? Is my "solution" to use `netlib` a true solution, or is it a weird workaround that might cause other problems in the future?

A simple example project reproducing the observation is given below. With `openblas` it does not distribute over several cores; with `netlib` it does. Note that the linear algebra operations in this example only appear at the very end, and just once per thread.
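Since the example project itself did not survive in this copy of the issue, here is a rough, dependency-free sketch of the structure it describes, under stated assumptions: four `std::thread` workers, a xorshift64 generator standing in for `rand`, and the per-thread `ndarray-linalg` `solve_into` call indicated only by a comment, so that the sketch compiles with std alone and is independent of the backend choice.

```rust
use std::thread;

// xorshift64 step, mapped to a float in [0, 1); stands in for
// rand::thread_rng() so this sketch has no external dependencies.
fn next_f64(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    println!("Hello, world!");
    let n: u64 = 1_000_000; // the issue uses n = 1_000_000_000
    let handles: Vec<_> = (0..4u64)
        .map(|i| {
            thread::spawn(move || {
                let mut state = 0x9E3779B97F4A7C15u64.wrapping_add(i);
                let mut x = 0.0_f64;
                for _ in 0..n {
                    x += next_f64(&mut state); // pure CPU load, no BLAS involved
                }
                println!(
                    "Thread no {} finished averaging {} random numbers. Result: {}",
                    i,
                    n,
                    x / n as f64
                );
                // Here the original example solves one small linear system,
                // e.g. `a.solve_into(b)` from ndarray-linalg, once per thread.
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```

With this structure the load loop dominates the runtime, which is why the reported `time` output reflects how the OS schedules the four worker threads rather than the cost of the single solve at the end.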