rust-ndarray / ndarray-linalg

Linear algebra package for rust-ndarray using LAPACK binding

Threading behavior depends on backend? #114

Closed (adoa closed this issue 4 years ago)

adoa commented 6 years ago

I recently played around with Rust for a stochastic simulation (dynamical Monte Carlo via Gillespie's algorithm). In order to increase the performance, I parallelized it via std::thread over different initial conditions. The state space I use is a lattice in several dimensions, represented by Array1<u64> from ndarray. So far nothing worth mentioning.

Now I had to adapt the stochastic simulation to do some additional linear algebra along the way (nothing too crazy – just some dozen-dimensional linear problems) and decided to go for ndarray-linalg with its recommended backend openblas – instead of re-implementing linear algebra routines or pulling in another linear algebra package. However, this broke the parallelization via std::thread: the threads were still visible in htop, but they all stayed on the same core, so the calculation was no longer sped up as intended. The only way I could get it to work as expected was to switch to the netlib backend. Now my threads populate different cores again and, in this sense, I no longer have a problem that needs fixing.

However, it took me a long time to figure this out. I am not an expert in the different backends. Is this known and expected behavior? Is it even intended? Is my "solution" to use netlib a true solution or is it a weird workaround, possibly causing some other problems in the future?

A simple example project reproducing the observation is given below. With openblas it does not distribute over several cores; with netlib it does. Note that the linear algebra operations in this example only appear at the very end – and just once per thread.

Cargo.toml
---
[package]
name = "example"
version = "0.1.0"

[dependencies]
rand = "0.5"
ndarray = "0.11"
# ndarray-linalg = {version = "0.9", features=["netlib"]}
ndarray-linalg = {version = "0.9", features=["openblas"]}
main.rs
---
extern crate rand;
extern crate ndarray;
extern crate ndarray_linalg;

use std::thread;

use rand::Rng;
use ndarray::{arr1, Array1};
use ndarray::{arr2, Array2};

use ndarray_linalg::Solve;

fn main() {
    println!("Hello, world!");

    let mut handles = vec![];

    // Large iteration count: the averaging loop below exists only to create CPU load.
    let n = 1000000000;

    for idx in 0..4 {
        let handle = thread::spawn(move || {

            let mut rng = rand::thread_rng();

            let mut x: f64 = 0.;

            // Busy loop: pure CPU load, entirely unrelated to the linear algebra.
            for _ in 0..n {
                x += rng.gen::<f64>();
            }

            println!("Thread no {} finished averaging {} random numbers. \
                     Result: {}", idx, n, x/(n as f64));

            // One small linear solve per thread, only at the very end, via the LAPACK backend.
            let matrix: Array2<f64> = arr2(&[[1., 1., 0.], [0., 2., 0.], [0., 0., 3.]]);
            let vector: Array1<f64> = arr1(&[rng.gen(), rng.gen(), rng.gen()]);

            let result = matrix.solve_into(vector).unwrap();

            println!("Thread no {} says: result = {}", idx, result);

        });

        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }

}
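
One knob that seems worth testing here is OpenBLAS's own thread pool. Below is a minimal sketch; OPENBLAS_NUM_THREADS is a documented OpenBLAS environment variable, but whether setting it from inside the process happens early enough is an assumption about OpenBLAS's initialization timing. Exporting it in the shell instead ($ OPENBLAS_NUM_THREADS=1 cargo run --release) avoids that assumption.

use std::env;

fn main() {
    // Assumption: OpenBLAS has not initialized its thread pool yet, so this
    // setting takes effect for all later LAPACK calls in this process.
    env::set_var("OPENBLAS_NUM_THREADS", "1");

    // ... then spawn the std::thread workers exactly as in main.rs above ...
}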
termoshtt commented 6 years ago

Your code seems to spend most of its time in x += rng.gen::<f64>(); for large n, since the matrix is so small. Are you saying that the spawned threads themselves are affected by the backend selection?

ndarray-linalg does not synchronize individual API calls such as solve, and none of the backend LAPACK implementations do so either.
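
If serialization of the LAPACK calls were ever needed, it would have to happen on the caller's side. A minimal sketch using only the standard library's Mutex (this is not something ndarray-linalg provides):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared lock that each worker takes around its solve call, so that at
    // most one LAPACK call runs at a time.
    let solve_lock = Arc::new(Mutex::new(()));

    let handles: Vec<_> = (0..4)
        .map(|_idx| {
            let solve_lock = Arc::clone(&solve_lock);
            thread::spawn(move || {
                // ... per-thread Monte Carlo work ...
                let _guard = solve_lock.lock().unwrap();
                // ... call solve_into here while holding the lock ...
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}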

adoa commented 6 years ago

My code takes a long time at x += rng.gen::<f64>(); because it is repeated n = 1000000000 times. This addition of random numbers is entirely unrelated to the matrix operations. Its only purpose is to create CPU load.

Moreover, I know that the matrices I am working with are small enough to be handled on a single CPU core. In fact, I do not need the internals of the solve_into method to work concurrently. I suspect that the fact that openblas does work concurrently (to speed up the matrix multiplication) is somehow interfering with the threads that I spawn. But then again, I am no expert on what openblas or the netlib version of blas are actually doing. Maybe the netlib blas is just as concurrent as openblas – honestly, I have no idea.
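
One way to test this suspicion directly on Linux would be to inspect the process affinity mask before and after the first solve. A minimal sketch; Cpus_allowed_list is a standard field of /proc/self/status, and the helper name is mine:

use std::fs;

// Print which CPUs the current process is allowed to run on (Linux only).
// If the list shrinks to a single CPU after the first solve, the backend
// narrowed the affinity mask; if not, the scheduler is making a free choice.
fn print_allowed_cpus(label: &str) {
    let status = fs::read_to_string("/proc/self/status")
        .expect("could not read /proc/self/status");
    for line in status.lines() {
        if line.starts_with("Cpus_allowed_list") {
            println!("{}: {}", label, line);
        }
    }
}

Calling print_allowed_cpus("before solve") at the start of main and print_allowed_cpus("after solve") right after the first solve_into would distinguish the two explanations.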

As I said, the given code is meant to illustrate my observation; it is not the actual simulation I am interested in. Once again: my observation is that with the backend openblas, the above code creates 4 threads that all remain on the same core of my CPU. When I compile the code via $ cargo build --release and check the execution time via $ time cargo run --release, the netlib build gives the following output (the openblas figures are described below):

Hello, world!
Thread no 2 finished averaging 1000000000 random numbers. Result: 0.4999994641660759
Thread no 2 says: result = [-0.31555953303615353, 0.4693003130876617, 0.10824291622042403]
Thread no 3 finished averaging 1000000000 random numbers. Result: 0.4999958195165672
Thread no 3 says: result = [0.860459658939488, 0.1205206858302787, 0.2614362692329853]
Thread no 0 finished averaging 1000000000 random numbers. Result: 0.5000182872080169
Thread no 0 says: result = [-0.19424866527420626, 0.3792818648760721, 0.2260402182401063]
Thread no 1 finished averaging 1000000000 random numbers. Result: 0.5000026187754052
Thread no 1 says: result = [0.18686955237156083, 0.0939948264568205, 0.16733912942573215]
cargo run --release  19.15s user 0.00s system 395% cpu 4.845 total

With openblas, my code uses only one CPU core (101% cpu in that run) and therefore takes much longer (~18 seconds): the process does spawn four child threads, but each gets only about 25% of that single core. With the backend netlib, on the other hand, each thread uses a different core, resulting in a total usage of four cores (395% cpu in the output above; 19.15 s of user time over 4.845 s of wall time is indeed about 3.95 cores), and everything is done in under five seconds. The only difference between the two runs is the backend selected in Cargo.toml; main.rs is exactly the same.

Can you reproduce this timing behavior on your machine as well? Maybe it is very specific to my computer: I use Rust 1.27.1 from rustup on an Ubuntu 18.04 desktop machine with Linux kernel 4.15.0-24-generic. The processor is an Intel Xeon E3-1240 V2 @ 3.40 GHz, a quad-core with hyperthreading, so eight hardware threads are available. It could be that my operating system's scheduler decides not to distribute the threads over different CPU cores, but there must be a reason for that decision. And since I can trigger the behavior just by choosing the backend, the two must be related somehow.
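
If an affinity mask set behind the scenes turns out to be the cause, one possible caller-side countermeasure would be to pin each worker to its own core up front. A sketch using the third-party core_affinity crate (the crate choice and the re-pinning idea are assumptions on my part, not anything ndarray-linalg offers):

extern crate core_affinity;

use std::thread;

fn main() {
    // Ask the OS which cores are available to this process.
    let core_ids = core_affinity::get_core_ids().expect("could not query core IDs");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin this worker before doing any work, so that a later
                // affinity change elsewhere cannot bunch the workers together.
                core_affinity::set_for_current(core_id);
                // ... per-thread simulation and solve_into as in main.rs ...
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}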

I do not understand the following sentence you wrote in your comment:

ndarray-linalg does not synchronize individual API calls such as solve, and none of the backend LAPACK implementations do so either.

termoshtt commented 4 years ago

Closing as a stale question issue.