rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
https://docs.rs/ndarray/
Apache License 2.0

Parallel indexed iterator #279

Closed. msiglreith closed this issue 7 years ago.

msiglreith commented 7 years ago

IndexedIter and IndexedIterMut should have parallel iterator support.

Often it's desirable to have the element coordinates in addition to the element itself, for accessing fields in other arrays. Example use case (part of fluid simulation code):

div.indexed_iter_mut().for_each(|((y, x), div)| {
    *div = -(vel.x[(y, x+1)] - vel.x[(y, x)] + vel.y[(y+1, x)] - vel.y[(y, x)]);
});

I quickly tried to implement it myself, similar to the AxisIter implementation, but I'm unsure how to properly split the iterator.

bluss commented 7 years ago

Right. The indexed iterator is already slow, I fear.

mokasin commented 7 years ago

Related to that, I'm wondering how to compute something like a gradient on a field efficiently in parallel. That means I need to assign to every array element a value that depends on the values at neighbouring indices. I tried this very naive (and silly) approach, here simplified to 1D:

extern crate rayon;
extern crate ndarray;

use rayon::prelude::*;
use ndarray::Array;

fn main() {
    let a = Array::from_elem(10, 1);
    let mut b = Array::zeros(10);

    (1usize..8)
        .into_par_iter()
        .map(|i|
             b[i] = a[i+1] - a[i-1]
        );

    println!("array: {:?}", b);
}

which fails miserably:

error[E0387]: cannot borrow data mutably in a captured outer variable in an `Fn` closure
  --> src/main.rs:14:14
   |
14 |              b[i] = a[i+1] - a[i-1]
   |              ^
   |
help: consider changing this closure to take self by mutable reference
  --> src/main.rs:13:14
   |
13 |           .map(|i|
   |  ______________^ starting here...
14 | |              b[i] = a[i+1] - a[i-1]
   | |___________________________________^ ...ending here

I'd like to do something like this for an n-dimensional array. Leaving aside the point that this might split in a very cache-unfriendly way: how do I do that?

bluss commented 7 years ago

Right, and I would shoot for vectorization (SIMD parallelization) before thread-based parallelization.

@mokasin We want to add a parallel version of array zips, which should address this on a basic level. For 1D we can of course use collect_into on a Vec and then make an array, since Vec → Array conversion has no allocation/copy cost.
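
For illustration, a minimal sketch of that Vec route for the 1D gradient above (my own code, not from the thread; it uses a plain collect rather than collect_into):

extern crate rayon;
extern crate ndarray;

use rayon::prelude::*;
use ndarray::Array1;

fn main() {
    let a = Array1::from_elem(10, 1);

    // Compute the interior gradient in parallel into a Vec...
    let b: Vec<i32> = (1usize..9)
        .into_par_iter()
        .map(|i| a[i + 1] - a[i - 1])
        .collect();

    // ...then wrap it as an array; the Vec -> Array1 conversion reuses the allocation.
    let b = Array1::from_vec(b);
    println!("array: {:?}", b);
}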

bluss commented 7 years ago

The example here is not about threading, but it can inspire a solution that computes the gradients using array-wide operations. That will at least achieve some limited degree of SIMD parallelization.

https://bluss.github.io/rust-ndarray/master/ndarray/macro.s.html

bluss commented 7 years ago

The n-ary Zip work brings some new perspectives. It's not very simple to split an index iterator, but we can split a producer (think of it like an IntoIterator: it doesn't have the problem of holding a partially iterated state).

bluss commented 7 years ago

Ok, Zip now exists.

@msiglreith Let's try the idea that we don't need indices! The mapping x, y → y, x is a transpose, so that's easy. And with some slicing we can get both the current and the next element along an axis. Something like the following code. Maybe one should double check that the computation is correct, and also think hard about how to handle borders.

One can play with the memory layouts here to make it perform well. div and vx, vy need to have opposite layouts, since we access vx, vy transposed. I picked F order for vx, vy, so that the sliced, transposed arrays have contiguous rows, which is the case ndarray optimizes for by preference (it will be more open-minded in the future).

Zip parallelization is basically ready, but needs rayon-git: #288 (It's possible I'll merge this soon enough, but can't publish before rayon has a new version.) Using parallelization is as simple as switching from apply to par_apply, when ndarray-parallel has the changes.


extern crate rand;
#[macro_use(s)]
extern crate ndarray;
extern crate ndarray_rand;

use ndarray::prelude::*;
use ndarray_rand::RandomExt;
use rand::distributions::Range;
use ndarray::Zip;

fn main() {
    type Arrayf64<D> = Array<f64, D>;
    const N: usize = 10;
    // F-order velocity fields, so their sliced, transposed views have contiguous rows.
    let vx = Arrayf64::random((N, N).f(), Range::new(0., 100.));
    let vy = Arrayf64::random((N, N).f(), Range::new(0., 100.));
    println!("{:8.3}", vx);
    println!("{:8.3}", vy);
    let mut div = Arrayf64::zeros((N - 1, N - 1));

    // Pair each div cell with the current and next velocity samples along each axis,
    // using slicing and a transpose instead of explicit (y, x) indices.
    Zip::from(&mut div)
        .and(vx.slice(s![..-1, ..-1]).t())
        .and(vx.slice(s![1.., ..-1]).t())
        .and(vy.slice(s![..-1, ..-1]).t())
        .and(vy.slice(s![..-1, 1..]).t())
        .apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        });
    println!("{:8.3}", div);
}
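
For reference, switching to the parallel version (once ndarray-parallel has the changes) should only mean swapping apply for par_apply. A rough sketch under that assumption, wrapped in a hypothetical divergence helper:

#[macro_use(s)]
extern crate ndarray;
extern crate ndarray_parallel;

use ndarray::prelude::*;
use ndarray::Zip;
use ndarray_parallel::prelude::*;

// div is expected to be one element smaller than vx/vy along each axis,
// exactly as in the example above.
fn divergence(vx: &Array2<f64>, vy: &Array2<f64>, div: &mut Array2<f64>) {
    Zip::from(div)
        .and(vx.slice(s![..-1, ..-1]).t())
        .and(vx.slice(s![1.., ..-1]).t())
        .and(vy.slice(s![..-1, ..-1]).t())
        .and(vy.slice(s![..-1, 1..]).t())
        // Same closure as before, but split across rayon's thread pool.
        .par_apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        });
}
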
bluss commented 7 years ago

@mokasin We're going to add a window producer/iterable, but I actually think something array-wide is much faster than iterating the array in windows around each element.

Something array wide would indeed be something like:

b += a[2..] - a[..-2]   // pseudocode because Rust slicing is not this nice
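
In actual ndarray syntax, that pseudocode would read roughly like the following (a sketch of mine, using the s![] macro and arithmetic on slice views):

#[macro_use(s)]
extern crate ndarray;

use ndarray::prelude::*;

fn main() {
    let a = Array1::from_vec((0..10).map(|x| x as f64).collect());
    let mut b = Array1::<f64>::zeros(8);

    // b += a[2..] - a[..-2], spelled out with slices of views
    b += &(&a.slice(s![2..]) - &a.slice(s![..-2]));
    println!("{:?}", b);
}
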
msiglreith commented 7 years ago

Great work! I tried it in my code and it seems to work fine (no visual difference in the resulting flow).

ndarray::Zip::from(div)
        .and(vel.x.slice(s![.., 1..]))
        .and(vel.x.slice(s![.., ..-1]))
        .and(vel.y.slice(s![1.., ..]))
        .and(vel.y.slice(s![..-1, ..]))
        .apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        });

I'm using a staggered grid, so boundary conditions are not really an issue, except that I needed to adjust the dimensions of the grids (vel.x and vel.y are larger than div). That's why I added the additional slice operations .slice(s![.., ..-1]) and .slice(s![..-1, ..]).

A drawback of this approach is slightly reduced readability, imo. I had to read up on the documentation first to grasp the whole concept, and even then direct indexing is easier to read. Currently most of my code is based on tuple indexing; some computationally heavy parts are optimized by direct array index calculation. The Zip-based approach might serve as an intermediate optimization step.

Divergence calculation isn't a performance-heavy step in my fluid simulation. Even so, is there any overhead to this Zip-based approach compared to array indexing? Direct indexing sample:

let (h, w) = div.dim();

let div = div.as_slice_mut().unwrap();
let vx = vel.x.as_slice().unwrap();
let vy = vel.y.as_slice().unwrap();

let mut idx = 0;

for y in 0..h {
    for _x in 0..w {
        div[idx] = -(vx[idx + y + 1] - vx[idx + y] + vy[idx + w] - vy[idx]);
        idx += 1;
    }
}

Thanks again for your work on this issue, I appreciate it!

bluss commented 7 years ago

Zip is created so that we have a safe Rust way to do this efficiently. It looks at the memory layouts of the inputs and tries to do its best. (If all inputs are C/F contiguous, it uses a linear index through each array's memory; otherwise it “unrolls” the innermost dimension and loops over it using each array's stride × an index.)

That means that if you take care with the memory layouts (see the previous message about that), Zip should always win. There are no bounds checks, unlike in your slice-indexing example. I'm pretty sure the bounds checks don't optimize out in your example (but I would be happy to be positively surprised).
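
To make that concrete, here is a small check of my own (not from the thread) showing which layouts expose a contiguous slice:

extern crate ndarray;

use ndarray::prelude::*;

fn main() {
    let c = Array2::<f64>::zeros((4, 4));     // C (row-major) layout
    let f = Array2::<f64>::zeros((4, 4).f()); // F (column-major) layout

    // A C-order array exposes its memory as one contiguous slice...
    assert!(c.as_slice().is_some());
    // ...while an F-order array does not (as_slice requires standard layout),
    assert!(f.as_slice().is_none());
    // but its transposed view is standard-layout again.
    assert!(f.t().is_standard_layout());
}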

Note that this:

Zip::from(&mut div)
        .and(vx.slice(s![..-1, ..-1]).t())
        .and(vx.slice(s![1.., ..-1]).t())
        .and(vy.slice(s![..-1, ..-1]).t())
        .and(vy.slice(s![..-1, 1..]).t())
        .apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        })

is equivalent to:

div -= vx.slice(s![..-1, ..-1]).t();
div += vx.slice(s![1.., ..-1]).t();
div -= vy.slice(s![..-1, ..-1]).t();
div += vy.slice(s![..-1, 1..]).t();

But I don't have a parallel version of the latter!

bluss commented 7 years ago

Zip::indexed exists and is parallelized in master, just not in a proper (non-alpha) release yet. It lives there, rather than among the iterators. We're expanding and extending the NdProducer concept now. Our current idea is that splitting is something a producer or IntoIterator can implement, not an iterator.
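
For the original use case, that looks roughly like the following sketch of mine (sequential here; with the parallel support it should just be apply → par_apply, as in the earlier sketch):

extern crate ndarray;

use ndarray::prelude::*;
use ndarray::Zip;

fn main() {
    let vx = Array2::<f64>::ones((11, 11));
    let vy = Array2::<f64>::ones((11, 11));
    let mut div = Array2::<f64>::zeros((10, 10));

    // Zip::indexed hands the closure the (y, x) index alongside each element,
    // so neighbouring cells of vx and vy can be addressed directly.
    Zip::indexed(&mut div).apply(|(y, x), div| {
        *div = -(vx[(y, x + 1)] - vx[(y, x)] + vy[(y + 1, x)] - vy[(y, x)]);
    });
    println!("{:8.3}", div);
}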

bluss commented 7 years ago

Example use with rayon https://github.com/bluss/rust-ndarray/blob/994df3d2129da0f17af8db68129a747b5fc08297/parallel/tests/zip.rs#L65-L83

msiglreith commented 7 years ago

Amazing work! Thanks for the quick implementation, will play around with it in the next days :)

Renmusxd commented 2 years ago

Have there been any updates on IndexedIter and IndexedIterMut gaining parallelization support?

bluss commented 2 years ago

rayon support is through Zip, see the previous comment https://github.com/rust-ndarray/ndarray/issues/279#issuecomment-292360748