rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
https://docs.rs/ndarray/
Apache License 2.0

Fastest way to convert to vec/iterate elements #1339

Closed · RunDevelopment closed this issue 7 months ago

RunDevelopment commented 7 months ago

Hi! I wanted to know what's the fastest way to convert a PyReadonlyArrayDyn<f32> into a Vec<f32>.

I'm making Python bindings for an image processing program, and converting numpy arrays into a contiguous memory format is the first step. I'm already using ndarray.as_slice(), which allows me to skip the conversion to vec for contiguous arrays. This issue is about indexed ndarrays. I'm currently using the following code for the conversion:

let flat: Vec<f32> = ndarray.as_array().iter().cloned().collect();

Compared to a slice.to_vec(), this is more than 10x slower. Running this code on a (4320, 8468, 4) ndarray takes 0.2sec if the array is a slice (so it runs slice.to_vec()) and 3sec when the array is indexed (I cropped off 1px on the left side).
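
For context, here is roughly what my conversion looks like (a simplified sketch; the function name and signature are illustrative, not my actual bindings):

use numpy::PyReadonlyArrayDyn;

// Sketch only: contiguous arrays take the fast slice path, everything
// else falls back to element-wise iteration.
fn to_flat_vec(ndarray: PyReadonlyArrayDyn<'_, f32>) -> Vec<f32> {
    match ndarray.as_slice() {
        // Contiguous: a plain memcpy-style copy (~0.2sec in my benchmark).
        Ok(slice) => slice.to_vec(),
        // Non-contiguous (e.g. sliced in Python): iterate every element (~3sec).
        Err(_) => ndarray.as_array().iter().cloned().collect(),
    }
}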

This is a huge performance cliff for me, so I would like to know what I can do to fix this.

nilgoyette commented 7 months ago

By "indexed", do you mean "not the complete array, just a portion of it"?

In general, iter is slow in ndarray. You might want to test to_owned then to_vec and tell us about the performance. In all cases, 3s seems really high!

RunDevelopment commented 7 months ago

By "indexed", do you mean "not the complete array, just a portion of it"?

Sorry, I meant a sliced array. E.g. the cropped image array from my "benchmark" was created in python like this: img[:, 1:, ...].

You might want to test to_owned then to_vec and tell us about the performance.

Could you please explain this in more detail? Changing my above code to:

let flat: Vec<f32> = ndarray.to_owned().to_vec().unwrap();

gives me a NotContiguousError at runtime.

So I tried using the to_vec method from ArrayBase. I also had to reshape the array since I'm working with 2D/3D images and to_vec is only available for 1D arrays. The following code did work correctly:

let flat: Vec<f32> = ndarray
    .as_array()
    .to_owned()
    .into_shape(ndarray.shape().iter().product::<usize>())
    .unwrap()
    .to_vec();

This clocked in at 1.1sec, which is still 6x slower than slice.to_vec(). The breakdown: to_owned takes 0.95sec and to_vec takes ~0.15sec. So copying from contiguous memory into a vec is as fast as ever, but to_owned is very slow.

RunDevelopment commented 7 months ago

I did some further testing and the main reason to_owned is so slow seems to be that my array uses IxDyn. Using .into_dimensionality() to "cast" the array to use Ix3 speeds up to_owned from 0.95sec to 0.25sec (4x faster). Doing this "cast" also sped up my old .iter().copied().collect() to take 0.4sec, so iteration is 8-9x faster.

So ArrayBase::iter() should probably do something similar internally as an optimization.

I also found the .into_raw_vec() method, which avoids the extra copy that .to_vec() makes. Combined with the into_dimensionality() optimization, this brings the total down to 0.25sec, which seems to be the best that is possible right now.
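
For reference, a sketch of the full path I ended up with (3D case only; error handling simplified, and the helper name is made up):

use ndarray::{ArrayViewD, Ix3};

// Sketch only: "cast" away IxDyn first, then copy once into a
// standard-layout owned array and take its backing Vec directly.
fn flatten_3d(view: ArrayViewD<'_, f32>) -> Vec<f32> {
    view.into_dimensionality::<Ix3>() // IxDyn -> Ix3: makes to_owned ~4x faster
        .expect("expected a 3D array")
        .to_owned()                   // single copy into a contiguous Array3
        .into_raw_vec()               // take the backing Vec without another copy
}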

adamreichold commented 7 months ago

If you do assume a dimensionality of 3, you could also change your Python API to expect PyReadonlyArray3<f32> instead.

RunDevelopment commented 7 months ago

If you do assume a dimensionality of 3

I don't. I just simplified my findings for the sake of brevity. In my code, I match on array.shape().len() to support 2D and 3D arrays (we use 2D arrays for grayscale images in the software I work with).

adamreichold commented 7 months ago

You might still be better off accepting an enum of PyReadonlyArray2<f32> or PyReadonlyArray3<f32>, cf. https://github.com/PyO3/rust-numpy/blob/main/examples/simple/src/lib.rs#L113
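
Roughly like this (a sketch based on that example, not the exact code from it; the names here are made up):

use numpy::{PyReadonlyArray2, PyReadonlyArray3};
use pyo3::prelude::*;

// Sketch only: pyo3's FromPyObject derive tries each variant in order,
// so the function accepts either a 2D (grayscale) or 3D (color) array
// with a fixed dimensionality, and IxDyn never comes into play.
#[derive(FromPyObject)]
enum Image<'py> {
    Gray(PyReadonlyArray2<'py, f32>),
    Color(PyReadonlyArray3<'py, f32>),
}

#[pyfunction]
fn flatten(img: Image<'_>) -> Vec<f32> {
    match img {
        Image::Gray(a) => a.as_array().to_owned().into_raw_vec(),
        Image::Color(a) => a.as_array().to_owned().into_raw_vec(),
    }
}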

RunDevelopment commented 7 months ago

Since I figured out the currently fastest way to get a vec, I consider this issue closed now.

Should I make a new issue for the IxDyn performance problem?

adamreichold commented 7 months ago

Should I make a new issue for the IxDyn performance problem?

Personally, I am not sure I consider it a problem: the iterator is most likely not the only place where IxDyn is measurably slower than Ix?, and I would like to avoid papering over all of those places with small-dimensionality fast paths. I think the main learning is to use the fixed-dimensionality types if that is at all possible.

nilgoyette commented 7 months ago

Should I make a new issue for the IxDyn performance problem?

If you do, please explain clearly and with at least one code example.

RunDevelopment commented 7 months ago

I think the main learning is to use the fixed-dimensionality types if that is at all possible.

If you want to teach people how to use ndarrays most efficiently, I would suggest writing documentation or changing the API to make IxDyn less convenient to use. Right now, IxDyn is the closest thing to how arrays behave in NumPy, so I would argue that people will naturally gravitate towards it. I certainly did.