scientific-python / faster-scientific-python-ideas

Brainstorm how to make scientific Python ecosystem faster

Take advantage of contiguous arrays #6

Open itamarst opened 2 months ago

itamarst commented 2 months ago

NumPy views can point at non-contiguous chunks of memory. General-purpose code that accepts NumPy arrays therefore has to handle both contiguous and non-contiguous layouts, which in practice means it gets compiled assuming non-contiguous memory. That loses out on potential optimizations, in particular automatic use of SIMD: if the compiler knows the array is contiguous, it can skip a bunch of stride computations and optimize more aggressively.
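As a small illustration of what "non-contiguous" means here, the sketch below builds a view that skips every other column and inspects its layout flags and strides. The array shape and variable names are arbitrary choices for the example.

```python
import numpy as np

# A freshly allocated 3x4 float64 array is C-contiguous: rows are packed
# back to back, so consecutive elements in a row are 8 bytes apart.
base = np.arange(12, dtype=np.float64).reshape(3, 4)

# Taking every other column creates a view onto the same buffer. No data is
# copied, but elements within a row are now 16 bytes apart.
view = base[:, ::2]

base_is_contiguous = bool(base.flags["C_CONTIGUOUS"])
view_is_contiguous = bool(view.flags["C_CONTIGUOUS"])

print(base_is_contiguous, base.strides)  # True (32, 8)
print(view_is_contiguous, view.strides)  # False (32, 16)
```

Code that only ever sees `base` could walk a flat buffer; code that may also see `view` has to apply the per-dimension strides on every access.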

Contiguous inputs are going to be very common; how common depends on the domain and function. So it would be good to get maximum speed for those.

That means compiling two versions of expensive functions, one for contiguous arrays and one for non-contiguous arrays, and choosing the appropriate one based on inputs. And as a library author I would like to do this with minimum code duplication!

Numba does this automatically, but for most languages this requires changes to the code.

itamarst commented 2 months ago

Cython support

Cython supports this by using fused types, with minimal code duplication. See the example here: https://pythonspeed.com/articles/faster-cython-simd/

It might be useful to document this pattern in the Cython documentation, at minimum. And it could in theory be added as a language feature so as to minimize boilerplate.

itamarst commented 2 months ago

Rust support

The most commonly used crate for arrays is ndarray. It's unclear to me whether it can even generate code that's specifically for contiguous arrays.

rgommers commented 2 months ago

In many cases it will also be fine to only support contiguous arrays, and make a copy first when getting non-contiguous arrays (possibly in Python code, before passing them to a function in an extension module). This is a common pattern when using Pythran. The end result is usually better performance on the common case, while still supporting the uncommon case.
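A minimal sketch of that copy-first pattern, assuming the copy happens in a Python wrapper: `_kernel_contiguous_only` is a hypothetical stand-in for a compiled function that accepts only C-contiguous input, and `total` is an invented wrapper name. `np.ascontiguousarray` returns its input unchanged when the array is already C-contiguous, so the common case pays nothing.

```python
import numpy as np


def _kernel_contiguous_only(arr):
    # Hypothetical stand-in for an extension-module function that was
    # compiled for C-contiguous input only.
    if not arr.flags["C_CONTIGUOUS"]:
        raise ValueError("expected a C-contiguous array")
    return float(arr.reshape(-1).sum())


def total(arr):
    # Copies only if the input is not already C-contiguous.
    contiguous = np.ascontiguousarray(arr)
    return _kernel_contiguous_only(contiguous)


data = np.arange(12, dtype=np.float64).reshape(3, 4)
strided = data[:, ::2]

print(total(data))     # no copy made
print(total(strided))  # copied once, then processed on the fast path
```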

itamarst commented 1 month ago

I'm a little wary of copying as a solution. High memory usage can have a significant impact on computation costs (RAM isn't cheap), and there's the risk of hitting the swapping performance cliff. And it's already super-easy to end up with way-too-high memory usage with explicit APIs.
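To put a rough number on the memory concern, the sketch below compares a strided view with its contiguous copy. The array size is an arbitrary choice for the example; real workloads would be much larger.

```python
import numpy as np

base = np.zeros((1000, 1000), dtype=np.float64)  # 8,000,000 bytes of data
view = base[:, ::2]                              # a view: no new data buffer

# Making the view contiguous allocates a second buffer alongside `base`.
copied = np.ascontiguousarray(view)

extra_bytes = 0 if np.shares_memory(copied, base) else copied.nbytes
print(base.nbytes, extra_bytes)  # 8000000 4000000
```

Here the hidden copy adds half the size of the original array to peak memory for as long as both are alive.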

So adding intermittent, hidden copying of large arrays seems like a bad idea in generic library APIs, at least. In the context of applications rather than libraries, where the author has a better understanding of the inputs and runtime environment, it might be a good solution though.