jturner314 / nditer

High performance iteration over n-dimensional arrays
Apache License 2.0
6 stars 1 forks source link

Roadmap #1

Open jturner314 opened 5 years ago

jturner314 commented 5 years ago
ethanhs commented 2 years ago

Out of curiosity, what is the current status of this crate? I would be happy to help work on the above outlined features, and I would love to see faster iteration merged into ndarray.

jturner314 commented 2 years ago

I just haven't had a chance to work on it in a couple of years. I think the design is generally pretty good, although there are a few areas which could be improved. The last time I worked on this, I think it was pretty much ready for use; I just wanted to test it more thoroughly. (I didn't feel comfortable releasing it on crates.io without more thorough testing.) I started work on automated, randomized testing (via the proptest crate), but I haven't finished that yet. I plan to eventually to finish the testing stuff and release nditer (and possibly integrate it into ndarray); I just haven't had much time to work on it.

It'll probably be at least a few months before I can get back to this project. You may find it useful as-is. I'm not sure the best way to contribute to this repository right now, since my time is limited. You could always fork the project and then add some testing, adapt it to your needs, etc. For a more immediate impact on ndarray, you could use some of the ideas to speed up ndarray::Zip. (See optimize.rs for the iteration optimizations.)

Out of curiosity, I just ran the iteration benchmarks with an up-to-date version of ndarray. While nditer is faster than ndarray for large arrays, the difference isn't as large as I remember. For small arrays, ndarray is faster, since it doesn't perform the more expensive optimizations. (It should be possible to make nditer perform comparably well by using a heuristic to skip the optimizations for small arrays, but I haven't done that yet.) For larger arrays, if the array is contiguous and in C/Fortran order, the performance is effectively the same for ndarray and nditer. For larger arrays with non-standard layouts (discontiguous or in unusual order), nditer is 10–20% faster than ndarray in the benchmarks. I suspect that with even larger arrays, the difference would be larger. The equal_lengths_permuted_ixdyn benchmark is a notable exception to the moderate gains observed in the other benchmarks – nditer is about 5 times faster than ndarray for the largest case. I'd guess that this is due to allocations in the inner loop of ndarray's implementation for this case.

Larger gains are possible with loop tiling optimizations, which I haven't pushed to GitHub yet. When iterating over multiple arrays of differing layout, loop tiling can be multiple times faster than a naïve implementation. In some preliminary testing, I saw gains of >3x when comparing nditer with loop tiling to ndarray, Julia, NumPy, NumExpr, and xtensor for an operation on two 10001x10001 arrays of differing layouts. Providing a nice API for SIMD operations would also be useful from a performance perspective, although that's largely blocked on std::simd becoming stable.

Other than performance, the primary benefit of nditer is the more flexible iterator adapters. The API is more similar to std::iter::Iterator than ndarray::Zip. It's convenient for operations which are difficult to express with ndarray::Zip's more restrictive API.

ethanhs commented 2 years ago

OK that's good to know, thank you! I also would love to work on improving testing ndarray code, it should help testing another crate I want to work on.

Also std::simd is in nightly now so I think an API based on it could be iterated on.