Roadmap - Githubissues

jturner314 commented 5 years ago

[ ] Add FoldInplaceAxes (similar to FoldAxes but without iteration order limitation).
[ ] Consider whether it's really necessary for .merge_axes() and .invert_axis() to operate in-place on the producer. A cleaner solution may be to handle all layout mutation in the iterator.
[ ] Add support for zipping arbitrary subsets of axes, then
- [ ] Add ExactChunksAsAxes.
[ ] Add MaskAlongAxis.
[ ] Pairwise summation
- [X] Add method for pairwise summation.
- [ ] Add method for pairwise summation over a subset of axes.
[ ] Add rayon support (probably requires addition of NdConsumer and NdSink traits and Pipe adapter).
[x] Add strongly-typed axes and improve API for specifying axes (e.g. AxesAll, AxesOnly, AxesExcept).
[ ] Improve benchmarks and performance:
- [ ] Check if unrolling two inner loops in Iter is actually worth the extra complexity.
- [ ] Check if eliminating the heap allocations in the merged field of OptimRecorder would be beneficial.
- [ ] Benchmark with black box axis lengths.
- [ ] Confirm that BroadcastProducer is really faster than adding axes with stride 0 to the array. Preliminary benchmarking indicated that this is the case, but benchmarking is needed with different layouts and more complex producer wrappers (that don't inline as readily).
- [ ] See if a non-recursive implementation of pairwise_sum (i.e. with an explicit stack) is faster.
- [ ] Add loop tiling optimizations.
- [ ] Skip the more expensive optimizations for small arrays.
[ ] Add methods for explicit SIMD.
[ ] Reduce usage of unsafe.
Once Rust has generic associated types,
- [ ] Add support for repeatedly viewing mutable elements (where the item type can use the lifetime of the borrow of the source). This will allow FoldInplaceAxes to be implemented in terms of Zip.
- [ ] Add support for windows viewing mutable elements and buffered windows (to extend windows over the edges).
- [ ] Add AccumAxis (for e.g. cumulative sum).
Once Rust has specialization,
- [ ] Provide a generic .collect() method on NdProducer and specialize the implementation for FoldInplaceAxes.

ethanhs commented 2 years ago

Out of curiosity, what is the current status of this crate? I would be happy to help work on the above outlined features, and I would love to see faster iteration merged into ndarray.

jturner314 commented 2 years ago

I just haven't had a chance to work on it in a couple of years. I think the design is generally pretty good, although there are a few areas which could be improved. The last time I worked on this, I think it was pretty much ready for use; I just wanted to test it more thoroughly. (I didn't feel comfortable releasing it on crates.io without more thorough testing.) I started work on automated, randomized testing (via the proptest crate), but I haven't finished that yet. I plan to eventually to finish the testing stuff and release nditer (and possibly integrate it into ndarray); I just haven't had much time to work on it.

It'll probably be at least a few months before I can get back to this project. You may find it useful as-is. I'm not sure the best way to contribute to this repository right now, since my time is limited. You could always fork the project and then add some testing, adapt it to your needs, etc. For a more immediate impact on ndarray, you could use some of the ideas to speed up ndarray::Zip. (See optimize.rs for the iteration optimizations.)

Out of curiosity, I just ran the iteration benchmarks with an up-to-date version of ndarray. While nditer is faster than ndarray for large arrays, the difference isn't as large as I remember. For small arrays, ndarray is faster, since it doesn't perform the more expensive optimizations. (It should be possible to make nditer perform comparably well by using a heuristic to skip the optimizations for small arrays, but I haven't done that yet.) For larger arrays, if the array is contiguous and in C/Fortran order, the performance is effectively the same for ndarray and nditer. For larger arrays with non-standard layouts (discontiguous or in unusual order), nditer is 10–20% faster than ndarray in the benchmarks. I suspect that with even larger arrays, the difference would be larger. The equal_lengths_permuted_ixdyn benchmark is a notable exception to the moderate gains observed in the other benchmarks – nditer is about 5 times faster than ndarray for the largest case. I'd guess that this is due to allocations in the inner loop of ndarray's implementation for this case.

Larger gains are possible with loop tiling optimizations, which I haven't pushed to GitHub yet. When iterating over multiple arrays of differing layout, loop tiling can be multiple times faster than a naïve implementation. In some preliminary testing, I saw gains of >3x when comparing nditer with loop tiling to ndarray, Julia, NumPy, NumExpr, and xtensor for an operation on two 10001x10001 arrays of differing layouts. Providing a nice API for SIMD operations would also be useful from a performance perspective, although that's largely blocked on std::simd becoming stable.

Other than performance, the primary benefit of nditer is the more flexible iterator adapters. The API is more similar to std::iter::Iterator than ndarray::Zip. It's convenient for operations which are difficult to express with ndarray::Zip's more restrictive API.

ethanhs commented 2 years ago

OK that's good to know, thank you! I also would love to work on improving testing ndarray code, it should help testing another crate I want to work on.

Also std::simd is in nightly now so I think an API based on it could be iterated on.

jturner314 / nditer

Roadmap #1