rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
https://docs.rs/ndarray/
Apache License 2.0

Adds `triu` and `tril` methods directly to ArrayBase #1386

Closed · akern40 closed this 1 month ago

akern40 commented 1 month ago

The implementation seems to work, but I'm unsure whether this should be a free function, a method bound directly to `ArrayBase`, or a trait. Closes #1323.

akern40 commented 1 month ago

Hold off on any merge for a moment; `tril` panics when given a 0-D array. Working on a fix.

akern40 commented 1 month ago

Ok, fixed.

Not to move conversations, but two thoughts on the (now resolved) discussion about order of the input array:

  1. Would there be appetite for public `{zeros|ones|full}_like` functions? That way users wouldn't have to write the rather convoluted `self.raw_dim().set_f(is_layout_f(&self.dim, &self.strides))` that I'm doing (a rough sketch follows the benchmark results below). Happy to make a tracking issue and put in the PRs for that.
  2. I did a quick benchmark of `triu`/`tril`'s performance on c-order vs f-order inputs, and it's not great; results are below. The slowdown is quite apparent at larger array sizes, as expected. The `via_t` lines refer to the quick-and-dirty workaround I could think of, which is `x_tril = x.t().triu(0).t().to_owned()`.
    
    running 3 tests (64 x 64)
    test triu_c       ... bench:       1,681 ns/iter (+/- 45)
    test triu_f       ... bench:       2,011 ns/iter (+/- 32)
    test triu_f_via_t ... bench:       2,133 ns/iter (+/- 29)

    running 3 tests (2048 x 2048)
    test triu_c       ... bench:   2,418,649 ns/iter (+/- 320,729)
    test triu_f       ... bench:  22,184,991 ns/iter (+/- 4,532,682)
    test triu_f_via_t ... bench:   4,259,493 ns/iter (+/- 218,849)
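For item 1, a rough sketch of what a layout-preserving constructor could look like. The name `zeros_like` is hypothetical and this is not part of ndarray's API; it uses only public methods plus the `num_traits::Zero` bound that `Array::zeros` already requires:

use ndarray::{Array, ArrayBase, Data, Dimension, ShapeBuilder};
use num_traits::Zero;

// Hypothetical helper: allocate a zero-filled array with the same shape and
// memory order as `a`.
pub fn zeros_like<A, S, D>(a: &ArrayBase<S, D>) -> Array<A, D>
where
    A: Clone + Zero,
    S: Data<Elem = A>,
    D: Dimension,
{
    // Public-API layout check: the transposed view of a column-major array
    // is C-contiguous, so treat that case as f-order.
    let is_f = a.ndim() > 1 && !a.is_standard_layout() && a.t().is_standard_layout();
    Array::zeros(a.raw_dim().set_f(is_f))
}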

nilgoyette commented 1 month ago
1. Would there be appetite for `{zeros|ones|full}_like` public functions? That way users don't have to do the rather-convoluted `self.raw_dim().set_f(is_layout_f(&self.dim, &self.strides))` that I'm doing. Happy to make a tracking issue and put in the PRs for that.

Well, I know that I'm interested in those functions, but it may not be what @bluss had in mind. The more methods we have, the harder it is to find them.

As for the benchmarks, it's kinda normal because those larger arrays don't fit in the cache and you're iterating in the wrong order. Seeing those results, I think adding a condition on the memory order is worth it. Some people (like me) often work in f-order :)
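To make the cache argument concrete, a minimal sketch (editorial illustration, not from the thread) using ndarray's public API: the same row-wise loop is contiguous over a C-order array, while over an F-order array consecutive elements of a row are a full column length apart in memory.

use ndarray::{Array2, ShapeBuilder};

fn main() {
    let n = 2048;
    // Same shape and values, opposite memory layouts.
    let c = Array2::<f64>::ones((n, n));     // row-major (C order)
    let f = Array2::<f64>::ones((n, n).f()); // column-major (F order)

    // Identical logical loop; on `c` each row is contiguous, on `f` consecutive
    // elements of a row are `n` elements apart, so large arrays thrash the cache.
    let sum_c: f64 = c.rows().into_iter().map(|row| row.sum()).sum();
    let sum_f: f64 = f.rows().into_iter().map(|row| row.sum()).sum();
    assert_eq!(sum_c, sum_f);
}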

akern40 commented 1 month ago

Well, I know that I'm interested in those functions, but it may not be what @bluss had in mind. The more methods we have, the harder it is to find them.

This is true; I don't want to bloat the API. That said, I think some documentation work (difficult and time-consuming, I don't mean to imply it's trivial) could make the docs easier to navigate and alleviate concerns about adding more functions.

Some people (like me) often work in f-order :)

Me too :) too many column-vector algorithms and math. I was curious how well NumPy does on the transposed version, and it looks like they didn't write one that dispatches on memory order:

In [1]: import numpy as np

In [2]: x = np.ones((2048, 2048))

In [3]: %timeit np.triu(x)
4.34 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: x = np.ones((2048, 2048), order="F")

In [5]: %timeit np.triu(x)
19 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(Fun to see that ndarray is faster than NumPy here, though! Although that extra bit of time could just be Python overhead, rather than raw NumPy performance).

If we want to create a specialization, I think we can do it via the transpose-transpose method I mentioned above, with a slight tweak to avoid the copy: (EDITED)

pub fn triu(&self, k: isize) -> Array<A, D> {
    match self.ndim() > 1 && is_layout_f(&self.dim, &self.strides) {
        // F-order input: triu(A, k) == tril(A^T, -k)^T, so work on the
        // transposed view to keep the inner loop contiguous, then swap back.
        true => {
            let n = self.ndim();
            let mut x = self.view();
            x.swap_axes(n - 2, n - 1);
            let mut tril = x.tril(-k).into_shared();
            tril.swap_axes(n - 2, n - 1);
            tril.into_owned()
        },
        // Everything else (C-order, or ndim <= 1): copy the kept slice of each row.
        false => {
            let mut res = Array::zeros(self.raw_dim());
            Zip::indexed(self.rows())
                .and(res.rows_mut())
                .for_each(|i, src, mut dst| {
                    let row_num = i.into_dimension().last_elem();
                    let lower = max(row_num as isize + k, 0);
                    dst.slice_mut(s![lower..]).assign(&src.slice(s![lower..]));
                });

            res
        }
    }
}

which gets us the following performance: (EDITED)

running 3 tests (64 x 64)
test triu_c       ... bench:       1,709 ns/iter (+/- 58)
test triu_f       ... bench:       1,723 ns/iter (+/- 43)

running 3 tests (2048 x 2048)
test triu_c       ... bench:   2,429,621 ns/iter (+/- 171,843)
test triu_f       ... bench:   2,484,196 ns/iter (+/- 159,421)
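As a quick sanity check of the semantics (a usage sketch assuming the `triu(k)` method above lands on `ArrayBase`): `k = 1` should zero the main diagonal and everything below it, and the f-order fast path should agree with the c-order result.

use ndarray::{array, Array2, ShapeBuilder};

fn main() {
    let c: Array2<f64> = array![[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]];
    // Same values, column-major (f-order) storage.
    let mut f = Array2::<f64>::zeros((3, 3).f());
    f.assign(&c);

    assert_eq!(c.triu(1), array![[0., 2., 3.], [0., 0., 6.], [0., 0., 0.]]);
    // The f-order fast path must match the c-order result.
    assert_eq!(c.triu(1), f.triu(1));
}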

If this gets some thumbs up I'll add it to the PR (and maybe squash at this point to reduce commits?)

Glad I got to explore the `_owned` methods - it seems like it should be possible to do a non-copying `_owned` on mutable views, but that's an improvement for another day / PR.

nilgoyette commented 1 month ago

Thank you for your contribution @akern40