Closed · akern40 closed this 1 month ago
Hold off on any merge for a moment; `tril` panics when given a 0D array, working on a fix.

Ok, fixed.
Not to move conversations, but two thoughts on the (now resolved) discussion about order of the input array:

1. Would there be appetite for `{zeros|ones|full}_like` public functions? That way users don't have to do the rather-convoluted `self.raw_dim().set_f(is_layout_f(&self.dim, &self.strides))` that I'm doing. Happy to make a tracking issue and put in the PRs for that.
2. I checked `tri[ul]`'s performance on c-order vs f-order, and it's not great; results are below. The slowdown is quite apparent at larger array sizes, as expected. The `via_t` lines refer to the quick-and-dirty workaround I could think of, which is `x_tril = x.t().triu(0).t().to_owned()`.

```
running 3 tests (64 x 64)
test triu_c       ... bench:      1,681 ns/iter (+/- 45)
test triu_f       ... bench:      2,011 ns/iter (+/- 32)
test triu_f_via_t ... bench:      2,133 ns/iter (+/- 29)

running 3 tests (2048 x 2048)
test triu_c       ... bench:  2,418,649 ns/iter (+/- 320,729)
test triu_f       ... bench: 22,184,991 ns/iter (+/- 4,532,682)
test triu_f_via_t ... bench:  4,259,493 ns/iter (+/- 218,849)
```
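The identity behind the `via_t` workaround is that the upper triangle of an array is the transpose of the lower triangle of its transpose, with `k` negated. A minimal pure-Python sketch on nested lists (the `transpose`/`triu`/`tril` helpers here are toy functions invented for illustration, not ndarray's API):

```python
# Toy model of the identity: tril(X, k) == transpose(triu(transpose(X), -k)).
# These helpers operate on plain nested lists, not ndarray arrays.

def transpose(x):
    return [list(row) for row in zip(*x)]

def triu(x, k=0):
    # Keep element (i, j) iff j >= i + k; zero it otherwise.
    return [[v if j >= i + k else 0 for j, v in enumerate(row)]
            for i, row in enumerate(x)]

def tril(x, k=0):
    # Keep element (i, j) iff j <= i + k.
    return [[v if j <= i + k else 0 for j, v in enumerate(row)]
            for i, row in enumerate(x)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert tril(x, 0) == transpose(triu(transpose(x), 0))
assert triu(x, 1) == transpose(tril(transpose(x), -1))
```

The extra `to_owned()` in the workaround is what makes it slow: it materializes a copy after each transpose round-trip.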
> 1. Would there be appetite for `{zeros|ones|full}_like` public functions? That way users don't have to do the rather-convoluted `self.raw_dim().set_f(is_layout_f(&self.dim, &self.strides))` that I'm doing. Happy to make a tracking issue and put in the PRs for that.
Well, I know that I'm interested in those functions, but it may not be what @bluss had in mind. The more methods we have, the harder it is to find them.
As for the benchmarks, it's kinda normal because those larger arrays don't fit in the cache and you're iterating in the wrong order. Seeing those results, I think adding a condition on the memory order is worth it. Some people (like me) often work in f-order :)
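The iteration-order point is easy to see from the flat-offset formulas for the two layouts (a small sketch; the helper names are made up for illustration): in c-order, element (i, j) of an m-by-n array lives at offset `i*n + j`, so walking along a row touches adjacent memory, while in f-order it lives at `j*m + i`, so row-wise iteration jumps m elements per step and thrashes the cache once the array outgrows it.

```python
# Toy illustration (invented helpers, not ndarray API): flat offset of
# element (i, j) in an m-by-n array under the two memory orders.

def c_offset(i, j, m, n):
    # Row-major (c-order): rows are contiguous in memory.
    return i * n + j

def f_offset(i, j, m, n):
    # Column-major (f-order): columns are contiguous in memory.
    return j * m + i

m, n = 4, 3
# Walking along row 0 in c-order touches consecutive offsets...
row_c = [c_offset(0, j, m, n) for j in range(n)]   # [0, 1, 2]
# ...but in f-order each step jumps m elements.
row_f = [f_offset(0, j, m, n) for j in range(n)]   # [0, 4, 8]
```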
> Well, I know that I'm interested in those functions, but it may not be what @bluss had in mind. The more methods we have, the harder it is to find them.
This is true; I don't want to bloat the API. That said, I think some documentation work (difficult and time-consuming, I don't mean to imply it's trivial) could make the docs easier to navigate and alleviate concerns about more functions.
> Some people (like me) often work in f-order :)
Me too :) too many column-vector algorithms / math. I was interested in seeing how well NumPy does on the transposed version, and I guess they didn't write one that dispatches on order:
```
In [1]: import numpy as np

In [2]: x = np.ones((2048, 2048))

In [3]: %timeit np.triu(x)
4.34 ms ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: x = np.ones((2048, 2048), order="F")

In [5]: %timeit np.triu(x)
19 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
(Fun to see that `ndarray` is faster than NumPy here, though! Although that extra bit of time could just be Python overhead rather than raw NumPy performance.)
If we want to create a specialization, I think we can do it via the transpose-transpose method I mentioned above, with a slight tweak to avoid the copy: (EDITED)
```rust
pub fn triu(&self, k: isize) -> Array<A, D> {
    match self.ndim() > 1 && is_layout_f(&self.dim, &self.strides) {
        true => {
            // f-order path: use triu(x, k) == tril(x.t(), -k).t().
            // Swap the last two axes, take the (now cache-friendly) tril,
            // then swap back before materializing the result.
            let n = self.ndim();
            let mut x = self.view();
            x.swap_axes(n - 2, n - 1);
            let mut tril = x.tril(-k).into_shared();
            tril.swap_axes(n - 2, n - 1);
            tril.into_owned()
        }
        false => {
            // c-order (and low-dimensional) path: zero-fill, then copy the
            // kept part of each row, starting at column max(row + k, 0).
            let mut res = Array::zeros(self.raw_dim());
            Zip::indexed(self.rows())
                .and(res.rows_mut())
                .for_each(|i, src, mut dst| {
                    let row_num = i.into_dimension().last_elem();
                    let lower = max(row_num as isize + k, 0);
                    dst.slice_mut(s![lower..]).assign(&src.slice(s![lower..]));
                });
            res
        }
    }
}
```
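As a sanity check on the logic, here is a pure-Python model of the two branches on nested lists (all helpers here are invented for illustration and are not ndarray API): the per-row slicing of the c-order path, and the swap / `tril(-k)` / swap of the f-order path. The two should agree for every `k`.

```python
# Toy model of the two triu branches: direct per-row slicing vs.
# transpose -> tril(-k) -> transpose. Plain nested lists, not ndarray.

def triu_rows(x, k=0):
    # c-order branch: in row i, zero every column below max(i + k, 0).
    out = [[0] * len(row) for row in x]
    for i, row in enumerate(x):
        lower = max(i + k, 0)
        out[i][lower:] = row[lower:]
    return out

def tril_rows(x, k=0):
    # Lower-triangle analogue: keep columns j <= i + k.
    out = [[0] * len(row) for row in x]
    for i, row in enumerate(x):
        upper = min(i + k + 1, len(row))
        if upper > 0:
            out[i][:upper] = row[:upper]
    return out

def transpose(x):
    return [list(col) for col in zip(*x)]

def triu_via_t(x, k=0):
    # f-order branch: triu(x, k) == tril(x.T, -k).T
    return transpose(tril_rows(transpose(x), -k))

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for k in range(-3, 4):
    assert triu_rows(x, k) == triu_via_t(x, k)
```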
which gets us the following performance: (EDITED)

```
running 3 tests (64 x 64)
test triu_c ... bench:     1,709 ns/iter (+/- 58)
test triu_f ... bench:     1,723 ns/iter (+/- 43)

running 3 tests (2048 x 2048)
test triu_c ... bench: 2,429,621 ns/iter (+/- 171,843)
test triu_f ... bench: 2,484,196 ns/iter (+/- 159,421)
```
If this gets some thumbs up I'll add it to the PR (and maybe squash at this point to reduce commits?). Glad I got to explore the `_owned` methods - seems like it should be possible to do non-copy `_owned` on mutable views, but that's an improvement for another day / PR.
Thank you for your contribution @akern40!

The implementation seems to work, but I'm unsure whether this should be a free function, a method bound directly to `ArrayBase`, or a trait. Closes #1323.