bluss opened this issue 5 years ago
About compile time
Remember: when working on the crate, compile time of the ndarray crate itself can seem "pretty good". It's mostly generic code; we don't generate the machine code here, and we avoid the slowest part of compiling a Rust program — feeding all the code to LLVM! The compile time when all the generic functionality is instantiated in the user's project might be quite different. (This also happens in our tests.)
There are some strategies to keep in mind for codegen compilation time:
But then there's the question of "Rust"-focused compilation time:
Looking at ndarray's -Ztime-passes without incremental compilation, the following items use the most time:
```
time: 0.338; rss: 176MB   coherence checking
time: 8.565; rss: 180MB   wf checking
time: 0.238; rss: 180MB   item-types checking
time: 4.654; rss: 198MB   item-bodies checking
time: 0.489; rss: 198MB   rvalue promotion + match checking
time: 0.044; rss: 198MB   liveness checking + intrinsic checking
time: 0.532; rss: 198MB   misc checking 2
time: 0.000; rss: 198MB   borrow checking
time: 4.346; rss: 229MB   MIR borrow checking
time: 0.952; rss: 245MB   metadata encoding and writing
time: 0.508; rss: 291MB   codegen
time: 0.497; rss: 291MB   LLVM passes
time: 21.672; rss: 258MB  total
```
So it is indeed a mostly "Rust"-bound compilation, with little time spent in code generation. And there's a chance that the Rust-bound compilation passes scale with the number of methods we define; I'd wager that fewer, more generic trait impls would be good for this reason?
(And then apply tips 1 and 2 to decrease compile time in the end-user application.)
I agree on the goals and the outlined pain points/strategy.
A couple of points I'd like to add to the discussion:
ndarray itself is a poor proxy for the actual time it takes to compile a crate that uses ndarray. It would be extremely useful to have an actual benchmark to measure the impact of proposed changes - somewhat similar to what the language team does with crater.
I don't think there is a need to run a full-scale experiment across all the crates out there that depend on ndarray, but it would be quite interesting to pick a selection to use for benchmarking exercises. Ideally, a healthy mix of applications and libraries. A first "in-house" candidate could be the ndarray-examples repository, but it's probably too toy-like to be enough. Can we come up with a list of publicly available crates that we'd like to benchmark against?

Zip and general_mat_mul are great, but they are not the first thing a new user of ndarray will reach for. We can make them easier to find and explain the different tradeoffs, but we can't deny that they increase the API surface you need to master in order to be proficient and write code with the best performance profile ndarray can offer. Would it make sense to start investing time and effort in something similar to einsum in NumPy?
The user-facing "front-end" remains simple and uniform (Einstein notation) and we can properly compile it down to the most optimised routine. @oracleofnj made a start on this with einsum - should we double down on it?
It's probably going to be a lot of work - no point in underestimating it.

Another angle is treating Cell<T> elements as something that's mutable - for example, something you can have on the left-hand side of the += operator.

While this is a lot of factors to consider, we should not fix it all in one pull request! We should understand the landscape, talk about plans, and make gradual fixes.
In #744, I've proposed relaxing all S: DataOwned constraints to S: Data in the arithmetic ops and using .into_owned() to get an owned array to work on. This approach is no more expensive than the current version of ndarray. However, it does have the same "Excessive copying of the whole array" issue @bluss mentioned in the case where the LHS is not uniquely owned. Once we have a solution for that issue for &A @ &A, we can reuse it in the A @ &A impl. (We can add an .is_uniquely_owned() method to Data, which will determine whether to mutate the LHS in place or create a brand-new array without copying the LHS.)
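As a standalone sketch of that branch (my own illustration: Rc<Vec<f64>> stands in for shared array storage, and is_uniquely_owned is approximated by Rc::get_mut, since no such method exists in ndarray today):

```rust
use std::rc::Rc;

// Mutate the LHS in place when its storage has a single owner; otherwise
// fall back to allocating a fresh vector, as `.into_owned()` would.
fn add_assign_or_clone(mut lhs: Rc<Vec<f64>>, rhs: &[f64]) -> Rc<Vec<f64>> {
    match Rc::get_mut(&mut lhs) {
        Some(data) => {
            // Uniquely owned: no copy of the LHS data.
            for (x, y) in data.iter_mut().zip(rhs) {
                *x += *y;
            }
        }
        None => {
            // Shared: build the result in a new allocation.
            lhs = Rc::new(lhs.iter().zip(rhs).map(|(x, y)| x + y).collect());
        }
    }
    lhs
}

fn main() {
    let unique = Rc::new(vec![1.0, 2.0]);
    let shared = Rc::new(vec![1.0, 2.0]);
    let _second_owner = Rc::clone(&shared); // forces the copy path
    assert_eq!(*add_assign_or_clone(unique, &[10.0, 10.0]), vec![11.0, 12.0]);
    assert_eq!(*add_assign_or_clone(shared, &[10.0, 10.0]), vec![11.0, 12.0]);
}
```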
One other comment on this issue -- we can implement co-broadcasting for arbitrary dimension types, not just IxDyn, with something like this:
```rust
pub trait PartialOrdDim<Rhs: Dimension>: Dimension {
    type Max: Dimension;
    type Min: Dimension;
    // possibly other useful stuff
}

impl PartialOrdDim<Ix3> for Ix2 {
    type Max = Ix3;
    type Min = Ix2;
}

impl PartialOrdDim<IxDyn> for Ix2 {
    type Max = IxDyn;
    type Min = IxDyn;
}

// ...

// Inside the macro that generates the binary-op impls ($trt/$mth are the
// operator trait and its method, e.g. Add/add):
impl<'a, A, B, S, S2, D, E> $trt<&'a ArrayBase<S2, E>> for &'a ArrayBase<S, D>
where
    A: Clone + $trt<B>,
    B: Clone,
    S: Data<Elem = A>,
    S2: Data<Elem = B>,
    D: Dimension + PartialOrdDim<E>,
    E: Dimension,
{
    type Output = Array<<A as $trt<B>>::Output, <D as PartialOrdDim<E>>::Max>;
    fn $mth(self, rhs: &'a ArrayBase<S2, E>) -> Self::Output {
        // ...
    }
}
```
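Under this scheme, following the impls sketched above, an Ix2 @ Ix3 operation would produce an Array with dimension Ix3, and any operation involving an IxDyn operand would produce IxDyn.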
FWIW, I'm just getting started with ndarray and basically didn't figure out how to do AddAssign with two 1-dimensional f32 arrays:
```
error[E0271]: type mismatch resolving `<ViewRepr<&mut f32> as RawData>::Elem == ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ expected `f32`, found struct `ArrayBase`
   |
   = note: expected type `f32`
             found struct `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0277]: the trait bound `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>: ScalarOperand` is not satisfied
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ the trait `ScalarOperand` is not implemented for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   |
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0271]: type mismatch resolving `<ViewRepr<&f32> as RawData>::Elem == ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ expected `f32`, found struct `ArrayBase`
   |
   = note: expected type `f32`
             found struct `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign` for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0277]: the trait bound `ViewRepr<&f32>: DataMut` is not satisfied
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ the trait `DataMut` is not implemented for `ViewRepr<&f32>`
   |
   = help: the following implementations were found:
             <ViewRepr<&'a mut A> as DataMut>
   = note: required because of the requirements on the impl of `AddAssign` for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`
```
Not sure if that's related to this issue, but I looked for examples doing this kind of thing and went through a bunch of the documentation on ArrayBase, but failed to figure it out for now (and fell back to simple loops).
@djc You might need to take a reference to the ArrayView:

```rust
sentence += &ArrayView1::from(&state.buf);
```

The docs cover this in the designated section, with examples: https://docs.rs/ndarray/0.14.0/ndarray/struct.ArrayBase.html#arithmetic-operations
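A self-contained version of that fix (illustrative data; assuming f32 elements, matching the error messages above):

```rust
use ndarray::{array, ArrayView1};

fn main() {
    let mut sentence = array![1.0_f32, 2.0, 3.0];
    let buf = [10.0_f32, 20.0, 30.0];
    // Adding `&` makes this hit the `AddAssign<&ArrayBase<...>>` impl;
    // without it, the compiler looks for a scalar right-hand side.
    sentence += &ArrayView1::from(&buf[..]);
    assert_eq!(sentence, array![11.0_f32, 22.0, 33.0]);
}
```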
(This is a collaborative issue, please edit and add points, if applicable, or join the discussion in the issue below)
For context, please read the ArrayBase documentation on Arithmetic Operations first.
Goals
Non-Goals
Parallelization and multithreading are not in scope for this issue.
Prioritization
Known Problems in Current Implementation
- Some impls use self.to_owned().add(rhs), while the operation could be implemented without copying the first operand's data.
- Element-level cloning, elt1.clone() + elt2.clone(), in some places.

Expanding the Number of Implementations
Zip allows the user many ways to write in-place operations themselves, and in this case, there's general_mat_mul, which can perform the operation A += X × Y in place; a short sketch of both follows.
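For concreteness, a sketch of both escape hatches (assuming f64 arrays and a recent ndarray; in older versions Zip::for_each was called apply):

```rust
use ndarray::linalg::general_mat_mul;
use ndarray::{Array2, Zip};

// In-place elementwise add with Zip: a += b, no temporary allocation.
fn add_in_place(a: &mut Array2<f64>, b: &Array2<f64>) {
    Zip::from(a).and(b).for_each(|a, &b| *a += b);
}

// A += X × Y in place: general_mat_mul computes A = alpha·X·Y + beta·A.
fn gemm_accumulate(a: &mut Array2<f64>, x: &Array2<f64>, y: &Array2<f64>) {
    general_mat_mul(1.0, x, y, 1.0, a);
}
```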
Which solution is better for compile time?

A. Make impl blocks more generic (admitting more array kinds per impl, for example admitting both &A and &mut A)
B. Expand the number of impls to cover all cases (for example, one for each combination of &A/&mut A)
Consider both plain ndarray "cargo build" compile time, and compile time when ndarray is used in a project and compiles to non-generic code.
Co-broadcasting for Dynamic dimensionality
For static dimensionality array operations we use right-hand-side broadcasting: in A @ B, we can attempt to broadcast B to the shape of A.
For dynamic dimensionality operations, we can improve this to co-broadcasting, so that A @ B can result in an array with a shape that is neither that of A nor B.
Note: Co-broadcasting only ever expands the number of arrays that are compatible in operations, it does not change the result of operations that are already permitted by the right hand side broadcasting rule.
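To make the rule concrete, here is a minimal sketch of the co-broadcast shape computation itself (a hypothetical helper of my own, NumPy-style; the trait machinery sketched earlier would encode this at the type level, this just shows the runtime rule):

```rust
/// Align shapes at the trailing axes; each axis pair must be equal, or one
/// of the two lengths must be 1 (in which case the other wins).
fn co_broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let (long, short) = if a.len() >= b.len() { (a, b) } else { (b, a) };
    let offset = long.len() - short.len();
    let mut out = long.to_vec();
    for (i, &s) in short.iter().enumerate() {
        out[offset + i] = match (long[offset + i], s) {
            (l, s) if l == s => l,
            (1, s) => s,
            (l, 1) => l,
            _ => return None, // incompatible axis lengths
        };
    }
    Some(out)
}

fn main() {
    // Neither operand's shape wins outright: [3, 1] @ [1, 4] -> [3, 4].
    assert_eq!(co_broadcast_shape(&[3, 1], &[1, 4]), Some(vec![3, 4]));
    assert_eq!(co_broadcast_shape(&[2, 3], &[4, 3]), None);
}
```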
Related issues: