bluss opened this issue 5 years ago
About compile time
Remember: when working on the crate, compile time of the ndarray crate itself can seem "pretty good". It's mostly generic code; we don't generate the machine code here, and we avoid the slowest part of compiling a Rust program — feeding all the code to LLVM! The compile time when all the generic functionality is instantiated in the user's project might be quite different. (This also happens in our tests.)
There are some strategies to keep in mind for codegen compilation time:
But then there's the question of "Rust"-focused compilation time:
Looking at ndarray's -Ztime-passes without incremental compilation, the following items use the most time:
```
time: 0.338; rss: 176MB   coherence checking
time: 8.565; rss: 180MB   wf checking
time: 0.238; rss: 180MB   item-types checking
time: 4.654; rss: 198MB   item-bodies checking
time: 0.489; rss: 198MB   rvalue promotion + match checking
time: 0.044; rss: 198MB   liveness checking + intrinsic checking
time: 0.532; rss: 198MB   misc checking 2
time: 0.000; rss: 198MB   borrow checking
time: 4.346; rss: 229MB   MIR borrow checking
time: 0.952; rss: 245MB   metadata encoding and writing
time: 0.508; rss: 291MB   codegen
time: 0.497; rss: 291MB   LLVM passes
time: 21.672; rss: 258MB  total
```
So it is indeed a mostly "Rust"-bound compilation, with little time spent in code generation. And there's a chance that the Rust-bound compilation passes scale with the number of methods we define; I'd wager that fewer, more generic trait impls would be good for this reason?
(And then apply tips 1 and 2 to decrease compile time in the end-user application.)
I agree on the goals and the outlined pain points/strategy.
A couple of points I'd like to add to the discussion:
ndarray itself is a poor proxy for the actual time it takes to compile a crate that uses ndarray. It would be extremely useful to have an actual benchmark to measure the impact of proposed changes - somewhat similar to what the language team does with crater.
I don't think there is a need to run a full-scale experiment across all the crates out there that depend on ndarray, but it would be quite interesting to pick a selection to use for benchmarking exercises. Ideally, a healthy mix of applications and libraries. A first "in-house" candidate could be the ndarray-examples repository, but it's probably too toy-like to be enough. Can we come up with a list of publicly available crates that we'd like to benchmark against?

Zip and general_mat_mul are great, but they are not the first thing a new user of ndarray will reach for. We can make them easier to find and explain the different tradeoffs, but we can't deny that they increase the API surface you need to master in order to be proficient and write code with the best performance profile ndarray can offer. Would it make sense to start investing time and effort in something similar to einsum in NumPy?
The user-facing "front-end" remains simple and uniform (Einstein notation) and we can properly compile it down to the most optimised routine. @oracleofnj made a start on this with einsum - should we double down on it?
It's probably going to be a lot of work - no point in underestimating it.

Another angle is treating Cell<T> elements as something that's mutable - for example, something you can have on the left-hand side of the += operator.

While this is a lot of factors to consider, we should not fix it all in one pull request! We should understand the landscape, talk about plans, and make gradual fixes.
In #744, I've proposed relaxing all S: DataOwned constraints to S: Data in the arithmetic ops and using .into_owned() to get an owned array to work on. This approach is no more expensive than the current version of ndarray. However, it does have the same "Excessive copying of the whole array" issue @bluss mentioned in the case where the LHS is not uniquely owned. Once we have a solution for that issue for &A @ &A, we can reuse it in the A @ &A impl. (We can add an .is_uniquely_owned() method to Data, which will determine whether to mutate the LHS in place or create a brand-new array without copying the LHS.)
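As a standalone sketch of that branch (my own illustration: Rc<Vec<f64>> stands in for shared array storage, and is_uniquely_owned is approximated by Rc::get_mut, since no such method exists in ndarray today):

```rust
use std::rc::Rc;

// Mutate the LHS in place when its storage has a single owner; otherwise
// fall back to allocating a fresh vector, as `.into_owned()` would.
fn add_assign_or_clone(mut lhs: Rc<Vec<f64>>, rhs: &[f64]) -> Rc<Vec<f64>> {
    match Rc::get_mut(&mut lhs) {
        Some(data) => {
            // Uniquely owned: no copy of the LHS data.
            for (x, y) in data.iter_mut().zip(rhs) {
                *x += *y;
            }
        }
        None => {
            // Shared: build the result in a new allocation.
            lhs = Rc::new(lhs.iter().zip(rhs).map(|(x, y)| x + y).collect());
        }
    }
    lhs
}

fn main() {
    let unique = Rc::new(vec![1.0, 2.0]);
    let shared = Rc::new(vec![1.0, 2.0]);
    let _second_owner = Rc::clone(&shared); // forces the copy path
    assert_eq!(*add_assign_or_clone(unique, &[10.0, 10.0]), vec![11.0, 12.0]);
    assert_eq!(*add_assign_or_clone(shared, &[10.0, 10.0]), vec![11.0, 12.0]);
}
```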
One other comment on this issue -- we can implement co-broadcasting for arbitrary dimension types, not just IxDyn, with something like this:
```rust
pub trait PartialOrdDim<Rhs: Dimension>: Dimension {
    type Max: Dimension;
    type Min: Dimension;
    // possibly other useful stuff
}

impl PartialOrdDim<Ix3> for Ix2 {
    type Max = Ix3;
    type Min = Ix2;
}

impl PartialOrdDim<IxDyn> for Ix2 {
    type Max = IxDyn;
    type Min = IxDyn;
}

// ...

// Inside the macro that generates the binary-op impls ($trt/$mth are the
// operator trait and its method, e.g. Add/add):
impl<'a, A, B, S, S2, D, E> $trt<&'a ArrayBase<S2, E>> for &'a ArrayBase<S, D>
where
    A: Clone + $trt<B>,
    B: Clone,
    S: Data<Elem = A>,
    S2: Data<Elem = B>,
    D: Dimension + PartialOrdDim<E>,
    E: Dimension,
{
    type Output = Array<<A as $trt<B>>::Output, <D as PartialOrdDim<E>>::Max>;
    fn $mth(self, rhs: &'a ArrayBase<S2, E>) -> Self::Output {
        // ...
    }
}
```
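Under this scheme, following the impls sketched above, an Ix2 @ Ix3 operation would produce an Array with dimension Ix3, and any operation involving an IxDyn operand would produce IxDyn.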
FWIW, I'm just getting started with ndarray and basically didn't figure out how to do AddAssign with two 1-dimensional f32 arrays:
```
error[E0271]: type mismatch resolving `<ViewRepr<&mut f32> as RawData>::Elem == ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ expected `f32`, found struct `ArrayBase`
   |
   = note: expected type `f32`
             found struct `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0277]: the trait bound `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>: ScalarOperand` is not satisfied
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ the trait `ScalarOperand` is not implemented for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   |
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0271]: type mismatch resolving `<ViewRepr<&f32> as RawData>::Elem == ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ expected `f32`, found struct `ArrayBase`
   |
   = note: expected type `f32`
             found struct `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign` for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`

error[E0277]: the trait bound `ViewRepr<&f32>: DataMut` is not satisfied
  --> tools/src/bin/build-index.rs:58:26
   |
58 |     sentence += ArrayView1::from(&state.buf);
   |              ^^ the trait `DataMut` is not implemented for `ViewRepr<&f32>`
   |
   = help: the following implementations were found:
             <ViewRepr<&'a mut A> as DataMut>
   = note: required because of the requirements on the impl of `AddAssign` for `ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>`
   = note: required because of the requirements on the impl of `AddAssign<ArrayBase<ViewRepr<&f32>, Dim<[usize; 1]>>>` for `ArrayBase<ViewRepr<&mut f32>, Dim<[usize; 1]>>`
```
Not sure if that's related to this issue, but I looked for examples doing this kind of thing and went through a bunch of the documentation on ArrayBase, but failed to figure it out for now (and fell back to simple loops).
@djc You might need to take a reference to the ArrayView:

```rust
sentence += &ArrayView1::from(&state.buf);
```

The docs cover this in the designated section, with examples: https://docs.rs/ndarray/0.14.0/ndarray/struct.ArrayBase.html#arithmetic-operations
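A self-contained version of that fix (illustrative data; assuming f32 elements, matching the error messages above):

```rust
use ndarray::{array, ArrayView1};

fn main() {
    let mut sentence = array![1.0_f32, 2.0, 3.0];
    let buf = [10.0_f32, 20.0, 30.0];
    // Adding `&` makes this hit the `AddAssign<&ArrayBase<...>>` impl;
    // without it, the compiler looks for a scalar right-hand side.
    sentence += &ArrayView1::from(&buf[..]);
    assert_eq!(sentence, array![11.0_f32, 22.0, 33.0]);
}
```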
(This is a collaborative issue, please edit and add points, if applicable, or join the discussion in the issue below)
For context, please read the ArrayBase documentation on Arithmetic Operations first.
Goals
Non-Goals
Parallelization and multithreading are not in scope for this issue.
Prioritization
Known Problems in Current Implementation
- Some impls use self.to_owned().add(rhs), while the operation could be implemented without copying the first operand's data.
- Element-level cloning, elt1.clone() + elt2.clone(), in some places.

Expanding the Number of Implementations
Zip allows the user many ways to write in-place operations themselves, and in this case, there's general_mat_mul, which can perform the operation A += X × Y in place; a short sketch of both follows.
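For concreteness, a sketch of both escape hatches (assuming f64 arrays and a recent ndarray; in older versions Zip::for_each was called apply):

```rust
use ndarray::linalg::general_mat_mul;
use ndarray::{Array2, Zip};

// In-place elementwise add with Zip: a += b, no temporary allocation.
fn add_in_place(a: &mut Array2<f64>, b: &Array2<f64>) {
    Zip::from(a).and(b).for_each(|a, &b| *a += b);
}

// A += X × Y in place: general_mat_mul computes A = alpha·X·Y + beta·A.
fn gemm_accumulate(a: &mut Array2<f64>, x: &Array2<f64>, y: &Array2<f64>) {
    general_mat_mul(1.0, x, y, 1.0, a);
}
```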
Which solution is better for compile time?

A. Make impl blocks more generic (admitting more array kinds per impl, for example admitting both &A and &mut A)
B. Expand the number of impls to cover all cases (for example, one for each combination of &A/&mut A)
Consider both plain ndarray "cargo build" compile time, and compile time when ndarray is used in a project and compiles to non-generic code.
Co-broadcasting for Dynamic dimensionality
For static dimensionality array operations we use right-hand-side broadcasting: in A @ B, we can attempt to broadcast B to the shape of A.
For dynamic dimensionality operations, we can improve this to co-broadcasting, so that A @ B can result in an array with a shape that is neither that of A nor B.
Note: Co-broadcasting only ever expands the number of arrays that are compatible in operations, it does not change the result of operations that are already permitted by the right hand side broadcasting rule.
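To make the rule concrete, here is a minimal sketch of the co-broadcast shape computation itself (a hypothetical helper of my own, NumPy-style; the trait machinery sketched earlier would encode this at the type level, this just shows the runtime rule):

```rust
/// Align shapes at the trailing axes; each axis pair must be equal, or one
/// of the two lengths must be 1 (in which case the other wins).
fn co_broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let (long, short) = if a.len() >= b.len() { (a, b) } else { (b, a) };
    let offset = long.len() - short.len();
    let mut out = long.to_vec();
    for (i, &s) in short.iter().enumerate() {
        out[offset + i] = match (long[offset + i], s) {
            (l, s) if l == s => l,
            (1, s) => s,
            (l, 1) => l,
            _ => return None, // incompatible axis lengths
        };
    }
    Some(out)
}

fn main() {
    // Neither operand's shape wins outright: [3, 1] @ [1, 4] -> [3, 4].
    assert_eq!(co_broadcast_shape(&[3, 1], &[1, 4]), Some(vec![3, 4]));
    assert_eq!(co_broadcast_shape(&[2, 3], &[4, 3]), None);
}
```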
Related issues: