Make ops aware of rvalues: astype/as_strided/copy/full

zcbenz commented 2 months ago

Proposed changes

When compositing transforms lots of temporary of arrays will be created and passed to next primitive, and by making ops accepting args by value we can avoid lots of copies of temporary arrays.

Checklist

Put an x in the boxes that apply.

[x] I have read the CONTRIBUTING document
[x] I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have updated the necessary documentation (if needed)

angeloskath commented 2 months ago

I don't know how I feel about this change. I appreciate the possible efficiency benefits but I am not too sure it is worth having to pollute the code with std::move everywhere.

In the current state of the codebase, when passing an array to an op there is a copy into the inputs vector of the returned array. This would indeed cause a couple of copies that can be avoided when chaining operations. However, making pass by copy the only option results in a significantly worse experience when writing ops imho.

Current API

array x = ones(10, float32);
array a = array(2, float32);
array b = array(1, float32);
x = a * x + b;  // 2 copies when doing a*x and 2 more for + b
x = add(multiply(x, b), c);  // same as above

Proposed API

array x = ones(10, float32);
array a = array(2, float32);
array b = array(1, float32);
x = a * x + b;  // 2 copies when doing a*x and 1 more for + b
x = add(multiply(std::move(x), a), b); // 1 copy for a*x and 1 more for +b

So there is an improvement (maybe I missed one copy of a temporary there I am not sure how good the compiler would be to figure that out) but there needs to be a move on the outside and there need to be moves on the inside of the ops. Moving an and b would reduce more copies but would be even harder to call. Passing temporaries would auto move which is pretty cool though.

zcbenz commented 2 months ago

This change does not mean to reduce the copies from user's side, it means to remove the internal copies inside the mlx library as much as possible.

Using add as example:

array add(const array& a, const array& b, StreamOrDevice s /* = {} */) {
  ...
  inputs = broadcast_arrays(astype(a, out_type, s), astype(b, out_type, s), s);
  ...
  return array(..., std::move(inputs));
}

Calling add(a, b) has 4 copies of arrays at least: 2 caused by astype, and 2 caused by broadcast_arrays. With move semantics there will only be 2 copies left, i.e. 50% less copies.

Using your example add(multiply(x, b), c), with move semantics there is not only 1 copy saved, but 5: one implicit copy from user's side, 2 internal copies in add, and 2 internal copies in multiply.

There are of course lots of ops that don't have internal copies, like negative. But there are also ops that do deep compositions, like logsumexp, and move semantics will cut more than half of internal copies.

I didn't do any analysis of how frequently the ops are used in real worlds, but just by skimming the code, I believe the number of array copies will be at least halved in real code if all ops support move semantics.

The major downside is, we have to use lots of std::move in the implementations, which I don't like neither.

But no std::move does not mean clearer code, you still have to pollute the code with const array& everywhere, which makes code hard to read. And the API is deceiving: users would think there is no copy of argument because the API takes it by const reference, but in reality the arg is copied every time.

On the other hand, I totally get why you don't want this change, I don't like changes transforming coding styles in my projects neither.

I don't tend to force changing this library to use pass-by-value everywhere, I'm just studying this library and found there are places that can be improved and then spent some time tweaking things, which helps me understand the code better.

awni commented 2 months ago

Fwiw array copy time is unfortunately non-trivial. It's really just an increment and decrement on a shared pointer which is unfortunately very slow on macOS (I believe the ref count is an atomic). The move avoids that entirely since you don't need to touch the ref count. Just to know what it buys us, here's a little benchmark I ran on my machine:

  array a({1,2,3});
  auto fun = [&]() {
    for (int i = 0; i < 1000; ++i) {                                           
      array b = a;
      a = b;
    }                                                                          
  };

Versus with a move:

  array a({1,2,3});
  auto fun = [&]() {
    for (int i = 0; i < 1000; ++i) {                                           
      array b = std::move(a);
      a = std::move(b);
    }                                                                          
  };

The timings are:

0.017 ms for the copy vs 0.0000175 ms for the move.

awni commented 2 months ago

Given the above benchmark I'm willing to pay the cost of using std::move inside some ops (as long as it doesn't make using the API slower without moves, which in this case it should not since we need to do the copy anyway).

angeloskath commented 2 months ago

Well that is only 1000x faster. My comment was mostly to avoid having to write the following

auto dt = y.dtype();
add(std::move(x), multiply(std::move(y), array(2.0, dt)));

instead of

add(x, multiply(y, array(2.0, y.dtype()));

Otoh I agree that this can be kept internal a bit so the user would still use the latter and pay 3 copies which would be paid anyway when these were put into the input vectors.

angeloskath commented 2 months ago

Another point is obviously favoring completely functional code (which I do believe it looks kind of nice) and the compiler will take care of moving the temporary for us. For instance

 auto mu = mean(x, /* axis= */ -1, /* keepdims= */ true, s);
 auto mu2 = square(mu, s);
 auto x2 = mean(square(x, s), /* axis= */ -1, /* keepdims= */ true, s);
 auto v = subtract(x2, mu2, s);

would have to be made into

auto mu = mean(x, /* axis= */ -1, /* keepdims= */ true, s);
auto v = subtract(mean(square(x, s), /* axis= */ -1, /* keepdims= */ true, s),
                  square(mu, s), s);

awni commented 2 months ago

I agree that this can be kept internal a bit so the user would still use the latter and pay 3 copies which would be paid anyway when these were put into the input vectors.

Yea exactly. I don't even think we have to be dogmatic about it ourselve, even in a few of the small ops that get used a lot (e.g. broadcast, astype, reshape, transpose) might be enough.

Another point is obviously favoring completely functional code (which I do believe it looks kind of nice) and the compiler will take care of moving the temporary for us. For instance

I like that style too. If we started using the stream context manager we could make it look even nicer with operator overloads.

awni commented 2 months ago

So I tested to see how many copies / moves we do from different variations and its kind of interesting / unexpected.

The following does two copies (not one), one into the initializer list and the next into the vector for the array's inputs.

array op(const array& x) {                                                     
  return array({1}, float32, nullptr, {x});
} 

void test_move() {
  array a(1.0);                                                                
  std::cout << "START " << std::endl;                                          
  auto b = op(a);
  std::cout << "STOP " << std::endl;
}

This one does two moves and one copy (not 0): move into the argument x, move into the initializer list and copy into the vector of inputs:

array op(array x) {
  return array({1}, float32, nullptr, {std::move(x)});
} 

void test_move() {
  array a(1.0);                                                                
  std::cout << "START " << std::endl;                                          
  auto b = op(std::move(a));
  std::cout << "STOP " << std::endl;
}

This version does only two moves, no copies:

array op(array x) {
  std::vector<array> inputs;
  inputs.push_back(std::move(x));
  return array({1}, float32, nullptr, std::move(inputs));
} 

void test_move() {
  array a(1.0);                                                                
  std::cout << "START " << std::endl;                                          
  auto b = op(std::move(a));
  std::cout << "STOP " << std::endl;
}

awni commented 2 months ago

It would be nice if there was a way to make the vector as an rvalue without incurring a copy, but I haven't been able to find it... possibly implementation dependent..

zcbenz commented 2 months ago

It would be nice if there was a way to make the vector as an rvalue without incurring a copy

There is a trick to eliminate the copy but it is not necessarily to be fast, see also: https://burnicki.pl/en/2022/08/14/initializing-with-std-initializer_list-involves-copying.html

There is a bigger problem with std::vector though: it stores things on heap so every time you create or copy a vector you are doing a heap allocation and it is slow, MLX does things like op({array({0})}, {1}) a lot and this is hurting.

One common solution is to use a custom vector type that stores small numbers of elements on stack and only use heap when the number of elements is large, i.e. the vector version of Small String Optimization. Projects like V8 and LLVM use the custom vector, and there are also implementations in absl/folly/boost.

And by using a custom vector type we can make its constructor accept rvalue arguments directly instead of initializer_list, which solves the problem of arrays being copied into vector.

I can give it a try if you are okay with it.

So even though std::vector cripples the benefits of move semantics, I think MLX should still use move semantics when feasible, because it is a solvable problem, and I think you probably want to solve it anyway to get rid of the overhead.

awni commented 2 months ago

Regarding small vector type, indeed it occured to me to have one for shape and strides since we copy those around a lot as well. We are also considering moving to an explicit shape type rather than std::vector<int> so we can have more flexibility. But that is a big change..

I can give it a try if you are okay with it.

For passing arrays around it could be useful. But I think we would actually need to see some benchmarks to understand how much we gain so we can evaluate the pros and cons.

If you are interested in investigating that, the first thing I would do is set you up with some benchmarks so you can measure the time of graph construction overheads. And then see if using the small vector type improves anything.

awni commented 2 months ago

PS that's a nice article (https://burnicki.pl/en/2022/08/14/initializing-with-std-initializer_list-involves-copying.html), thanks for sharing. This doesn't look too bad:

template <typename T, typename... U>
std::vector<std::decay_t<T>> make_vector(T&& arg, U&&... args) {

it is not necessarily to be fast

Why do you say that? It should be faster than what we do now (?) which is both the allocation and copying from the initializer list.

zcbenz commented 2 months ago

Why do you say that? It should be faster than what we do now (?) which is both the allocation and copying from the initializer list.

It turns a constructor call into a series of emplace_back, which adds its own overhead, so I’m not really sure without benchmarking.

awni commented 2 months ago

If you are interested in investigating that, the first thing I would do is set you up with some benchmarks so you can measure the time of graph construction overheads. And then see if using the small vector type improves anything.

Let me know if you'd like me to share some benchmarks.

zcbenz commented 2 months ago

If you are interested in investigating that, the first thing I would do is set you up with some benchmarks so you can measure the time of graph construction overheads. And then see if using the small vector type improves anything.

Let me know if you'd like me to share some benchmarks.

I have an idea that would allow us to convert the codebase to use small vector piece by piece instead of doing it all at once, I will experiment with the idea first. Some performance gain does not worth a change with thousands of lines, but might worth it if we can do it with small steps.

angeloskath commented 2 months ago

No more thoughts :-) . I 'm merging it, we can play it by ear on whether the code gets ugly with a lot of std::move calls everywhere.

ml-explore / mlx

Make ops aware of rvalues: astype/as_strided/copy/full #895

Proposed changes

Checklist