iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Move DecomposeSoftmax to GlobalOptimization. #17469

Open bjacob opened 1 month ago

bjacob commented 1 month ago

Currently, DecomposeSoftmax is under Codegen/. Here's a link to it, specifically to a helpful comment explaining what it rewrites softmax to: https://github.com/iree-org/iree/blob/1316c92341ec5aa320f44ad4c8be9771b20e61e7/compiler/src/iree/compiler/Codegen/Common/DecomposeSoftmax.cpp#L19-L36

Problems with that:

The main action to be taken here is to move DecomposeSoftmax to the appropriate point in Flow, before dispatch region formation.

Then, tweak the dispatch fusion logic to achieve the following goal: no buffers larger than some small fixed size should be passed on the stack, so whenever a potentially large buffer is passed between steps in the above-linked comment, that should be a dispatch region boundary.

In the 1-dimensional case (which is the most common case), that means that steps 1. and 2. can go in one dispatch, and then steps 3. and 4. can go in another. The buffer produced by step 2 can't go on the stack.

In the higher-dimensional case, potentially every step might have to be its own dispatch, but I suppose that a common sub-case of that might be when some inner dimensions are static and small, in which case it boils down to the 1-dimensional case.

Finally, perhaps as a follow-up unblocked by this groundwork, ensure that suitable producers are fused into the first dispatch (i.e. the one typically containing steps 1 and 2 from the description).
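
To make that concrete, here is a rough NumPy sketch (illustration only, not how IREE actually lowers softmax) of the four steps from the linked comment, annotated with the dispatch split proposed above for the 1-D case:

```python
import numpy as np

def decomposed_softmax(x, d=-1):
    # --- dispatch A: steps 1 and 2 ---
    m = np.max(x, axis=d, keepdims=True)  # 1. reduce: max along d
    z = np.exp(x - m)                     # 2. elementwise: exp(x - m)
    # z is the potentially large buffer crossing the dispatch boundary;
    # per the proposal above, it must not end up on the stack.
    # --- dispatch B: steps 3 and 4 ---
    l = np.sum(z, axis=d, keepdims=True)  # 3. reduce: sum along d
    return z / l                          # 4. elementwise: divide by l
```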

qedawkins commented 1 month ago

On GPU we always rematerialize the result of step 2 with both steps 2 and 3 to avoid a large shared memory allocation. Is this trying to avoid that rematerialization?

benvanik commented 1 month ago

this doesn't belong in flow either - global opt maybe

bjacob commented 1 month ago

Is this trying to avoid that rematerialization?

Trying to avoid all stack buffers (that are not tiny static sized).

qedawkins commented 1 month ago

+1, global opt would be the right place if we did want to do that decomposition earlier on, although I think the main question here is whether we want decomposition before or after dispatch formation. I intuitively agree that decomposing earlier makes sense given that codegen doesn't do anything special with it. IIUC there were raising patterns added to force the fusion that Flow should be doing by default. The other question is what kind of fusion codegen should do. On GPU, I think any pattern like this should almost always become one kernel:

%0 = reduction tensor<?> -> tensor<1>
%1 = broadcast (+ elemwise) %0 : tensor<1> -> tensor<?>
%2 = reduction %1 tensor<?> -> tensor<1>

Because the cost of the additional kernel often outweighs the cost of computing the elementwise part twice.

bjacob commented 1 month ago

IIUC there were raising patterns added to force the fusion that Flow should be doing by default.

That's what I'm questioning here: I don't think we should ever have a stack buffer that isn't really tiny. We are going to have users whose entire thread stack is 64k, and other users who (rightly) see any use of the stack to hold buffers in a computation as a security red flag, so we should structure how we think about these things around that basic requirement -- if any tensor is potentially not-tiny (even after vectorization), then it can't be an intermediate within a dispatch, because that would become a stack buffer.

Currently we have in https://github.com/iree-org/iree/issues/17465#issue-2309327946 a 16384x16384xf32 softmax that gets tiled to 16x16384xf32, so it is using a 1 MB stack buffer (16 × 16384 × 4 bytes) even after tiling - that's ~100x too large.

qedawkins commented 1 month ago

For the case you linked, it should have rematerialized step 2 and never formed such a stack buffer. That's what we do on GPU with this pass: https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/RematerializeParallelOps.cpp

Independent of that, I think moving the decomposition to global opt makes sense because we shouldn't be relying on a special raising, but that shouldn't change the dispatches we form.

benvanik commented 1 month ago

Question as to whether this is an artifact of the problem or of our tiling - if it's an issue with our tiling choices that feels like the thing to focus on instead of decomposing the problem to work around them?

that's what we do on GPU with this pass

Good point. It's extremely bad design if we are making dispatch region formation decisions based on targets - that to me has a big smell of issues in codegen and not something we should be changing the whole program for. If this can run on GPUs without big stacks it should be able to run on CPUs without big stacks.

We really shouldn't have passes like this at all, and preprocessing may be an even better place as that's where we stick the optional "these are workarounds for bad codegen" passes that we want to be able to remove in the future.

bjacob commented 1 month ago

Oh... the whole concept of rematerialization is news to me here. Is this a mechanism whereby a local intermediate tensor inside a dispatch function, which would otherwise become a stack buffer, gets instead moved to some other storage class? I tried getting an answer from the test, but I'm missing which part of the // CHECK lines to look at: https://github.com/iree-org/iree/blob/f7ca45d5d58898a2bf9b27c4c7ac09b3b2b48f8f/compiler/src/iree/compiler/Codegen/Common/test/rematerialize_parallel_ops.mlir#L106

benvanik commented 1 month ago

No, that's just within the dispatches - it does not create new dispatches or global tensors.

bjacob commented 1 month ago

No, that's just within the dispatches - it does not create new dispatches or global tensors.

Ah... ok but then I need help understanding @qedawkins above:

For the case you linked, it should have rematerialized step 2 and never formed such a stack buffer.

How does rematerialization avoid a stack buffer?

qedawkins commented 1 month ago

Good point. It's extremely bad design if we are making dispatch region formation decisions based on targets

Rematerialization does appear to be the place where this becomes very unclear. If exp is abhorrently slow on a particular hardware architecture, that might be a fundamental reason not to do the rematerialization for that architecture, in which case what Benoit is suggesting (forming multiple dispatches) really might be the faster option. Luckily I don't see a reason to believe this is the case on CPU yet.

How does rematerialization avoid a stack buffer?

It changes the computation to this

 /// 1. Compute the max of x along dimension d. This results
 ///    in a N-1 dimensional tensor m.
 ///    m = max(x, dim = d)
 ///
 /// 2. Compute the sum of exp(x - m) along dimension d. This results in
 ///    a N-1 dimensional tensor l.
 ///    l = sum(exp(x - m), dim = d)
 ///
 /// 3. Divide exp(x - m) by l. This gives the N-dimensional softmax.
 ///    softmax = exp(x - m) / l
so it does the exp twice. We still have a small stack allocation (or just keep it in register) for the result of step 1, but that's a reduced result and should be fine.
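
In NumPy terms (a minimal sketch for illustration, not the actual codegen), the rematerialized form looks like this, with exp(x - m) computed twice:

```python
import numpy as np

def rematerialized_softmax(x, d=-1):
    m = np.max(x, axis=d, keepdims=True)              # 1. reduce: max
    l = np.sum(np.exp(x - m), axis=d, keepdims=True)  # 2. reduce: sum of exp (first exp)
    return np.exp(x - m) / l                          # 3. divide, exp recomputed (second exp)
```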

bjacob commented 1 month ago

Luckily I don't see a reason to believe this is the case on CPU yet.

On the contrary, it's very clear on CPU that the exp computation is essentially all the cost of softmax, and doing it twice makes softmax 2x slower.

so it does the exp twice.

Ooh, OK, that was outside of the realm of possibilities that I considered, coming from a CPU perspective where exp's arithmetic cost is of concern.

IIUC, the basic difference between GPU and CPU is that GPU can afford to compute exp twice and CPU can't, and that explains how GPU avoids a stack buffer while CPU ends up requiring one?

To summarize:

qedawkins commented 1 month ago

Yes, on GPU we're shared-memory limited, so not rematerializing is really not an option. It's only because stack allocation size is "in theory" limited only by system memory that CPU can even generate code that doesn't rematerialize, but IMO rematerialization should be a requirement for CPU codegen also. The bigger question is whether we want to form one dispatch. On GPU, wiring up all that extra memory and the extra kernel just isn't worth it: in benchmarks the fused kernel is almost the same cost as each unfused one, so it's the inverse there, where fusion offers almost a 2x benefit.

bjacob commented 1 month ago

Right. You lay out an alternative here between two possibilities:

  1. Rematerialization, as currently done on GPU, which is unpalatable on CPU due to the 2x exp cost.
  2. Stack buffer, as currently done on CPU, which is unpalatable outside of an early prototype because the stack is in fact more restricted than just "it's memory", both in size and in potential liability in case of buffer overflow.

The purpose of the present issue was to propose a third way to have the exps stored in a buffer but have that buffer avoid the stack. I proposed splitting the softmax into 2 dispatches to achieve that, but that's only a means towards that end.

benvanik commented 1 month ago

For my edification, why is exp so slow? Are we using libm and scalarizing the code? Global barriers and the extra memory traffic/cache misses feel like they'd need quite a large number of cycles to outweigh?

I count about ~20-30 add/mul/shift instructions in https://github.com/vectorclass/version2/blob/master/vectormath_exp.h#L141 - if we're perfectly hitting the cache and not saturating memory bandwidth otherwise and perfectly scheduling everything maybe it's fine to hit memory and schedule the dispatch twice on random cores?

bjacob commented 1 month ago

Yup, ~30 was the guesstimate I was about to say before you edited :-) So we agree about how much (or how little, depending on how you view it) work that is. What I'm saying is that it is already enough to dominate softmax, because there isn't much else in softmax, and (I guess) contrary to GPU, just re-traversing a buffer (that's going to be resident in L2) is not a big enough deal on CPU to outweigh that.

bjacob commented 1 month ago

With all that said, though, since softmax rarely dominates e2e profiles, if it comes down to just the above https://github.com/iree-org/iree/issues/17469#issuecomment-2125388885 alternative, I think I'd still prefer rematerialization (and pay the nearly 2x hit on softmax, hoping that softmax isn't critical) over stack buffers.

benvanik commented 1 month ago

that's going to be resident in L2

Is that true? I thought L2 cache was core-local - when we distribute to 32 cores, we're almost never going to get hits between two workgroups in different dispatches.

bjacob commented 1 month ago

1D softmax is too small (and sequential) to be usefully distributed to multiple threads. N-D softmax has those N-1 parallel dimensions that work well for distribution, and then each thread works on its own separate data?

benvanik commented 1 month ago

(sorry, trying to understand - thanks for explaining :)

In the case of a softmax being too small to distribute to multiple threads I don't think we'd care about exp taking 30 ALU ops twice - it's in the noise compared to other things in the program and decomposing doesn't feel worth it. We should also verify those do actually run on a single core - I suspect we're probably doing something silly like distributing it to dozens of workgroups :)

In the case of it being large and distributed that's the issue I raise: if you have two dispatches A and B each distributed to 200 workgroups and with a global barrier between (all 200 of A must complete before any of the 200 of B can begin), and you have 32 cores available to run those 200 workgroups, the chance of workgroup 0 from A running on the same core as workgroup 0 from B is statistically very low. This is the core reason why for GPUs there's a preference to rematerialize and a many-core CPU looks a lot like a GPU. So I'm curious if we know for certain that rematerializing 20-30 ALU ops is always going to be a significant loss over the two dispatches and a global allocation. Would be worth benchmarking/checking counters/etc before decomposing, IMO.

bjacob commented 1 month ago

Back-of-envelope calculation: if the loop body loads 512 bits = 64 bytes and performs 30 AVX-512 instructions on it, issuing on average 1 such instruction per cycle (kind of a best-case scenario on Zen4), then the loop body takes 30 cycles to process 64 bytes, so it processes ~2 bytes per cycle. So at 4 GHz, each thread might process 8 GB/s. A DRAM channel might achieve ~50 GB/s, so you need at least 6 threads (running at full speed, not sharing a core in SMT) per memory channel to max out DRAM bandwidth. So yes, you can max out DRAM bandwidth even with multiple channels, but you need some truly massive softmax size.

And the above estimate was based on the worst-case scenario of reading data from DRAM every time. If data is resident in L3, according to https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memory-subsystem-and-conclusion/ , we get 27 bytes per cycle, about 2x the above DRAM estimate, so it would take about 13 threads running at full speed (i.e. NOT sharing a core with another SMT thread) to max out one L3 cache.

Note: I'm neglecting load-instruction latency because the memory access patterns are simple enough for this to be perfectly prefetched into L1 (even if data has to cycle through L1 as it doesn't all fit there simultaneously).
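
The same arithmetic as a throwaway Python snippet (the constants are the assumed figures from the estimate above, not measurements):

```python
VECTOR_BYTES = 64        # one 512-bit vector = 16 x f32
INSNS_PER_VECTOR = 30    # ~30 AVX-512 instructions for one vector exp
CLOCK_GHZ = 4.0          # assumed CPU clock
DRAM_CHANNEL_GBPS = 50   # assumed ~50 GB/s per DRAM channel
L3_BYTES_PER_CYCLE = 27  # per-thread L3 read bandwidth (chipsandcheese number)

bytes_per_cycle = VECTOR_BYTES / INSNS_PER_VECTOR          # ~2.1 bytes/cycle
per_thread_gbps = bytes_per_cycle * CLOCK_GHZ              # ~8.5 GB/s per thread
threads_to_max_dram = DRAM_CHANNEL_GBPS / per_thread_gbps  # ~6 threads per channel
threads_to_max_l3 = L3_BYTES_PER_CYCLE * CLOCK_GHZ / per_thread_gbps  # ~13 threads
print(per_thread_gbps, threads_to_max_dram, threads_to_max_l3)
```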

benvanik commented 1 month ago

Not sure what the latest numbers are, just some ancient ones from 2011 (!), and V_EXP_F32 on GCN was 16 cycles. I suspect it's better now. So, in an alternate universe where x86 grew a vector exp instruction this would all be moot? I'm kind of surprised it hasn't yet?

bjacob commented 1 month ago

I'm no hardware expert, but looking at exp's implementation as a sequence of instructions, it seems inherently costly, so if a circuit is able to do it all at a lower latency than an equivalent elementary-op sequence, that hardware probably paid a substantial price for it (I guess the pro word here would be "area" :-) ). It would have to be justified by exp being the bottleneck in applications... which it isn't really for us. We do have exp's in softmaxes and in other guises, and we do wish they were faster, but in the end they only account for a few % of overall profiles.

My above argument wasn't necessarily to argue against rematerialization on CPU, just against hoping for it to be painless performance-wise. Like I said above, https://github.com/iree-org/iree/issues/17469#issuecomment-2125409997 , I still see rematerialization as the lesser evil vs. the stack. This is just to say that it won't be negligible: it will have a clear effect on each worker thread's performance. But good point that in the grand scheme of things, for those larger softmaxes that potentially matter, scalability to many threads (and freedom from stack issues) might be nice enough to live with suboptimal per-thread performance.

bjacob commented 1 month ago

Note: x87 used to have single-instruction FSIN, FCOS and FSINCOS. But, I checked, somehow it didn't have FEXP. Crazy! I guess that drawing perfect ellipses in early 2D graphics was a really big deal at the time.

qedawkins commented 1 month ago

Going back to the original question, I think it's still worth doing the decomposition in GlobalOptimization because the kind of fusion that softmax does should not be unique to softmax (I think it happens by default with --iree-flow-enable-aggressive-fusion today).

bjacob commented 1 month ago

So I'm curious if we know for certain that rematerializing 20-30 ALU ops is always going to be a significant loss over the two dispatches and a global allocation. Would be worth benchmarking/checking counters/etc before decomposing, IMO.

I was indeed not accounting for dispatch overhead. Just reasoning in the abstract, on a self-contained C program. Good point that for practical purposes, dispatch overhead may sway this towards rematerialization.

bjacob commented 1 month ago

@benvanik, I mulled a way to summarize some of the above discussion as a table. My high-level point here is that the 2 in the first row is much smaller than even the next-slowest value in the second row for DRAM access.

| Operation | Speed on Zen4 (bytes per cycle) | Notes |
| --- | --- | --- |
| exp on vector of f32 using 30 AVX-512 instructions | 2 | assuming one issued per cycle, total ~32 cycles to produce 16xf32 |
| read from DRAM | 10 | approximative, depends on CPU clock speed vs RAM speed and channels |
| read from L3 cache | 27 | link |
| read from L2 cache | 32 | link (my inference from the graph) |
| read from L1 cache | 58 | link (my inference from the graph) |
| theoretically, issuing one 512-bit load each cycle | 64 | |

benvanik commented 1 month ago

That's a really nice summary and a great way to reason about these kind of things! I feel like that'd probably be a good template for other times we end up having to make heuristic-based rematerialization tradeoffs.

This issue is just us thinking about it for now, but I think it'd be neat to get the data points for that table with rematerializing vs. not and see where we end up ratio-wise. If we measure rematerializing as taking t and precomputing as taking t/2 instead of t/5 or t/13, it would reveal something about either our cache hit rates or the actual overhead of splitting dispatches on a particular problem size and a particular number of cores on a particular chip layout. Varying any of those may help confirm the intuition about this in general, give us confidence in the heuristics ("any problem under this size should do X" or "any problem over this size should do Y"), and help us eventually land on the speed-vs-size tradeoff slider.

Neat stuff :)