iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[CPU] Disable fusion of matmuls with their producers/consumers in codegen #15595

Open dcaballe opened 10 months ago

dcaballe commented 10 months ago

We are not doing a great job code-generating matmuls or mmt4d ops. The fact that we are fusing matmuls with their consumers adds more complexity on top of that, so I would like to know what you think about disabling this fusion for now so that we can first focus on generating good code for matmuls in isolation. We could keep that fusion under a flag and keep it disabled by default if the regression is pretty significant. WDYT? @MaheshRavishankar @hanhanW
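
For context, the consumer fusion in question looks roughly like this at the linalg level (a hand-written sketch; shapes and value names are made up for illustration). Today dispatch region formation puts both ops into the same dispatch, so matmul codegen has to deal with the extra operands and the trailing elementwise loop:

```mlir
// Matmul followed by an elementwise consumer (a bias add).
// Dispatch region formation currently places both ops in one dispatch.
%mm = linalg.matmul
    ins(%lhs, %rhs : tensor<128x256xf32>, tensor<256x512xf32>)
    outs(%acc : tensor<128x512xf32>) -> tensor<128x512xf32>
%biased = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%mm, %bias : tensor<128x512xf32>, tensor<512xf32>)
    outs(%init : tensor<128x512xf32>) {
^bb0(%m: f32, %b: f32, %out: f32):
  %sum = arith.addf %m, %b : f32
  linalg.yield %sum : f32
} -> tensor<128x512xf32>
```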

benvanik commented 10 months ago

Since we know where we need to end up is fusing with the matmuls, how confident are we that if we disable this and make standalone matmuls good, the approach taken will ever work when fusing? It would be very unfortunate if we disabled the fusion (via a massive, nasty global flag we make users pass, because we don't want to be making this decision any other way), put a bunch of effort into making unfused matmuls fast, and then had to redo the whole thing for fusing.

benvanik commented 10 months ago

Leading question being: if matmul codegen can't handle fusion and we don't think we can easily make it do so (enough that we're trying to disable fusion at the dispatch level), should we just reframe matmul codegen as generating ukernels instead? Run the ukernel conversion and then, instead of linking in bitcode, turn the ukernel op into a func.call to a func.func we codegen. That ensures no fusion for the matmul portion but keeps the tiling/distribution locality benefits of dispatch fusion.
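
A rough sketch of what that could look like inside a dispatch (the function names, signatures, and shapes here are purely illustrative, not the actual ukernel lowering): the matmul portion becomes a call to a separately code-generated function, while the fused consumer math stays in the dispatch around the call:

```mlir
// Declaration of a compiler-generated matmul "ukernel" (illustrative name).
func.func private @codegen_mmt4d_tile(tensor<8x256xf32>, tensor<256x512xf32>,
                                      tensor<8x512xf32>) -> tensor<8x512xf32>

// Body of the dispatch: the matmul portion is a call (opaque to fusion),
// while the fused consumer math stays around it and keeps tile locality.
func.func @dispatch_body(%lhs: tensor<8x256xf32>, %rhs: tensor<256x512xf32>,
                         %acc: tensor<8x512xf32>, %bias: tensor<8x512xf32>)
    -> tensor<8x512xf32> {
  %mm = func.call @codegen_mmt4d_tile(%lhs, %rhs, %acc)
      : (tensor<8x256xf32>, tensor<256x512xf32>, tensor<8x512xf32>)
        -> tensor<8x512xf32>
  %out = arith.addf %mm, %bias : tensor<8x512xf32>
  return %out : tensor<8x512xf32>
}
```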

qedawkins commented 10 months ago

I would -1 this because those (consumer) fusions can be pretty important on GPU, and fusion decisions are shared. Also, I don't think we currently have a way to make target-backend-dependent decisions in Flow (and ideally we should try to avoid it as much as possible; there will always be different requirements for different targets, though). IIRC matmuls aren't fused with any producers by default today outside of special cases, and that largely makes sense to me, as that rematerializes compute and needs more careful thought.

There can always be cases where a particular architecture really does (or doesn't) want a particular fusion and in those cases we should offer the target such control. In this case though, my understanding is that the fusions probably are lucrative, we just aren't doing well on them yet.

dcaballe commented 10 months ago

The main motivation here is that we have to walk before we run. Trying to optimize a matmul with a sometimes pretty large set of fused consumers, which arbitrarily changes the memory footprint of the loop nest, is far more complicated than optimizing a matmul in isolation. The goal here is to take a step back so that we can get out of the rabbit hole we are in with codegen and build all the required technology incrementally, making sure we are able to optimize matmuls in isolation and then matmuls with arbitrary producers and consumers (no interest in over-optimizing for matmuls in isolation, really). This is not purely a codegen problem either, as matmul fusion will also impact DT and UKs in general. However, matmul fusion is not enabled for DT and UK right now, only for pure codegen.

Definitely not my goal to disrupt the GPU work! Hopefully we can find a way to help with CPU codegen without impacting the GPU backend. It seems weird, though, that we have matmul fusion enabled for codegen but not for DT and UK, and that we can't have matmul fusion enabled for GPU but not for CPU. I would expect different backends to have different levels of maturity and support, so a certain level of optionality seems necessary here.

(Not the goal of this discussion but, regarding the lucrative aspect of matmul fusion, fusing matmuls beyond the bias is a controversial topic in general, at least on some CPUs. I've heard strong opinions against and in favor of it. Challenging matmul fusion, though, is not the goal here at all)

qedawkins commented 10 months ago

> Definitely not my goal to disrupt the GPU work! Hopefully we can find a way to help with CPU codegen without impacting the GPU backend. It seems weird, though, that we have matmul fusion enabled for codegen but not for DT and UK, and that we can't have matmul fusion enabled for GPU but not for CPU. I would expect different backends to have different levels of maturity and support, so a certain level of optionality seems necessary here.
>
> (Not the goal of this discussion but, regarding the lucrative aspect of matmul fusion, fusing matmuls beyond the bias is a controversial topic in general, at least on some CPUs. I've heard strong opinions against and in favor of it. Challenging matmul fusion, though, is not the goal here at all)

It sounds like we're mostly on the same page then, but this is a bit of a can of worms. Target-dependent fusion decisions mean that, when compiling for multiple targets, the targets have to agree, or we end up in a situation where we basically have to fork the whole model for every target. I'd like to know what others' opinions are, but what if we allowed target backends to make local fusion decisions when compiling solely for that target, and deferred to defaults otherwise? That probably needs some hooks in here: https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/HAL/Target/TargetBackend.h

Flow already takes a dependency on HAL, so maybe this is fine? This probably needs a larger discussion, though, on how fusion should be modeled across backends. For GPU we often want to fuse fairly aggressively, as models (especially LLMs) are often memory bound, and based on what you're saying, the same won't always be true for CPU.

benvanik commented 10 months ago

Flow should not take a dependency on HAL, it's really not fine at all :) I'm -1000 on any global arch-specific changes. I'm -100 on target-specific changes. I'm neutral on developer flags for experimentation with the knowledge that they will never be flipped or deployed. We're a holistic code generating compiler with multi-targeting - that's our purpose - anything that violates holistic code generation or multi-targeting gets my strongest push back :)

qedawkins commented 10 months ago

> Flow should not take a dependency on HAL, it's really not fine at all :) I'm -1000 on any global arch-specific changes. I'm -100 on target-specific changes. I'm neutral on developer flags for experimentation with the knowledge that they will never be flipped or deployed. We're a holistic code generating compiler with multi-targeting - that's our purpose - anything that violates holistic code generation or multi-targeting gets my strongest push back :)

Hmmm, where is the current dependency coming from, then?

benvanik commented 10 months ago

Tech debt that we don't want to add more to - by, say, making global arch-specific decisions :)

qedawkins commented 10 months ago

Oh it's this: https://github.com/openxla/iree/blob/main/compiler/src/iree/compiler/Dialect/Flow/Transforms/ExportBenchmarkFuncs.cpp

qedawkins commented 10 months ago

> Tech debt that we don't want to add more to - by, say, making global arch-specific decisions :)

Maybe this is better for a direct discussion, but is what you're saying here that we should never enable users to make arch-specific fusion decisions, or that we should never enable such decisions by default?

benvanik commented 10 months ago

Shouldn't make them by default (because we can't) - and it's dispatch-level fusion (what goes in dispatch A vs dispatch B) that is the thing at issue - once inside a dispatch things can vary as much as they want.

qedawkins commented 10 months ago

> Shouldn't make them by default (because we can't) - and it's dispatch-level fusion (what goes in dispatch A vs dispatch B) that is the thing at issue - once inside a dispatch things can vary as much as they want.

OK, makes sense, but then should we allow custom control over fusion? I think that was the idea behind adding a transform dialect spec for fusion, as well as a custom fusion pipeline plugin.

benvanik commented 10 months ago

🤷 Such knobs are neat, but not a solution: if matmul codegen can't codegen a matmul with a few arithmetic ops on writeback, then it's not the codegen we require for a holistic code-generating compiler.

benvanik commented 10 months ago

The consumers we fuse should be things like bias adds and other quantization-like stuff - we don't fuse convs into matmuls and such. What specifically are you seeing that you don't think should be fused? There are definitely cost-model improvements to be had that don't resort to all-or-nothing and that are likely to apply to many/most targets.

It shouldn't be controversial (or at least, it definitely isn't here) that fusing some basic arithmetic on values in registers or in L1 cache before writing small tiles of data back to main memory is a universally good idea - it has been for nearly the entirety of computing. It shouldn't require any additional storage to do so, or if it does, it should be scoped to a single tile of work (KiB) - if it requires much more than that, that's the issue we should fix, and the one I'm concerned we won't fix if we just stop feeding in the ops and pretend they don't exist.

Global writebacks into main memory, global barriers on all tensor contents, global readbacks of that written memory, and then basic arithmetic whose results likely get written back all over again (because they're not fused into subsequent matmuls) should never be more efficient on any architecture of any type (talking joules or time). Matmuls should practically never be write-bound on processors of the class we care about - are you aware of architectures where we both care about the performance (the matmuls are big/a high percentage of total model cost for interesting models) and are write-bound such that an additional handful of in-register or in-cache ALU ops slows things down or consumes more power? I know of very tiny systems that have zero pipelining, but those are the ones where ukernels are a way better solution than code generation anyway, as they are extremely sensitive to register allocation and such. If a hand-authored ukernel written for such archs is the best solution, we should just do that for those archs.
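
To make "a few in-register ALU ops on writeback" concrete, here is a simplified, hand-written sketch (tile sizes and value names invented for illustration) of a tiled matmul with the bias add applied to the live result tile before it is written back:

```mlir
// Hand-written sketch: bias add fused onto the result tile while it is
// still live, before the single write-back of the tile.
func.func @tiled_matmul_bias(%lhs: tensor<128x256xf32>, %rhs: tensor<256x512xf32>,
                             %bias: tensor<128x512xf32>, %init: tensor<128x512xf32>)
    -> tensor<128x512xf32> {
  %c0 = arith.constant 0 : index
  %c8 = arith.constant 8 : index
  %c128 = arith.constant 128 : index
  %res = scf.for %i = %c0 to %c128 step %c8 iter_args(%out = %init)
      -> (tensor<128x512xf32>) {
    %lhs_t = tensor.extract_slice %lhs[%i, 0] [8, 256] [1, 1]
        : tensor<128x256xf32> to tensor<8x256xf32>
    %acc_t = tensor.extract_slice %out[%i, 0] [8, 512] [1, 1]
        : tensor<128x512xf32> to tensor<8x512xf32>
    %bias_t = tensor.extract_slice %bias[%i, 0] [8, 512] [1, 1]
        : tensor<128x512xf32> to tensor<8x512xf32>
    // Matmul on a small tile that stays in registers / L1.
    %mm_t = linalg.matmul
        ins(%lhs_t, %rhs : tensor<8x256xf32>, tensor<256x512xf32>)
        outs(%acc_t : tensor<8x512xf32>) -> tensor<8x512xf32>
    // Fused consumer: a handful of ALU ops on the live tile.
    %biased_t = arith.addf %mm_t, %bias_t : tensor<8x512xf32>
    // Single write-back of the finished tile to the result.
    %next = tensor.insert_slice %biased_t into %out[%i, 0] [8, 512] [1, 1]
        : tensor<8x512xf32> into tensor<128x512xf32>
    scf.yield %next : tensor<128x512xf32>
  }
  return %res : tensor<128x512xf32>
}
```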

As suggested above (and I think it's an interesting avenue regardless): if codegen can't handle anything but bias adds, now or in the future, due to the design of the system, then we would need to rescope what this version of matmul codegen is doing. If codegen has the same limits as ukernels (note that ukernel != mmt4d) - which can't fuse beyond what they were coded to fuse - then matmul codegen should scope itself to generating ukernels. Global dispatch fusion decisions are not something we want to make based on limitations of certain target codegen approaches. Codegenning a ukernel is a strict subset of codegenning a full dispatch and should be the easy incremental case you want to walk forward without disrupting the whole system. Basically, if matmul codegen on a particular arch cannot work at the dispatch level, then we need to scope it below that level, not limit all dispatches on all archs to what current matmul codegen can handle.

I'm mainly saying this as a way to avoid sidelining matmul codegen even further: if it becomes easier to build, maintain, and deploy hand-authored ukernels than to use matmul codegen because it can't provide the benefits it's intended for (not requiring ukernel-style fixed-function definitions, arch-specific global program changes, and other sorts of hand-holding), then focus will shift toward more ukernels and less codegen. At the point where we have two very high-level paths for executing fixed-function matmuls that require forks of the entire model/compilation flow, we're going to have to pick only one to support, and that will be the one we can guarantee performance with. Let's try to avoid having to make those decisions - and the thing that will force such a decision is trying to make dispatch-level fusion decisions :)

qedawkins commented 10 months ago

Gah, that was way more than I could type on my phone. I think your response there is directed at the start of the thread (and maybe I'm just adding noise at this point), but my understanding was that the motivation behind such knobs was initially to enable extra fusion, not to disable the ones we have on right now. This is what I'm trying to glean with my line of questioning: whether there is any room in our compiler for extra fusion decisions like that. Flash attention, while very much tied to a framework-level view of ML models, still showed that there is room for fusions on some architectures that might fundamentally not be worth it on others. FA (and anything FA-like) might end up being a "universally lucrative thing," but it does cost parallelism. Such decisions always have tradeoffs, and I'm wondering how we balance that without making arch-specific decisions.

Maybe the answer here is just that we never "enable it by default" and rely on downstream flows to make those decisions with appropriate plugin points; that makes sense to me. (Not trying to be argumentative, just trying to poke at the bounds of what can and can't be done with IREE's design.) For example, would it ever be possible to recover a "dispatch granularity" within a dispatch? I don't mean from HAL's perspective, but rather if, say, today's Vulkan/SPIR-V codegen decided it wanted to execute a small DAG of shaders for a particular dispatch instead of just one.

MaheshRavishankar commented 10 months ago

I might have missed details, but overall I +100 what Ben is saying. I think we are looking at two ends of the spectrum here: 1) no fusion and 2) aggressive fusion. I have been trying to thread the middle ground. The default fusion done by IREE is primarily driven by an analysis of access patterns in the producer and consumer (and the assumption that we use tile-and-fuse to generate code). So fusions like matmul/conv + elementwise fall into that category. Fusing producers into a consumer conv/matmul introduces redundant computation, so it wasn't done until now, but I think we are at a place where we need to fuse dequantization with its consumers, so that is being worked on. In my view any first-class backend should support these fusions.
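
For illustration, the dequantization-producer fusion in question looks roughly like this at the linalg level (a schematic fragment; shapes and value names such as %scale are made up). Fusing the producer into the matmul dispatch avoids materializing the full dequantized weight tensor:

```mlir
// Dequantize i8 weights to f32 (scale captured from the enclosing scope),
// then matmul. Fusing the dequant into the matmul dispatch avoids
// materializing the full f32 weight tensor in memory.
%weights_f32 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%weights_i8 : tensor<256x512xi8>)
    outs(%init_w : tensor<256x512xf32>) {
^bb0(%w: i8, %out: f32):
  %wf = arith.sitofp %w : i8 to f32
  %dq = arith.mulf %wf, %scale : f32
  linalg.yield %dq : f32
} -> tensor<256x512xf32>
%mm = linalg.matmul
    ins(%act, %weights_f32 : tensor<128x256xf32>, tensor<256x512xf32>)
    outs(%acc : tensor<128x512xf32>) -> tensor<128x512xf32>
```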

Now the flip side. Defining fusion as "put these in a dispatch" is a very simplistic way of seeing fusion, and that part is really easy to do. The real question is whether you can generate code for everything in a dispatch in a way that is not a one-off (really, if your codegen pipeline is a one-off code-generation flow, you might as well call a library and call it a day; the point of codegen is that you don't need to keep doing one-offs).

For experimentation purposes, you can easily write a pass that moves the things you want to "fuse" into a dispatch, but unless you have some strategy to generate code for whatever lands in a dispatch, it won't scale.

Coming back to the problem at hand: @dcaballe I am guessing you are trying multi-level tiling, tiling the reduction dimension (which stops fusion at the highest level of cache), and running into issues? If you want to experiment, I am happy to give you a developer flag that disables that fusion, but that flag would probably never be turned on by default.
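
Schematically, that interaction looks like this (a fragment; tile sizes, constants, and tile SSA values are assumed to be defined in the surrounding function): once the reduction dimension is tiled with a loop carrying the partial accumulator, an elementwise consumer can only be applied after the loop completes, not inside the innermost tile loop:

```mlir
// Once K is tiled, the accumulator tile is only complete after the loop,
// so an elementwise consumer (e.g. a bias add) can only be applied after
// the reduction, not inside the innermost tile loop.
%acc_done = scf.for %k = %c0 to %c256 step %c32 iter_args(%acc = %acc_init)
    -> (tensor<8x512xf32>) {
  %lhs_k = tensor.extract_slice %lhs_tile[0, %k] [8, 32] [1, 1]
      : tensor<8x256xf32> to tensor<8x32xf32>
  %rhs_k = tensor.extract_slice %rhs[%k, 0] [32, 512] [1, 1]
      : tensor<256x512xf32> to tensor<32x512xf32>
  %partial = linalg.matmul
      ins(%lhs_k, %rhs_k : tensor<8x32xf32>, tensor<32x512xf32>)
      outs(%acc : tensor<8x512xf32>) -> tensor<8x512xf32>
  scf.yield %partial : tensor<8x512xf32>
}
// The fused consumer can only fire here, once the reduction is finished.
%biased = arith.addf %acc_done, %bias_tile : tensor<8x512xf32>
```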

dcaballe commented 10 months ago

Thanks for all the feedback. Not sure how the discussion went off the rails again, though :_(. The motivation is very simple: we want to implement feature A, and we should start from the simplest use case possible and, once that is working OK, incrementally build upon it to support more complex cases. This is not a codegen vs. ukernel discussion, this is not a codegen limitation, and there are no plans to optimize matmuls in isolation and stop there… it’s just simple software engineering practice… No side interpretations of any kind, really. It’s just how I would like to approach matmul/mmt4d codegen. Hopefully I can get some trust here.

Regarding matmul fusion, I’ve seen cases where we fused A LOT, which brought a bunch of new input tensors into the matmul loop nest. That will obviously impact the strategy for optimizing the matmul itself. I just want to avoid that complexity in the first iterations of the work. Introducing a flag is far from ideal, but I can resign myself to that if there is no other option.

Regarding fusion being a hardware-independent optimization… I could add some experience to that from affine fusion and ICC. However, what is more concerning to me is that a few seasoned engineers have raised similar concerns on the matter and we haven’t done much to mitigate this limitation. It would be great if we could think about what we could do here in general, in a way that works for the IREE design that you have in mind. If that means using encodings or some kind of multi-stage fusion approach (one generic stage (maximal fusion?) plus one hw-dependent stage (a refinement of the previous one?)), so be it, but I think we need to do something about this. We don't necessarily have to discuss this here, though :). Mai-Tai could be a better venue?

stellaraccident commented 10 months ago

I "uggh'd" a bit when I doom scrolled through this thread this morning, as it is a very familiar fault line. Can we please take the easy incremental out while acknowledging the principles:

  1. The goal is to have code generation that is not functionally fragile, or grossly fragile in performance, to this kind of high-level decision. If a developer flag helps the humans making that true do their work, then let's do that. (Note: if I read past all of the other commentary, that sounds like what is being asked for. Is it?)
  2. High-end performance sometimes requires specific fusion heuristics to fire differently. In my experience, this isn't so much hardware specific as the fuzzy area of workload+user+compiler cooperation. We should have mechanisms to reach these highest tiers of performance. The compiler should generate correct and grossly performant code for any of the combinations.
  3. This is not the first time I've heard of a desire for some form of multi-level fusion, etc (ironically from all parties on this thread but possibly using incompatible vocabulary choices). This is a design discussion. Let's have that in a forum optimized for discussing such things if it is important to solve.

(edit: to the extent it helps, please ping me to make sure I join the mai-tai sync where we discuss this. If I can, I'm happy to help translate perspectives)

MaheshRavishankar commented 10 months ago

We can discuss more in the mai-tai meeting, but really the fusions we do today aren't in the "aggressive fusion" category; they are pretty basic. So for anything done today, I don't think we have to limit it on any hardware that is one of the supported backends in IREE (CPU or GPU). (If anything, we field more requests for exotic fusions beyond what is done in IREE.)

stellaraccident commented 10 months ago

Yeah, I agree with that. We have to be on a strong trajectory to more advanced fusions with general support.

I just want to make sure we aren't aliasing the end state with practical tools people need to get there.