Open pzread opened 1 year ago
Not sure about the solution here. I think it's hard for `linalg::getConstantUpperBoundForIndex` in `LLVMCPUCheckIRBeforeLLVMConversion` to know the upper bound of an `affine.apply`.
If it's not easy to solve, maybe we can disable the check in `LLVMCPUCheckIRBeforeLLVMConversion` as a temporary solution.
@hanhanW FYI
Reopening since we decided to disable some tests for this issue.
This problem seems to get worse when we enable vector masking (https://github.com/iree-org/iree/pull/11286). I'm now getting similar stack overflow errors when building `iree-test-deps`:
```
/usr/local/google/home/diegocaballero/iree2/tests/e2e/models/mobilenetv3_fake_weights.mlir:977:12: error: 'builtin.module' op expected total size of stack allocation is not greater than 32768 bytes, but got 74112 bytes
  %731 = mhlo.maximum %730, %262 : tensor<1x1x1x144xf32>
/usr/local/google/home/diegocaballero/iree2/build/debug/tests/e2e/matmul/e2e_matmul_direct_f32_large_split_k_llvm-cpu_local-task.mlir:12:13: error: 'builtin.module' op expected total size of stack allocation is not greater than 32768 bytes, but got 53504 bytes
```
From a compiler perspective, I wonder how it is that we are not using heap allocation. It wasn't mentioned in our discussions, so I guess there are some limitations that prevent us from doing so, but I'm pretty surprised that we can get things moving with only stack allocations. Anybody know?
There seem to be even more stack allocation issues when building the benchmark suite with masking. We need to think about what to do, because this is now a blocker. Any other thoughts @pzread, @MaheshRavishankar, @hanhanW?
Do we know the source of those stack allocations? Are they unbounded allocations, or fixed-size but too large? If they are fixed-size but too large, then that's probably what we want to avoid, and we need to look into how to eliminate them.
I'm quite curious about why vector masking will generate more stack allocations.
> I guess there are some limitations that prevent that from doing so but I'm pretty surprise that we can get things moving only with stack allocations.
I think IREE does not want heap allocation in codegen because not every device target has a big heap. @benvanik knows more about it. If we really need a big heap buffer, we can maybe propagate it to the HAL and pass the buffer as one of the inputs/outputs. @benvanik mentioned that it's doable and he might have already built some pieces that we can reuse.
One question is: why do we need a big buffer here? Looking into the input IR, and assuming that there is a producer (which is a matmul) in the same dispatch:

```mlir
matmul ... -> tensor<1x1x144xf32>
%731 = mhlo.maximum %730, %262 : tensor<1x1x1x144xf32>
```

Do we really need an extra buffer to codegen the dispatch? If so, why is it larger than 32 KB?
(I thought integrating vector masking would be an NFC step, and then we will extend the usage for getting better performance. but it seems not..?)
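As a quick sanity check on the sizes involved (an illustrative Python sketch, not IREE code), the tensor in the error message is tiny on its own, so the 74112-byte allocation reported above has to come from other intermediates in the dispatch, not from this result tensor:

```python
# Illustrative only: compute the byte size of a statically shaped tensor.
def tensor_bytes(shape, elem_bytes=4):
    """Number of bytes for a dense tensor of f32-like elements."""
    n = 1
    for dim in shape:
        n *= dim
    return n * elem_bytes

# tensor<1x1x1x144xf32> holds 144 f32 elements.
print(tensor_bytes([1, 1, 1, 144]))  # 576 -- far below the 32768-byte limit
```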
Could you post the IR so we can diagnose the problem better? But broadly, it's what Hanhan says. We only allow limited stack allocations and no additional heap allocations within a dispatch. So for the dispatches we create, we should never need stack or heap allocations (or at least not beyond the bound). If you vectorize, you mostly shouldn't need any heap allocation, so I am not sure why masking is requiring more stack/heap allocation...
+1 hanhan/mahesh! There are ways to get transient buffers but I've yet to see a case where it's actually needed in any of the kind of things we codegen (today) so when issues are encountered it's a good indicator that something in the pipeline is failing.
Hopefully useful background because it's been a year or two since this last came up:
The base assertion is that if you're doing a local workgroup worth of work you shouldn't need a global problem-sized block of allocated memory. If, however, you're just on the border of what's reasonable in the local scope (maybe this issue?) that's something that will need to factor into distribution across all backends, as they all have similar limits (shared memory sizes on GPUs, etc). An example of one way we could handle this is to set up the codegen translation pipeline to try to codegen with the given distribution, fail when too much local memory is used, and retry with finer distributions until it succeeds. We will need this kind of behavior in some form anyway, as when we are targeting a broad spectrum of Vulkan/Metal/etc devices we're likely to need to produce those variants: https://vulkan.gpuinfo.org/displaydevicelimit.php?name=maxComputeSharedMemorySize&platform=all (today we just hardcode the attributes, but that's not something we can ship). That only works if we can get to a reasonable point where scaling the distribution scales the local memory use, though, and the best case would be if it's parametric. Parametric representations of this stuff ("given this workload and distribution how much memory do you need?" -> "given this workload and available memory how much distribution do you want?") would let us tune (certain) aspects of distribution at runtime based on the available hardware.
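The retry-with-finer-distribution idea above can be sketched in a few lines. This is a hypothetical illustration, not IREE's pipeline: `pick_tile_size` and its crude "one square tile in local memory" cost model are made up for the example.

```python
# Hypothetical sketch: try a distribution, and if the estimated
# workgroup-local memory exceeds the target's limit, refine and retry.
def pick_tile_size(workload, limit_bytes, elem_bytes=4, start_tile=128):
    tile = start_tile
    while tile >= 1:
        # Crude estimate: one square tile of f32 data kept in local memory.
        local_bytes = tile * tile * elem_bytes
        if local_bytes <= limit_bytes:
            return tile
        tile //= 2  # finer distribution, retry
    raise RuntimeError("no distribution fits the local memory limit")

# With the 32 KB limit from this issue: 128*128*4 = 65536 is too big,
# 64*64*4 = 16384 fits.
print(pick_tile_size(workload=None, limit_bytes=32768))  # 64
```

The parametric form Ben describes would invert this: solve for the distribution directly from the memory budget instead of probing.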
But the best thing to keep in mind both in the short- and long-term is to focus on minimizing workgroup local memory sizes as much as practically/algorithmically possible. It would be an interesting metric to track over time (alongside dispatch binary size and such) as it's usually only possible to see the derivative aspects in vendor tools (decreased utilization because higher shared resource contention/etc). Stare at the tools long enough and you get a healthy fear of shared memory when it's not absolutely required: That is to say if someone introduces a change that adds even just a few KB of additional shared memory consumption (a few wide vectors) then occupancy will be halved so IMO any problems we have around additional large shared memory allocations being produced by codegen should be P1 performance issues. This is really critical on GPU but also important on CPU: our "shared memory" is CPU data caches and if we spill out of that we risk introducing cache contention and thus lowering occupancy.
Thank you all for the detailed feedback, and thank you, Che Yu, for giving me some hints for the investigation! I fixed the problem. We generated a pad + matmul, and the pad operation wasn't vectorized with my code whereas the matmul was. That created a temporary buffer between the two ops larger than the current 32KB upper limit.
I agree with the approach and principles described above. However, from a production perspective, I wonder if we are not prioritizing performance over support/coverage. We may want the compiler to generate "slower" code that can run instead of bailing out of compilation. Is the use of shared memory always mandatory? Are we able to fall back to global memory for cases where we exceed a specific shared memory threshold? Have we considered heap allocation or the use of smart allocators for some targets? (Sorry, lots of questions :)) I think these points may become more relevant as we invest more in datacenter support.
All this should actually be factored into tile size selection, using this limit while deciding which tile sizes/configurations to choose... Ben can probably explain the tradeoffs for the other questions.
Coming from the other issue. Here is the IR before bufferization: https://gist.github.com/hanhanW/3dfd7d87a456f28319c6569e2d498cfd/raw
```mlir
%18 = flow.dispatch.tensor.load %9, offsets = [%arg0], sizes = [%17], strides = [1] : !flow.dispatch.tensor<writeonly:tensor<?xf32>>{%6} -> tensor<?xf32>
%21 = scf.for %arg1 = %c0 to %20 step %c4 iter_args(%arg2 = %18) -> (tensor<?xf32>) {
  ...
}
%22 = scf.for %arg1 = %20 to %17 step %c4 iter_args(%arg2 = %21) -> (tensor<?xf32>) {
  ...
}
flow.dispatch.tensor.store %22, %9, offsets = [%arg0], sizes = [%17], strides = [1] : tensor<?xf32> -> !flow.dispatch.tensor<writeonly:tensor<?xf32>>{%6}
```
@matthias-springer I'm wondering if we can teach bufferization to not allocate a stack buffer for this type of case? It looks like all the operations can reuse the destination buffer?
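To make the "reuse the destination buffer" question concrete, here is a toy Python illustration (plain Python, not bufferization itself) of the two lowerings: one allocates a temporary and copies it into the output, the other writes into the destination the loop already carries, which is what we'd want bufferization to produce here. Both helper names are made up for the example.

```python
# Hypothetical: an extra allocation plus a copy into the destination.
def step_alloc_copy(dest, chunk):
    tmp = [x * 2.0 for x in chunk]   # new temporary allocation
    dest[:len(tmp)] = tmp            # then a copy into the destination
    return dest

# Hypothetical: reuse the destination buffer directly, no temporary.
def step_in_place(dest, lo, hi):
    for i in range(lo, hi):
        dest[i] = dest[i] * 2.0      # write results in place
    return dest

out = [1.0] * 4
print(step_in_place(out, 0, 4))  # [2.0, 2.0, 2.0, 2.0]
```

In the IR above, the `iter_args` already thread the loaded destination tensor through both loops, so in principle every write can land in that buffer.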
@hanhanW These two changes should remove the alloc + copy: https://reviews.llvm.org/D140007 https://reviews.llvm.org/D140008
@hanhanW I prepared #11550, which cherry-picks those commits and a few others that I need (for a different purpose).
Thanks a lot for the quick fix!!
Awesome! Should we close this issue?
There are still two test cases that were not fixed by #11550:
integrations/tensorflow/test/iree_tf_tests/math/llvmcpu__dynamic_dim_log_softmax.run:# TODO(#11196): Re-enable once the issue is resolved.
integrations/tensorflow/test/iree_tf_tests/math/llvmcpu__dynamic_dim_softmax.run:# TODO(#11196): Re-enable once the issue is resolved.
Hey @matthias-springer, what is the state of the last two cases that were not fixed by #11550? Is there anything pending here?
We found that `LinalgStrategyPeelPass` optimizes away some `affine.min` ops used by `tensor.empty`, which makes a later check unable to calculate the upper bound of the memory allocation. This was originally found when debugging a failed test, `iree_tf_tests/math/llvmcpu__dynamic_dim_softmax.run`, with a WIP change (#10770). Here is a crafted example to reproduce the issue:

By compiling the example with `iree-compile --iree-hal-target-backends=llvm-cpu`, the compiler throws an error:

What Happened
Before `LinalgStrategyPeelPass`, the example is tiled into:
After `LinalgStrategyPeelPass`, the output is:
We can notice that the size of the tensor allocation in the second loop becomes a simple `affine.apply` due to the optimization [1]:

while originally it was:

This causes the later `LLVMCPUCheckIRBeforeLLVMConversion` to fail to derive the upper bound of the tensor allocation [2].
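The failure mode can be sketched abstractly. This is a minimal Python model of the bound derivation, not IREE's `getConstantUpperBoundForIndex`: an `affine.min` against a constant yields a constant upper bound, while a bare `affine.apply` over a symbolic size has none. The tuple encoding and the symbol `s0` are invented for the illustration.

```python
import math

# Toy model of deriving a constant upper bound for an allocation size.
# expr is a nested tuple: ('min', a, b), ('apply', symbol), or an int.
def constant_upper_bound(expr):
    if isinstance(expr, int):
        return expr
    op = expr[0]
    if op == 'min':
        # min(a, b) is bounded by the smaller of the operands' bounds.
        return min(constant_upper_bound(expr[1]),
                   constant_upper_bound(expr[2]))
    if op == 'apply':
        # A symbolic affine.apply has no constant upper bound.
        return math.inf
    raise ValueError(f"unknown op: {op}")

# Like affine.min(%workload - %iv, 4): bounded by the constant 4.
print(constant_upper_bound(('min', ('apply', 's0'), 4)))  # 4
# After peeling, only the affine.apply remains: no constant bound.
print(constant_upper_bound(('apply', 's0')))              # inf
```

This matches the symptom: once peeling folds the `affine.min` away, the check can no longer prove the allocation fits under the 32 KB limit, even though the peeled loop's trip structure still guarantees it.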