Max191 closed this 2 months ago
After looking into the failing test, I don't think we are ready for this flip yet. We need a better way of handling shared memory.
The test failure is caused by two `tensor.empty()` ops getting tiled to the same size and then CSE'd into a single empty. However, the empty ops are a somewhat hacky way to represent the shared memory buffers when running `GPUPromoteMatmulOperandsPass`. We need a better representation of the shared memory buffers at the tensor level so we don't CSE them into a single buffer. The parallelism check in bufferization has been saving us by recreating a new buffer, but it is not a good way to handle this issue.
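For illustration, a minimal sketch of the failure mode (the f16 shapes are hypothetical, not taken from the test): once tiling shrinks both promoted operands to the same shape, the two empties become structurally identical and CSE folds them into one:

```mlir
// Before CSE: one staging tensor per promoted matmul operand.
%lhs_shared = tensor.empty() : tensor<64x32xf16>
%rhs_shared = tensor.empty() : tensor<64x32xf16>

// After CSE: a single op, i.e. a single shared memory buffer, unless the
// parallelism check in bufferization happens to recreate the second one.
%shared = tensor.empty() : tensor<64x32xf16>
```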
A couple of ideas would be to create `alloc_tensor` ops, or to implement some multi-buffering with both shared memory allocs split across a single tensor.
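The single-tensor idea could look something like this (hypothetical shapes again): one combined allocation, with each operand taking a disjoint slice, so there is nothing identical left for CSE to fold:

```mlir
// One backing tensor for both operands; the slices are distinct ops with
// distinct offsets, so CSE cannot merge them.
%combined = tensor.empty() : tensor<128x32xf16>
%lhs_shared = tensor.extract_slice %combined[0, 0] [64, 32] [1, 1]
    : tensor<128x32xf16> to tensor<64x32xf16>
%rhs_shared = tensor.extract_slice %combined[64, 0] [64, 32] [1, 1]
    : tensor<128x32xf16> to tensor<64x32xf16>
```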
CC @MaheshRavishankar @qedawkins @antiagainst
Wherever the `tensor.empty` is created we could create a `bufferization.alloc_tensor` (or whatever) op. IIUC this op does not CSE; this was explicitly the reason it was split out from `tensor.empty`.
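If I follow, that would look roughly like this (hypothetical shapes again), relying on `bufferization.alloc_tensor` having allocation semantics so CSE leaves the two ops alone even though their types match:

```mlir
%lhs_shared = bufferization.alloc_tensor() : tensor<64x32xf16>
%rhs_shared = bufferization.alloc_tensor() : tensor<64x32xf16>
// Unlike tensor.empty, these two ops stay distinct through CSE, so each
// promoted operand keeps its own shared memory buffer after bufferization.
```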
https://github.com/iree-org/iree/pull/17940 fixes the test failure. Rebasing this PR on top of it for now.
@ commit e38fae5421272fe594b7f0e1c09fa09543b50836 (no previous benchmark results to compare)
| Benchmark Name | Average Latency (ms) | Median Latency (ms) | Latency Standard Deviation (ms) |
|---|---|---|---|
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 221.751 | 222.448 | 1.726 |
| BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu][experimental-flags,no-dt] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 788.727 | 780.481 | 36.115 |
| DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu][default-flags,dt-uk] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] | 8.491 | 8.492 | 0.018 |
[Top 3 out of 92 results shown]
No improved or regressed compilation metrics 🏖️
When there is a buffer used inside of an `scf.forall` op that is defined outside of the `scf.forall`, bufferization will unconditionally bufferize out of place by default in order to avoid race conditions. However, handling parallel accesses to a buffer should generally be the responsibility of the source program, and if there is a race condition, then it should be handled outside of bufferization. This PR disables the parallel region check in IREE to simplify the bufferization analysis and enable more buffer reuse.

It is possible that this PR could cause race conditions if data races are not handled properly and we have been relying too much on bufferization to be conservative. Turning this option off could be a good early step in diagnosing data races on GPU.
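For context, a minimal sketch of the pattern the check guards (the ops and shapes are illustrative, not taken from the failing test):

```mlir
%cst = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<32xf32>
scf.forall (%i) in (4) {
  // %empty is defined outside the scf.forall but written inside it. With
  // the parallel-region check enabled, bufferization conservatively
  // bufferizes the fill out of place, giving each thread a private copy.
  // With the check disabled, the fill may bufferize in place, so all
  // threads write to the same underlying buffer and avoiding races becomes
  // the source program's responsibility.
  %filled = linalg.fill ins(%cst : f32) outs(%empty : tensor<32xf32>) -> tensor<32xf32>
}
```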