Open jon-chuang opened 1 year ago
Let me try to create a simple regression test case where this pattern (`load + transpose + convert_layout<#mma> + dot`) fails to generate `insert_slice_async`.
Caveat: I am using an old version from 14 July (1.5 months old). I actually should test this on main to see if it can be reproduced.
I have some evidence that main no longer has this perf issue (see: https://github.com/google/jax/pull/17328#issuecomment-1705010065). However, let me add a regression test anyway to confirm it.
Currently, there is no async load op (`insert_slice_async`) generated for this scenario, as documented in https://github.com/openai/triton/blob/13189bfe60e135d3e7e624f8a6d5e953951c1b5e/lib/Dialect/TritonGPU/Transforms/Pipeline.cpp#L448

Maybe related: reported issues with transpose (both perf and correctness): https://github.com/openai/triton/issues/1806, https://github.com/openai/triton/issues/2150, https://github.com/openai/triton/issues/1714