triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/

Add async load for fused mma `convert_layout` and transpose (`ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16`) #2233

Open · jon-chuang opened this issue 1 year ago

jon-chuang commented 1 year ago

Currently, no async load op (`insert_slice_async`) is generated for this scenario, as documented in https://github.com/openai/triton/blob/13189bfe60e135d3e7e624f8a6d5e953951c1b5e/lib/Dialect/TritonGPU/Transforms/Pipeline.cpp#L448


Possibly related: previously reported issues with transpose (both performance and correctness): https://github.com/openai/triton/issues/1806, https://github.com/openai/triton/issues/2150, https://github.com/openai/triton/issues/1714

jon-chuang commented 1 year ago

Let me try to create a simple regression test case in which this pattern (load + transpose + `convert_layout<#mma>` + dot) fails to generate `insert_slice_async`.
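For reference, a minimal reproducer along these lines might look like the sketch below (the kernel name, tile sizes, and the use of `tl.trans` are my own choices, not taken from this issue). B is stored as (N, K) and transposed in-register before being fed to `tl.dot`, so the B operand needs a transpose ahead of the MMA-layout conversion:

```python
import triton
import triton.language as tl


@triton.jit
def matmul_bt_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bn, stride_bk,  # B is stored as (N, K) and read transposed
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_n[:, None] * stride_bn + offs_k[None, :] * stride_bk

    # No masks: M, N, K are assumed to be multiples of the block sizes.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)   # (BLOCK_M, BLOCK_K)
        b = tl.load(b_ptrs)   # (BLOCK_N, BLOCK_K)
        # Transpose + dot: the B operand is transposed in-register and then
        # converted to the MMA layout before the dot.
        acc += tl.dot(a, tl.trans(b))
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```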


Caveat: I am using an old version from 14 July (about 1.5 months old). I should test this on `main` to see whether it can still be reproduced.

jon-chuang commented 1 year ago

I have some evidence that `main` no longer has this performance issue (see https://github.com/google/jax/pull/17328#issuecomment-1705010065). However, let me add a regression test anyway to confirm it.
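As a sketch of how such a regression check could work (assuming the hypothetical `matmul_bt_kernel` above, and that the kernel launch returns a compiled-kernel handle exposing an `.asm` dict of intermediate representations, which is true of recent Triton 2.x releases but version-dependent), one could launch once and search the TTGIR/PTX for the async-copy ops:

```python
import torch

M = N = K = 1024
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((N, K), device="cuda", dtype=torch.float16)  # B stored transposed
c = torch.empty((M, N), device="cuda", dtype=torch.float32)

BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
grid = (M // BLOCK_M, N // BLOCK_N)
handle = matmul_bt_kernel[grid](
    a, b, c, M, N, K,
    a.stride(0), a.stride(1),
    b.stride(0), b.stride(1),
    c.stride(0), c.stride(1),
    BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
)

# The handle's .asm dict holds the intermediate representations
# (keys such as 'ttgir' and 'ptx').
ttgir = handle.asm["ttgir"]
ptx = handle.asm["ptx"]
print("insert_slice_async in TTGIR:", "insert_slice_async" in ttgir)
print("cp.async in PTX:            ", "cp.async" in ptx)
print("ldmatrix .trans in PTX:     ", "ldmatrix.sync.aligned.m8n8.x4.trans" in ptx)
```

A regression test would then assert that the async load op (or the corresponding `cp.async` in the PTX) is present for this transposed-operand case.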