allieculp opened this issue 2 years ago
@ThomasRaoux @manishucsd From today's meeting, looking for an update here.
From today's meeting: @manishucsd WIP, still trying a few different options
Hey @manishucsd can you update the issue here?
Improvements but not meeting target yet. WIP.
Summary

Using native sizes for the memory (ldmatrix/ldsm) and math (mma.sync) operations shows performance gains. We are now at 70us for the GEMM we are measuring (3456x1024x2048xf16). We are now cleaning up and refactoring the above changes from the big dummy PR into the following smaller pull requests:
(1) Support GEMM Pipelining without Epilogue Peeling:
(2) Breaking warp shapes to native math shapes: compiler/src/iree/compiler/Codegen/LLVMGPU/LLVMGPUTensorCoreVectorization.cpp
(3) compiler/src/iree/compiler/Codegen/Common/GPUPipelining.cpp
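To illustrate what "breaking warp shapes to native math shapes" means, the sketch below enumerates the native-shape mma.sync ops needed to cover one warp-level tile. On Ampere, the native f16 mma.sync shape is m16n8k16; the 32x32x16 warp tile here is a hypothetical example, not a value taken from the thread:

```python
# Sketch: decompose a warp-level GEMM tile into native Ampere mma.sync
# tiles (m16n8k16 for f16). The 32x32x16 warp shape is a hypothetical
# example for illustration only.
NATIVE_M, NATIVE_N, NATIVE_K = 16, 8, 16

def native_tiles(warp_m, warp_n, warp_k):
    """Yield the (m, n, k) offset of each native mma.sync issued per warp tile."""
    for m in range(0, warp_m, NATIVE_M):
        for n in range(0, warp_n, NATIVE_N):
            for k in range(0, warp_k, NATIVE_K):
                yield (m, n, k)

tiles = list(native_tiles(32, 32, 16))
print(len(tiles))  # 2 * 4 * 1 = 8 native mma.sync ops per warp tile
```

The vectorization pass plays an analogous role at the IR level: warp-sized vector contracts are unrolled into ops matching the shapes the hardware instruction actually accepts.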
(1) Support GEMM Pipelining without Epilogue Peeling is done and merged.
PR #10388 on supporting GEMM pipelining without epilogue peeling (Unpeeled Epilogue).
PR #10451 on enabling and analyzing unpeeled epilogue for WMMA-based GEMMs.
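The idea behind an unpeeled epilogue can be sketched with a toy loop: instead of emitting separate peeled iterations after the steady-state loop to drain the pipeline, the loop runs extra iterations and predicates (guards) the operations that would run past the end. This is an illustrative pattern, not IREE's actual implementation:

```python
# Sketch of software pipelining without epilogue peeling. Loads run
# STAGES-1 iterations ahead of computes; the drain happens inside the
# same loop via predication rather than in a peeled epilogue.
STAGES = 3  # number of pipeline stages (loads in flight)

def pipelined_sum(data):
    """Toy pipelined reduction: stage-0 'load', compute STAGES-1 behind."""
    n = len(data)
    buf = [None] * STAGES  # circular multi-buffer, one slot per stage
    total = 0
    # Run n + STAGES - 1 iterations; guard both the load and the compute
    # instead of peeling a separate drain loop.
    for i in range(n + STAGES - 1):
        if i < n:                      # predicated "load"
            buf[i % STAGES] = data[i]
        j = i - (STAGES - 1)
        if j >= 0:                     # predicated "compute"
            total += buf[j % STAGES]
    return total

print(pipelined_sum(list(range(10))))  # 45
```

Avoiding the peeled epilogue keeps code size down and lets the compiler keep a single loop body to schedule.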
(2) Breaking warp shapes to native math shapes is in progress.

Progress on bullet (2): the changes handling native sizes for nvgpu.mma.sync and nvgpu.ldmatrix are ready to start merging into (i) llvm/llvm-project and (ii) the iree-org/iree repo.
I am going to be spending some time on mma.sync GEMM e2e testing for F16 <= F16 * F16 + F16.
@manishucsd Is this still active?
We have pushed the changes to improve Ampere tensor core mma.sync performance for F16 and F32. We are now tracking performance issues and further improvements in smaller PRs. I think we can close this issue, as the major part of this epic-like / blanket issue's work has landed:
- F16 mma.sync: code changes merged and enabled by default
- F32 mma.sync: code changes merged, but not enabled by default (#13105)

cc: @julianwa @mattwalsh
Request description
From Nod.ai meeting 6/30, filing new issue