
Fix performance of mma sync #9689

Open · allieculp opened this issue 2 years ago

allieculp commented 2 years ago

Request description

From Nod.ai meeting 6/30, filing new issue

What component(s) does this issue relate to?

No response

Additional context

No response

allieculp commented 2 years ago

@ThomasRaoux @manishucsd From today's meeting, looking for an update here.

allieculp commented 2 years ago

From today's meeting: @manishucsd WIP, still trying a few different options

allieculp commented 2 years ago

Hey @manishucsd can you update the issue here?

erob710 commented 2 years ago

Improvements but not meeting target yet. WIP.

manishucsd commented 2 years ago

Summary

We are now cleaning up and refactoring the above changes from the big dummy PR into the following smaller pull requests:

(1) Support GEMM Pipelining without Epilogue Peeling (see the loop-structure sketch after this list):

(2) Breaking warp shapes to native math shapes:

(3) compiler/src/iree/compiler/Codegen/Common/GPUPipelining.cpp:
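For context, here is a minimal CUDA-style sketch of the loop structure that "without epilogue peeling" refers to in item (1). This is an illustration written for this issue, not IREE's generated code: the kernel name and the per-thread dot product are placeholders for the real tiled, multi-stage cp.async mainloop, but the shape of the loop is the same idea.

```cuda
// Illustration only: two-stage register pipelining with an unpeeled epilogue.
// Each thread computes one dot product; the real kernels pipeline shared-memory
// tiles feeding mma.sync, but the loop structure is analogous.
#include <cuda_runtime.h>

__global__ void dot_pipelined(const float* a, const float* b, float* out, int K) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const float* ap = a + tid * K;
  const float* bp = b + tid * K;

  // Prologue: load the first stage.
  float a_reg = (K > 0) ? ap[0] : 0.0f;
  float b_reg = (K > 0) ? bp[0] : 0.0f;

  float acc = 0.0f;
  for (int k = 0; k < K; ++k) {
    // Predicated prefetch of stage k+1. With a *peeled* epilogue the final
    // iteration would instead be duplicated after the loop without these
    // loads; keeping the predicate avoids copying the compute body.
    bool prefetch = (k + 1) < K;
    float a_next = prefetch ? ap[k + 1] : 0.0f;
    float b_next = prefetch ? bp[k + 1] : 0.0f;

    // Compute on the stage loaded in the previous iteration.
    acc += a_reg * b_reg;

    // Rotate pipeline registers.
    a_reg = a_next;
    b_reg = b_next;
  }
  out[tid] = acc;
}
```

The usual trade-off (stated here as a general assumption, not a claim about the PRs above) is smaller code and a simpler pipelining transform, at the cost of a guarded prefetch in the final iterations instead of a dedicated drain loop.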

manishucsd commented 2 years ago

(1) Support GEMM Pipelining without Epilogue Peeling is done and merged.

PR #10388 on supporting GEMM pipelining without epilogue peeling (Unpeeled Epilogue).

PR #10451 on enabling and analyzing unpeeled epilogue for WMMA-based GEMMs.

(2) Breaking warp shapes to native math shapes is in progress...
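As an illustration of what "breaking warp shapes to native math shapes" means here (the 64x64x16 warp tile below is an assumed example, not IREE's default configuration), the warp-level tile is decomposed into a grid of the native Ampere F16 mma.sync shape, m16n8k16, and one mma.sync is issued per grid point:

```cuda
// Illustration only: decompose an assumed 64x64x16 warp tile into native
// m16n8k16 mma.sync tiles. Compiles as plain host code (nvcc or g++).
#include <cstdio>

struct Shape { int m, n, k; };

int main() {
  const Shape native = {16, 8, 16};   // native Ampere F16 mma.sync shape
  const Shape warp   = {64, 64, 16};  // example warp tile (assumption)

  int m_tiles = warp.m / native.m;    // 4
  int n_tiles = warp.n / native.n;    // 8
  int k_tiles = warp.k / native.k;    // 1

  std::printf("warp %dx%dx%d -> %d native mma.sync ops per k-step\n",
              warp.m, warp.n, warp.k, m_tiles * n_tiles * k_tiles);

  // The generated code loops over this grid, issuing one
  // mma.sync.aligned.m16n8k16 per point and accumulating into
  // per-thread fragment registers.
  for (int mi = 0; mi < m_tiles; ++mi)
    for (int ni = 0; ni < n_tiles; ++ni)
      for (int ki = 0; ki < k_tiles; ++ki)
        std::printf("  tile m=[%d,%d) n=[%d,%d) k=[%d,%d)\n",
                    mi * native.m, (mi + 1) * native.m,
                    ni * native.n, (ni + 1) * native.n,
                    ki * native.k, (ki + 1) * native.k);
  return 0;
}
```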

manishucsd commented 2 years ago

Progress on bullet (2): handling of the native sizes for nvgpu.mma.sync and nvgpu.ldmatrix is ready to start merging into:

(i) llvm/llvm-project, and

(ii) iree-org/iree repo.

The above changes handle:

I am going to be spending some time on mma.sync GEMM e2e testing for F16 <= F16 * F16 + F16.
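For reference, here is a minimal device-side sketch of what the F16 <= F16 * F16 + F16 path bottoms out on: one Ampere mma.sync in the m16n8k16 shape with F16 accumulation, plus an ldmatrix load of a fragment from shared memory. The helper names are hypothetical and fragment layouts are elided; this only illustrates the PTX shapes and per-thread register counts, not code emitted by IREE.

```cuda
// Illustration only: PTX-level shapes behind the F16-accumulation GEMM path.
// Per-thread register counts for m16n8k16 f16: A = 4 x .b32 (8 halves),
// B = 2 x .b32, C and D = 2 x .b32.
#include <cuda_fp16.h>

__device__ void mma_m16n8k16_f16(unsigned const (&a)[4],
                                 unsigned const (&b)[2],
                                 unsigned const (&c)[2],
                                 unsigned (&d)[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(d[0]), "=r"(d[1])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "r"(c[0]), "r"(c[1]));
}

// ldmatrix: each thread supplies a shared-memory row address; four 8x8 b16
// tiles are distributed across the warp into 4 x .b32 registers per thread.
__device__ void ldmatrix_x4(unsigned (&regs)[4], const void* smem_ptr) {
  unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(regs[0]), "=r"(regs[1]), "=r"(regs[2]), "=r"(regs[3])
      : "r"(addr));
}
```

For the F32-accumulation variant of the same shape, the C/D fragments grow to 4 x .f32 registers per thread and the instruction becomes mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32.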

allieculp commented 1 year ago

@manishucsd Is this still active?

manishucsd commented 1 year ago

We have pushed the changes to improve Ampere Tensor Core mma.sync performance for F16 and F32. We are now tracking performance issues and further improvements in smaller PRs. I think we can close this issue, as the major part of this epic-like / blanket issue has been pushed in:

cc: @julianwa @mattwalsh