
Fix performance of mma sync #9689

Open · allieculp opened this issue 2 years ago

allieculp commented 2 years ago

Request description

From Nod.ai meeting 6/30, filing new issue

What component(s) does this issue relate to?

No response

Additional context

No response

allieculp commented 2 years ago

@ThomasRaoux @manishucsd From today's meeting, looking for an update here.

allieculp commented 2 years ago

From today's meeting: @manishucsd WIP, still trying a few different options

allieculp commented 2 years ago

Hey @manishucsd can you update the issue here?

erob710 commented 2 years ago

Improvements but not meeting target yet. WIP.

manishucsd commented 2 years ago

Summary

We are now cleaning up and refactoring the above changes from the big dummy PR into the following smaller pull requests:

(1) Support GEMM Pipelining without Epilogue Peeling (see the loop-structure sketch after this list):

(2) Breaking warp shapes to native math shapes:

(3) compiler/src/iree/compiler/Codegen/Common/GPUPipelining.cpp:
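For context, here is a minimal CUDA-style sketch of the loop structure that "without epilogue peeling" refers to in item (1). This is an illustration written for this issue, not IREE's generated code: the kernel name and the per-thread dot product are placeholders for the real tiled, multi-stage cp.async mainloop, but the shape of the loop is the same idea.

```cuda
// Illustration only: two-stage register pipelining with an unpeeled epilogue.
// Each thread computes one dot product; the real kernels pipeline shared-memory
// tiles feeding mma.sync, but the loop structure is analogous.
#include <cuda_runtime.h>

__global__ void dot_pipelined(const float* a, const float* b, float* out, int K) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const float* ap = a + tid * K;
  const float* bp = b + tid * K;

  // Prologue: load the first stage.
  float a_reg = (K > 0) ? ap[0] : 0.0f;
  float b_reg = (K > 0) ? bp[0] : 0.0f;

  float acc = 0.0f;
  for (int k = 0; k < K; ++k) {
    // Predicated prefetch of stage k+1. With a *peeled* epilogue the final
    // iteration would instead be duplicated after the loop without these
    // loads; keeping the predicate avoids copying the compute body.
    bool prefetch = (k + 1) < K;
    float a_next = prefetch ? ap[k + 1] : 0.0f;
    float b_next = prefetch ? bp[k + 1] : 0.0f;

    // Compute on the stage loaded in the previous iteration.
    acc += a_reg * b_reg;

    // Rotate pipeline registers.
    a_reg = a_next;
    b_reg = b_next;
  }
  out[tid] = acc;
}
```

The usual trade-off (stated here as a general assumption, not a claim about the PRs above) is smaller code and a simpler pipelining transform, at the cost of a guarded prefetch in the final iterations instead of a dedicated drain loop.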

manishucsd commented 2 years ago

(1) Support GEMM Pipelining without Epilogue Peeling is done and merged.

PR #10388 on supporting GEMM pipelining without epilogue peeling (Unpeeled Epilogue).

PR #10451 on enabling and analyzing unpeeled epilogue for WMMA-based GEMMs.

(2) Breaking warp shapes to native math shapes is in progress...
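As an illustration of what "breaking warp shapes to native math shapes" means here (the 64x64x16 warp tile below is an assumed example, not IREE's default configuration), the warp-level tile is decomposed into a grid of the native Ampere F16 mma.sync shape, m16n8k16, and one mma.sync is issued per grid point:

```cuda
// Illustration only: decompose an assumed 64x64x16 warp tile into native
// m16n8k16 mma.sync tiles. Compiles as plain host code (nvcc or g++).
#include <cstdio>

struct Shape { int m, n, k; };

int main() {
  const Shape native = {16, 8, 16};   // native Ampere F16 mma.sync shape
  const Shape warp   = {64, 64, 16};  // example warp tile (assumption)

  int m_tiles = warp.m / native.m;    // 4
  int n_tiles = warp.n / native.n;    // 8
  int k_tiles = warp.k / native.k;    // 1

  std::printf("warp %dx%dx%d -> %d native mma.sync ops per k-step\n",
              warp.m, warp.n, warp.k, m_tiles * n_tiles * k_tiles);

  // The generated code loops over this grid, issuing one
  // mma.sync.aligned.m16n8k16 per point and accumulating into
  // per-thread fragment registers.
  for (int mi = 0; mi < m_tiles; ++mi)
    for (int ni = 0; ni < n_tiles; ++ni)
      for (int ki = 0; ki < k_tiles; ++ki)
        std::printf("  tile m=[%d,%d) n=[%d,%d) k=[%d,%d)\n",
                    mi * native.m, (mi + 1) * native.m,
                    ni * native.n, (ni + 1) * native.n,
                    ki * native.k, (ki + 1) * native.k);
  return 0;
}
```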

manishucsd commented 2 years ago

Progress on bullet (2): handling of the native sizes for nvgpu.mma.sync and nvgpu.ldmatrix is ready to start merging into:

(i) llvm/llvm-project, and

(ii) iree-org/iree repo.

The above changes handle:

I am going to be spending some time on mma.sync GEMM e2e testing for F16 <= F16 * F16 + F16.
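For reference, here is a minimal device-side sketch of what the F16 <= F16 * F16 + F16 path bottoms out on: one Ampere mma.sync in the m16n8k16 shape with F16 accumulation, plus an ldmatrix load of a fragment from shared memory. The helper names are hypothetical and fragment layouts are elided; this only illustrates the PTX shapes and per-thread register counts, not code emitted by IREE.

```cuda
// Illustration only: PTX-level shapes behind the F16-accumulation GEMM path.
// Per-thread register counts for m16n8k16 f16: A = 4 x .b32 (8 halves),
// B = 2 x .b32, C and D = 2 x .b32.
#include <cuda_fp16.h>

__device__ void mma_m16n8k16_f16(unsigned const (&a)[4],
                                 unsigned const (&b)[2],
                                 unsigned const (&c)[2],
                                 unsigned (&d)[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(d[0]), "=r"(d[1])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "r"(c[0]), "r"(c[1]));
}

// ldmatrix: each thread supplies a shared-memory row address; four 8x8 b16
// tiles are distributed across the warp into 4 x .b32 registers per thread.
__device__ void ldmatrix_x4(unsigned (&regs)[4], const void* smem_ptr) {
  unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(regs[0]), "=r"(regs[1]), "=r"(regs[2]), "=r"(regs[3])
      : "r"(addr));
}
```

For the F32-accumulation variant of the same shape, the C/D fragments grow to 4 x .f32 registers per thread and the instruction becomes mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32.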

allieculp commented 1 year ago

@manishucsd Is this still active?

manishucsd commented 1 year ago

We have pushed the changes to improve Ampere Tensor Core mma.sync performance for F16 and F32. We are now tracking performance issues and further improvements in smaller PRs. I think we can close this issue, as the major part of this epic-like / blanket issue has been pushed in:

cc: @julianwa @mattwalsh