pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

Add Cublas FP8 + Rowwise Scaling Kernel #2786

Closed jwfromm closed 2 weeks ago

jwfromm commented 2 weeks ago

Summary: It was recently noted that in some cases an unfused cuBLAS FP8 matmul followed by a separate rowwise scaling kernel is more efficient than a fused kernel. This diff makes the unfused approach easier to use and adds a Triton kernel specifically for stand-alone rowwise scaling.

We do see that this unfused cuBLAS + scale approach outperforms CUTLASS for very large shapes. We could add a multi-kernel dispatch, just as we did for tensorwise scaling.
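To illustrate the idea, here is a minimal NumPy sketch of the unfused two-pass approach described above: the matmul runs first on the quantized operands (the cuBLAS step), and a separate rowwise-scaling pass applies the per-row and per-column scales afterwards (the role of the stand-alone Triton kernel). All names here are hypothetical; this is not the FBGEMM API.

```python
import numpy as np

def rowwise_scaled_matmul(a_q, b_q, a_scale, b_scale):
    """Unfused rowwise-scaled matmul sketch (hypothetical helper).

    a_q: (M, K) quantized (e.g. FP8-like) values for A
    b_q: (K, N) quantized values for B
    a_scale: (M,) per-row dequantization scales for A
    b_scale: (N,) per-column dequantization scales for B
    """
    # Pass 1: plain low-precision matmul, accumulated in higher precision.
    # In the real pipeline this is the cuBLAS FP8 GEMM.
    acc = a_q.astype(np.float32) @ b_q.astype(np.float32)

    # Pass 2: stand-alone rowwise scaling over the output.
    # Each element [m, n] is multiplied by a_scale[m] * b_scale[n];
    # this is the epilogue the Triton kernel performs separately.
    return acc * np.outer(a_scale, b_scale)
```

Because the scaling pass is a simple elementwise multiply over the output, it is cheap to run separately, which is why the unfused path can beat a fused kernel for large shapes where the GEMM itself dominates.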

Results Sheet

{F1717407393}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D58960601

netlify[bot] commented 2 weeks ago

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
Latest commit 858cb9f889e4febabdb849ae1ecc3d0b13914262
Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/667c3a9ea6f318000857ae3b
Deploy Preview https://deploy-preview-2786--pytorch-fbgemm-docs.netlify.app

facebook-github-bot commented 2 weeks ago

This pull request was exported from Phabricator. Differential Revision: D58960601

facebook-github-bot commented 2 weeks ago

This pull request has been merged in pytorch/FBGEMM@1b9a5bc91916e641e08b6f9e38ce52cf02d3e521.