**Closed** — jwfromm closed this pull request 2 weeks ago
| Name | Link |
|---|---|
| Latest commit | 858cb9f889e4febabdb849ae1ecc3d0b13914262 |
| Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/667c3a9ea6f318000857ae3b |
| Deploy Preview | https://deploy-preview-2786--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D58960601
This pull request has been merged in pytorch/FBGEMM@1b9a5bc91916e641e08b6f9e38ce52cf02d3e521.
Summary: It was recently noted that, in some cases, an unfused cuBLAS FP8 matmul followed by a separate rowwise scaling kernel is more efficient than a fused kernel. This diff makes the unfused approach easier to use and adds a Triton kernel specifically for stand-alone rowwise scaling.
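The unfused pattern described above can be sketched numerically: run the matmul on the raw quantized values, then apply the per-row and per-column scales in a separate elementwise pass. This is a minimal NumPy sketch of the math only, not FBGEMM's actual cuBLAS/Triton kernels; all names (`a_scale`, `b_scale`, etc.) are illustrative assumptions.

```python
import numpy as np

def unfused_matmul_rowwise_scale(a_q, b_q, a_scale, b_scale):
    """a_q: (M, K) quantized values, b_q: (K, N) quantized values.
    a_scale: per-row scales of A, shape (M,); b_scale: per-column scales of B, shape (N,)."""
    acc = a_q @ b_q  # stand-in for the cuBLAS FP8 matmul producing an unscaled accumulator
    # follow-up rowwise scaling pass: multiply by the outer product of the scale vectors
    return acc * (a_scale[:, None] * b_scale[None, :])

M, K, N = 4, 8, 3
rng = np.random.default_rng(0)
a_q = rng.integers(-4, 5, size=(M, K)).astype(np.float32)
b_q = rng.integers(-4, 5, size=(K, N)).astype(np.float32)
a_scale = rng.random(M).astype(np.float32) + 0.5
b_scale = rng.random(N).astype(np.float32) + 0.5

out = unfused_matmul_rowwise_scale(a_q, b_q, a_scale, b_scale)
# Reference: scale the inputs first, then matmul — mathematically identical,
# since the scales factor out of the inner sum over K.
ref = (a_q * a_scale[:, None]) @ (b_q * b_scale[None, :])
print(np.allclose(out, ref))
```

Because the row and column scales factor out of the reduction over K, deferring them to a post-matmul elementwise pass changes nothing mathematically; it only changes how the work is scheduled on the GPU.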
We do see that this unfused cuBLAS + scale approach outperforms the fused CUTLASS kernel for very large shapes. We could add a shape-based multi-kernel dispatch, just as we did for tensorwise scaling.
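A shape-based dispatch like the one suggested could look roughly as follows. This is a hypothetical sketch, not code from this PR: the function name, path labels, and the `large_shape_threshold` value are all made-up assumptions standing in for a tuned heuristic.

```python
def choose_fp8_gemm_path(m, n, k, large_shape_threshold=2**24):
    """Hypothetical dispatch: prefer the unfused cuBLAS + rowwise-scale path
    for very large problems, and a fused CUTLASS kernel otherwise.
    The threshold on output elements (m * n) is illustrative, not tuned."""
    if m * n >= large_shape_threshold:
        return "cublas_unfused_rowwise"
    return "cutlass_fused_rowwise"

print(choose_fp8_gemm_path(8192, 8192, 4096))  # large output -> unfused path
print(choose_fp8_gemm_path(256, 256, 1024))    # small output -> fused path
```

In practice such a threshold would be chosen from benchmark data like the results sheet below, and could also condition on K or the GPU architecture.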
Results Sheet
{F1717407393}
Reviewed By: jianyuh, jiawenliu64
Differential Revision: D58960601