**Closed** — jwfromm closed this pull request 2 weeks ago
| Name | Link |
|---|---|
| Latest commit | 858cb9f889e4febabdb849ae1ecc3d0b13914262 |
| Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/667c3a9ea6f318000857ae3b |
| Deploy Preview | https://deploy-preview-2786--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D58960601
This pull request has been merged in pytorch/FBGEMM@1b9a5bc91916e641e08b6f9e38ce52cf02d3e521.
Summary: It was recently noted that, in some cases, an unfused cuBLAS FP8 matmul followed by a separate rowwise scaling kernel is more efficient than a fused kernel. This diff makes the unfused approach easier to use and adds a Triton kernel specifically for stand-alone rowwise scaling.
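The unfused pattern described above can be sketched numerically: run the matmul on the raw quantized values, then apply the per-row and per-column scales in a separate elementwise pass. This is a minimal NumPy sketch of the math only, not FBGEMM's actual cuBLAS/Triton kernels; all names (`a_scale`, `b_scale`, etc.) are illustrative assumptions.

```python
import numpy as np

def unfused_matmul_rowwise_scale(a_q, b_q, a_scale, b_scale):
    """a_q: (M, K) quantized values, b_q: (K, N) quantized values.
    a_scale: per-row scales of A, shape (M,); b_scale: per-column scales of B, shape (N,)."""
    acc = a_q @ b_q  # stand-in for the cuBLAS FP8 matmul producing an unscaled accumulator
    # follow-up rowwise scaling pass: multiply by the outer product of the scale vectors
    return acc * (a_scale[:, None] * b_scale[None, :])

M, K, N = 4, 8, 3
rng = np.random.default_rng(0)
a_q = rng.integers(-4, 5, size=(M, K)).astype(np.float32)
b_q = rng.integers(-4, 5, size=(K, N)).astype(np.float32)
a_scale = rng.random(M).astype(np.float32) + 0.5
b_scale = rng.random(N).astype(np.float32) + 0.5

out = unfused_matmul_rowwise_scale(a_q, b_q, a_scale, b_scale)
# Reference: scale the inputs first, then matmul — mathematically identical,
# since the scales factor out of the inner sum over K.
ref = (a_q * a_scale[:, None]) @ (b_q * b_scale[None, :])
print(np.allclose(out, ref))
```

Because the row and column scales factor out of the reduction over K, deferring them to a post-matmul elementwise pass changes nothing mathematically; it only changes how the work is scheduled on the GPU.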
We do see that this unfused cuBLAS + scale approach outperforms the fused CUTLASS kernel for very large shapes. We could add a shape-based multi-kernel dispatch, just as we did for tensorwise scaling.
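A shape-based dispatch like the one suggested could look roughly as follows. This is a hypothetical sketch, not code from this PR: the function name, path labels, and the `large_shape_threshold` value are all made-up assumptions standing in for a tuned heuristic.

```python
def choose_fp8_gemm_path(m, n, k, large_shape_threshold=2**24):
    """Hypothetical dispatch: prefer the unfused cuBLAS + rowwise-scale path
    for very large problems, and a fused CUTLASS kernel otherwise.
    The threshold on output elements (m * n) is illustrative, not tuned."""
    if m * n >= large_shape_threshold:
        return "cublas_unfused_rowwise"
    return "cutlass_fused_rowwise"

print(choose_fp8_gemm_path(8192, 8192, 4096))  # large output -> unfused path
print(choose_fp8_gemm_path(256, 256, 1024))    # small output -> fused path
```

In practice such a threshold would be chosen from benchmark data like the results sheet below, and could also condition on K or the GPU architecture.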
Results Sheet
{F1717407393}
Reviewed By: jianyuh, jiawenliu64
Differential Revision: D58960601