Additional Tuning for Cutlass FP8 Rowwise Kernel

jwfromm commented 3 weeks ago

Summary: This diff implements additional tuning for the cutlass rowwise kernel on top of the recent output layout change. Our configurations now conform much more closely to the recommendations made by the cutlass tuner. To maintain performance across all shapes, I had to add one more kernel mode, which selects Cooperative kernels for medium shapes and PingPong kernels for large shapes.
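
As a rough illustration of the shape-based selection described above, here is a minimal Python sketch. It is not the code from this diff: the `KernelMode` enum, the `select_kernel_mode` helper, the `DEFAULT` fallback for small shapes, and both cutoff values are hypothetical and chosen only to show the idea of mapping medium shapes to Cooperative and large shapes to PingPong.

```python
from enum import Enum


class KernelMode(Enum):
    """Hypothetical labels for the kernel schedules discussed in the summary."""

    DEFAULT = "default"          # assumed fallback for small shapes (not stated in the PR)
    COOPERATIVE = "cooperative"  # medium shapes
    PINGPONG = "pingpong"        # large shapes


# Hypothetical cutoffs for illustration only; these are NOT the tuned values.
MEDIUM_MIN_M = 128
LARGE_MIN_DIM = 2048


def select_kernel_mode(m: int, n: int) -> KernelMode:
    """Pick a kernel mode from the output shape (M, N) of the GEMM."""
    if m <= 0 or n <= 0:
        raise ValueError("GEMM dimensions must be positive")
    # Large: both output dimensions reach the large cutoff.
    if min(m, n) >= LARGE_MIN_DIM:
        return KernelMode.PINGPONG
    # Medium: enough rows to move past the small-shape fallback.
    if m > MEDIUM_MIN_M:
        return KernelMode.COOPERATIVE
    return KernelMode.DEFAULT


if __name__ == "__main__":
    for shape in [(16, 4096), (512, 4096), (4096, 4096)]:
        print(shape, select_kernel_mode(*shape).value)
```

In the real kernel the cutoffs would come from benchmarking, and the decision may depend on K as well; the sketch only shows the structure of such a dispatch.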

Benchmarking results can be found here in the results tab. The names of the tuning configurations I tried are somewhat vague, but the final column corresponds to the configuration in this diff.

Differential Revision: D58848687

netlify[bot] commented 3 weeks ago

Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
Latest commit	833557ff34c5ae7900d3c8ca8c1163c57cc2b92d
Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6674c4b64fdaf50008e4f808
Deploy Preview	https://deploy-preview-2762--pytorch-fbgemm-docs.netlify.app


facebook-github-bot commented 3 weeks ago

This pull request was exported from Phabricator. Differential Revision: D58848687

pytorch / FBGEMM