Closed: jwfromm closed this pull request 1 month ago.
| Name | Link |
|---|---|
| Latest commit | 206da71dee306c3a1d2be2a8d8c73002e2d361e4 |
| Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66aacf93990dff000810201a |
| Deploy Preview | https://deploy-preview-2919--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D60535956
This pull request has been merged in pytorch/FBGEMM@f7fc750b1b3c856e65447641d467b70b983308d6.
Summary: This diff fixes an issue where our Triton fp8 quantize functions didn't properly handle non-contiguous inputs. Specifically, they wrote to the output tensor using the same strides as the input, even though the output is always allocated as contiguous. In some cases this caused the output to be unintentionally transposed.
As a result, non-contiguous inputs would run without error but produce silently transposed outputs. This was reported on GitHub here: https://github.com/pytorch/FBGEMM/issues/2713
Adding explicit output strides to the kernel resolves the issue.
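To illustrate the idea, here is a minimal Triton-style sketch (not the actual FBGEMM kernel; the kernel and wrapper names are hypothetical) of a kernel that receives the output strides as separate arguments and uses them for the store, instead of reusing the input strides:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _scale_copy_kernel(
    x_ptr, y_ptr,
    M, N,
    stride_xm, stride_xn,  # input strides (may describe a non-contiguous view)
    stride_ym, stride_yn,  # output strides (output is allocated contiguous)
    scale,
    BLOCK_N: tl.constexpr,
):
    # One program per row; columns handled as one vectorized block.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    # Load using the *input* strides so non-contiguous inputs are read correctly.
    x = tl.load(x_ptr + row * stride_xm + cols * stride_xn, mask=mask, other=0.0)
    # Store using the *output* strides. Reusing the input strides here is the
    # kind of bug described above and would silently transpose the result.
    tl.store(y_ptr + row * stride_ym + cols * stride_yn, x * scale, mask=mask)


def scale_copy(x: torch.Tensor, scale: float) -> torch.Tensor:
    M, N = x.shape
    y = torch.empty((M, N), device=x.device, dtype=x.dtype)  # always contiguous
    BLOCK_N = triton.next_power_of_2(N)
    _scale_copy_kernel[(M,)](
        x, y, M, N,
        x.stride(0), x.stride(1),
        y.stride(0), y.stride(1),
        scale, BLOCK_N=BLOCK_N,
    )
    return y
```

With this shape of signature, passing `x.t()` (a non-contiguous view) still produces a correctly laid-out contiguous output, since the store offsets come from the output tensor's own strides.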
I also found a small issue with D59248142 where scaling wasn't applied when the number of elements was smaller than the block size. This caused fp8_gemm_test to fail. I resolved it by extending the check for when to scale.
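For context on that second fix, here is a hedged sketch of the kind of guard involved (illustrative only; the actual kernel in D59248142 differs). If the scale is applied only inside a branch that requires a full block, an input with fewer elements than `BLOCK_SIZE` never takes that branch, so the condition has to be extended to cover the partial block as well:

```python
import triton
import triton.language as tl


@triton.jit
def _quantize_block_kernel(x_ptr, y_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Hypothetical kernel illustrating the guard; not the FBGEMM implementation.
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)

    # A guard of this form only scales blocks that are completely full, so an
    # input with n_elements < BLOCK_SIZE is never scaled:
    #   if (pid + 1) * BLOCK_SIZE <= n_elements:
    #       x = x * scale
    #
    # Extended guard: scale any block that contains at least one valid element.
    if pid * BLOCK_SIZE < n_elements:
        x = x * scale

    tl.store(y_ptr + offs, x, mask=mask)
```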
Reviewed By: jianyuh
Differential Revision: D60535956