Closed: jwfromm closed this pull request 1 month ago.
| Name | Link |
|---|---|
| Latest commit | 206da71dee306c3a1d2be2a8d8c73002e2d361e4 |
| Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66aacf93990dff000810201a |
| Deploy Preview | https://deploy-preview-2919--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D60535956
This pull request has been merged in pytorch/FBGEMM@f7fc750b1b3c856e65447641d467b70b983308d6.
Summary: This diff fixes an issue where our Triton fp8 quantize functions didn't properly handle non-contiguous inputs. Specifically, they wrote to the output tensor using the same strides as the input, even though the output is always allocated as contiguous. In some cases this caused the output to be unintentionally transposed.
As a result, non-contiguous inputs would run without error but produce silently transposed outputs. This was reported on GitHub here: https://github.com/pytorch/FBGEMM/issues/2713
Adding explicit output strides to the kernel resolves the issue.
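To illustrate the idea, here is a minimal Triton-style sketch (not the actual FBGEMM kernel; the kernel and wrapper names are hypothetical) of a kernel that receives the output strides as separate arguments and uses them for the store, instead of reusing the input strides:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _scale_copy_kernel(
    x_ptr, y_ptr,
    M, N,
    stride_xm, stride_xn,  # input strides (may describe a non-contiguous view)
    stride_ym, stride_yn,  # output strides (output is allocated contiguous)
    scale,
    BLOCK_N: tl.constexpr,
):
    # One program per row; columns handled as one vectorized block.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    # Load using the *input* strides so non-contiguous inputs are read correctly.
    x = tl.load(x_ptr + row * stride_xm + cols * stride_xn, mask=mask, other=0.0)
    # Store using the *output* strides. Reusing the input strides here is the
    # kind of bug described above and would silently transpose the result.
    tl.store(y_ptr + row * stride_ym + cols * stride_yn, x * scale, mask=mask)


def scale_copy(x: torch.Tensor, scale: float) -> torch.Tensor:
    M, N = x.shape
    y = torch.empty((M, N), device=x.device, dtype=x.dtype)  # always contiguous
    BLOCK_N = triton.next_power_of_2(N)
    _scale_copy_kernel[(M,)](
        x, y, M, N,
        x.stride(0), x.stride(1),
        y.stride(0), y.stride(1),
        scale, BLOCK_N=BLOCK_N,
    )
    return y
```

With this shape of signature, passing `x.t()` (a non-contiguous view) still produces a correctly laid-out contiguous output, since the store offsets come from the output tensor's own strides.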
I also found a small issue with D59248142 where scaling wasn't applied when the number of elements was smaller than the block size. This caused fp8_gemm_test to fail. I resolved it by extending the check for when to scale.
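For context on that second fix, here is a hedged sketch of the kind of guard involved (illustrative only; the actual kernel in D59248142 differs). If the scale is applied only inside a branch that requires a full block, an input with fewer elements than `BLOCK_SIZE` never takes that branch, so the condition has to be extended to cover the partial block as well:

```python
import triton
import triton.language as tl


@triton.jit
def _quantize_block_kernel(x_ptr, y_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Hypothetical kernel illustrating the guard; not the FBGEMM implementation.
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)

    # A guard of this form only scales blocks that are completely full, so an
    # input with n_elements < BLOCK_SIZE is never scaled:
    #   if (pid + 1) * BLOCK_SIZE <= n_elements:
    #       x = x * scale
    #
    # Extended guard: scale any block that contains at least one valid element.
    if pid * BLOCK_SIZE < n_elements:
        x = x * scale

    tl.store(y_ptr + offs, x, mask=mask)
```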
Reviewed By: jianyuh
Differential Revision: D60535956