davidberard98 opened 2 days ago
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Stack from ghstack (oldest at bottom):
* 2349
Overall context: Before looking further into the bf16xint4 matmul, I'm planning to look into a bf16xint16 matmul first. The idea is that it will be the same as a bf16xbf16 matmul, except that the second operand needs to be cast from int16 to bf16 inside the Triton kernel before the dot executes.
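For illustration, here's a minimal sketch of what that cast-inside-the-kernel approach could look like. This is not the PR's actual kernel; the names, block sizes, and launcher are assumptions, and masking is omitted (it assumes M, N, and K are divisible by the block sizes).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def bf16xint16_mm_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)      # bf16 tile
        b_i16 = tl.load(b_ptrs)  # int16 tile
        # The only difference from a plain bf16xbf16 matmul:
        # cast the int16 tile to bf16 before the dot.
        b = b_i16.to(tl.bfloat16)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.bfloat16))

def bf16xint16_mm(a, b, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    # a: (M, K) bf16, b: (K, N) int16 -> (M, N) bf16
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.bfloat16)
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    bf16xint16_mm_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K,
    )
    return c
```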
This PR is NOT fully functional yet; it's implemented this way to make review easier.
There are 3 kernels that will be benchmarked here:
Differential Revision: D59234085