Closed Dewei-Wang-sh closed 2 months ago
I think we need to understand why prefetching operand A causes a performance degradation. As a starting point, collect unitrace data to determine what metric degrades when you prefetch that operand (vs. not prefetching it). Are we running out of cache capacity? If so, we could tune how many iterations are prefetched (pipelined).
second that
@Dewei-Wang-sh, it would also be helpful to have a decent description of what is supposed to be done in the issue.
Thanks for the reminder, added.
Per our discussions, the feature "disable prefetch for small shapes" will be abandoned because it hurts average performance for some shapes and only benefits peak performance for a limited set of shapes. As for the `tl.dot` limitation, I have posted an issue to get the Triton community's opinion on it.
https://github.com/triton-lang/triton/blob/8e96b71b1b47a5d09f1cfb1826a16178f58dbef0/python/triton/language/semantic.py#L1373
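The restriction at the linked line is a minimum-shape check on `tl.dot` inputs. A simplified standalone sketch of the kind of check being discussed (the real check lives in `python/triton/language/semantic.py`; the bound and message here are illustrative, not copied from upstream):

```python
# Simplified sketch of a minimum-shape check like the one tl.dot
# enforces upstream. The bound and error message are illustrative.
MIN_DOT_SIZE = 16

def check_dot_shapes(m, n, k):
    """Reject GEMM shapes below the assumed hardware-friendly minimum."""
    if min(m, n, k) < MIN_DOT_SIZE:
        raise ValueError(
            f"tl.dot requires M, N, K >= {MIN_DOT_SIZE}, "
            f"got M={m}, N={n}, K={k}"
        )

check_dot_shapes(32, 32, 32)     # accepted
try:
    check_dot_shapes(8, 32, 32)  # small-M shape rejected
except ValueError as e:
    print(e)
```

Moving this check under a specific backend (rather than in the common semantic layer) is exactly what the PR below attempted, since the minimum legal shape is hardware-dependent.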
I issued a PR to try to move the restriction under the CUDA backend.
Like Intel, AMD has hardware-supported GEMM sizes that differ from CUDA's, and they have a PR working on this. The reviewer requested changes to the restriction along the lines of my PR, so my PR has been closed. See this comment.
Waiting for the PR above to land upstream.
For the GEMM-with-small-M case, we have special optimizations in our PoC branch, such as "skip prefetching the dot A operand" and "relax the upstream semantic limitation". We need to confirm these changes are necessary and get them merged.
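One way to see why skipping the A-operand prefetch is plausible for small M: per K-step, the A tile is `BLOCK_M x BLOCK_K` while the B tile is `BLOCK_K x BLOCK_N`, so for small M the A tile is a tiny fraction of the traffic and prefetching it mostly adds cache pressure. The block sizes and fp16 element width below are illustrative assumptions, not the PoC branch's actual configuration:

```python
# Back-of-envelope tile sizes per K-step for a small-M GEMM.
# Block sizes and element width (fp16) are illustrative assumptions.
def tile_bytes(rows, cols, elem_bytes=2):
    """Bytes occupied by one rows x cols tile of fp16 elements."""
    return rows * cols * elem_bytes

BLOCK_N, BLOCK_K = 256, 32
for block_m in (1, 4, 128):
    a = tile_bytes(block_m, BLOCK_K)   # A tile: BLOCK_M x BLOCK_K
    b = tile_bytes(BLOCK_K, BLOCK_N)   # B tile: BLOCK_K x BLOCK_N
    print(f"M={block_m:3d}: A tile {a:6d} B, B tile {b} B, A/B = {a / b:.4f}")
```

At M=1 the A tile is under 0.5% of the B tile with these numbers, so the prefetch saves little latency relative to the cache lines it occupies; at M=128 the ratio is much larger and prefetching A pays off.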