Closed Dewei-Wang-sh closed 2 months ago
I think we need to understand why prefetching operand A causes a performance degradation. As a starting point, collect unitrace data to determine what metric degrades when you prefetch that operand (vs. not prefetching it). Are we running out of cache capacity? If so, we could tune how many iterations are prefetched (pipelined).
second that
@Dewei-Wang-sh, it would also be helpful to have a decent description of what is supposed to be done in the issue.
Thanks for the reminder, added.
Per our discussions, the feature "disable prefetch for small shapes" will be abandoned because it hurts average performance for some shapes and only benefits peak performance for a limited set of shapes. As for the `tl.dot` limitation, I have posted an issue to get the Triton community's opinion on it.
https://github.com/triton-lang/triton/blob/8e96b71b1b47a5d09f1cfb1826a16178f58dbef0/python/triton/language/semantic.py#L1373
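The restriction at the linked line is a minimum-shape check on `tl.dot` inputs. A simplified standalone sketch of the kind of check being discussed (the real check lives in `python/triton/language/semantic.py`; the bound and message here are illustrative, not copied from upstream):

```python
# Simplified sketch of a minimum-shape check like the one tl.dot
# enforces upstream. The bound and error message are illustrative.
MIN_DOT_SIZE = 16

def check_dot_shapes(m, n, k):
    """Reject GEMM shapes below the assumed hardware-friendly minimum."""
    if min(m, n, k) < MIN_DOT_SIZE:
        raise ValueError(
            f"tl.dot requires M, N, K >= {MIN_DOT_SIZE}, "
            f"got M={m}, N={n}, K={k}"
        )

check_dot_shapes(32, 32, 32)     # accepted
try:
    check_dot_shapes(8, 32, 32)  # small-M shape rejected
except ValueError as e:
    print(e)
```

Moving this check under a specific backend (rather than in the common semantic layer) is exactly what the PR below attempted, since the minimum legal shape is hardware-dependent.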
I issued a PR to try to move the restriction under the CUDA backend.
Like Intel, AMD has hardware-supported GEMM sizes that differ from CUDA's, and they have a PR working on this. The reviewer requested changes to the restriction along the lines of my PR, so my PR has been closed. See this comment.
Waiting for the PR above to land upstream.
For the GEMM-with-small-M case, we have special optimizations in our PoC branch, such as "skip prefetching the dot A operand" and "relax the upstream semantic limitation". We need to confirm these changes are necessary and get them merged.
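One way to see why skipping the A-operand prefetch is plausible for small M: per K-step, the A tile is `BLOCK_M x BLOCK_K` while the B tile is `BLOCK_K x BLOCK_N`, so for small M the A tile is a tiny fraction of the traffic and prefetching it mostly adds cache pressure. The block sizes and fp16 element width below are illustrative assumptions, not the PoC branch's actual configuration:

```python
# Back-of-envelope tile sizes per K-step for a small-M GEMM.
# Block sizes and element width (fp16) are illustrative assumptions.
def tile_bytes(rows, cols, elem_bytes=2):
    """Bytes occupied by one rows x cols tile of fp16 elements."""
    return rows * cols * elem_bytes

BLOCK_N, BLOCK_K = 256, 32
for block_m in (1, 4, 128):
    a = tile_bytes(block_m, BLOCK_K)   # A tile: BLOCK_M x BLOCK_K
    b = tile_bytes(BLOCK_K, BLOCK_N)   # B tile: BLOCK_K x BLOCK_N
    print(f"M={block_m:3d}: A tile {a:6d} B, B tile {b} B, A/B = {a / b:.4f}")
```

At M=1 the A tile is under 0.5% of the B tile with these numbers, so the prefetch saves little latency relative to the cache lines it occupies; at M=128 the ratio is much larger and prefetching A pays off.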