`cuda::memcpy_async` generates very poor SASS

mrakgr / The-Spiral-Language

Functional language with intensional polymorphism and first-class staging.

Mozilla Public License 2.0


`cuda::memcpy_async` generates very poor SASS #24

Closed mrakgr closed 5 months ago

mrakgr commented 5 months ago

Godbolt link. Raw code.

The code generated for `cuda::memcpy_async` is atrocious. It makes the pipelines pointless.

As you can see here, in the CUDA code the offsets are statically known.

In the PTX they cannot be found.

In the SASS, only the destination operand has an immediate offset.

This issue is preventing me from completing the matrix multiply kernel and moving on to working on the rest of the machine learning library. Why is it happening?

mrakgr commented 5 months ago

The problem with not having immediate offsets is that register usage balloons, and because of that I cannot interleave the loads with the computation without running out of registers. This ruins the performance.

mrakgr commented 5 months ago

> LDGSTS allows 12 bits of signed immediate offsets in the source operand and 20 bits in the destination operand. The offsets you've highlighted won't fit in the instruction.

That answers it.
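For anyone who wants to check their own offsets against those limits, here is a minimal sketch in plain host-side C++. It assumes the immediates are two's-complement byte offsets, which the quoted answer does not spell out. The names `fits_signed`, `kSrcImmBits` and `kDstImmBits` are made up for illustration and are not part of any CUDA API. The offsets in the `static_assert` lines are hypothetical examples, not the ones from the Godbolt listing.

```cpp
#include <cstdint>

// Bit widths quoted in the thread for LDGSTS immediate offsets.
constexpr unsigned kSrcImmBits = 12;  // source (global) operand
constexpr unsigned kDstImmBits = 20;  // destination (shared) operand

// True if `offset` is representable as a two's-complement signed integer
// of `bits` bits, i.e. lies in [-2^(bits-1), 2^(bits-1) - 1].
// Assumption: the immediate is a plain signed byte offset.
constexpr bool fits_signed(std::int64_t offset, unsigned bits) {
    return offset >= -(std::int64_t{1} << (bits - 1)) &&
           offset <= (std::int64_t{1} << (bits - 1)) - 1;
}

// Hypothetical offsets: under the assumption above, a 12-bit signed field
// covers -2048..2047 bytes, so anything past that needs address arithmetic
// in a register instead of an immediate.
static_assert(fits_signed(2047, kSrcImmBits), "largest source immediate");
static_assert(!fits_signed(2048, kSrcImmBits), "one past the source limit");
static_assert(fits_signed(2048, kDstImmBits), "same offset fits the destination");
```

Under that assumption the source operand only reaches 2047 bytes forward, while the destination reaches 524287, which would be consistent with only the destination getting an immediate in the SASS.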