Closed mrakgr closed 5 months ago
The problem with not having immediate offsets is that register usage balloons, and because of that I cannot interleave the loads with the computation without running out of registers. This ruins performance.
LDGSTS allows 12 bits of signed immediate offsets in the source operand and 20 bits in the destination operand. The offsets you've highlighted won't fit in the instruction.
That answers it.
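To make the limits quoted above concrete, here is a minimal sketch of the range check they imply. It assumes the immediates are two's-complement fields. The reply states that the 12-bit source field is signed; treating the 20-bit destination field as signed too is an assumption. `fits_signed`, `kSrcImmBits` and `kDstImmBits` are hypothetical names for illustration, not part of any NVIDIA API.

```cpp
#include <cstdint>

// True if `offset` is representable as a two's-complement signed field
// that is `bits` wide, i.e. lies in [-2^(bits-1), 2^(bits-1) - 1].
constexpr bool fits_signed(std::int64_t offset, int bits) {
    return offset >= -(std::int64_t{1} << (bits - 1)) &&
           offset <= (std::int64_t{1} << (bits - 1)) - 1;
}

// Widths taken from the reply above. Hypothetical constants.
constexpr int kSrcImmBits = 12;  // source operand, stated as signed
constexpr int kDstImmBits = 20;  // destination operand, ASSUMED signed

// A 12-bit signed field covers -2048..2047.
static_assert(fits_signed(2047, kSrcImmBits), "largest positive source offset");
static_assert(!fits_signed(2048, kSrcImmBits), "one past the source range");
static_assert(fits_signed(-2048, kSrcImmBits), "most negative source offset");

// The same offset that overflows the source field still fits a 20-bit field.
static_assert(fits_signed(2048, kDstImmBits), "fits the wider destination field");
```

Under these assumptions, any statically known source offset outside -2048..2047 has to be materialized in a register, which would be consistent with only the destination operand carrying an immediate in the SASS.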
Godbolt link. Raw code.
The code generated for the aforementioned function is atrocious. It makes the pipelines pointless.
As you can see here, in the CUDA code the offsets are statically known.
In the PTX they cannot be found.
In the SASS, only the destination operand has an immediate offset.
This issue is preventing me from completing the matrix multiply kernel and moving on to working on the rest of the machine learning library. Why is it happening?