Open alexbaden opened 6 months ago
Where does it come from?
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 98304, Hardware limit: 65536
Hardware limit: 65536 looks strange; the A770 should have more than that. Where did you run the tests?
That limit is shared memory: the A770 only has 64K, while PVC has 128K. For some of these tests we will need to reduce the size or skip based on shape, but for others we will presumably need DPAS or some other codegen, because the unrolled mma puts too much register pressure on the GPU.
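A rough sketch of what a shape-based skip could look like, in plain Python. The helper names and the footprint formula are assumptions for illustration only: the formula pretends both dot operands are staged whole in shared memory, which is not necessarily how the compiler sizes its allocations. The limits are the 64K and 128K figures quoted above.

```python
# Shared-memory limits quoted in this thread, in bytes.
# The dictionary keys are illustrative labels, not real device query strings.
SHARED_MEM_LIMIT = {"A770": 64 * 1024, "PVC": 128 * 1024}


def estimated_shared_bytes(m, n, k, elem_size):
    """Hypothetical estimate: A is (m x k), B is (k x n), both held in shared memory.

    This is a simplifying assumption, not the compiler's actual accounting.
    """
    return (m * k + k * n) * elem_size


def should_skip(m, n, k, elem_size, device):
    """Return True if the estimated footprint exceeds the device's limit."""
    return estimated_shared_bytes(m, n, k, elem_size) > SHARED_MEM_LIMIT[device]


if __name__ == "__main__":
    # Shape chosen only so the estimate equals the 98304 from the error message;
    # the shape of the failing test is not stated in this thread.
    shape = (128, 128, 96)
    for device in ("A770", "PVC"):
        print(device, should_skip(*shape, elem_size=4, device=device))
```

In a test suite the result of should_skip would typically feed a pytest.skip call, so that shapes which cannot fit on the A770 are skipped there but still run on PVC.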
A770 is not using DPAS for any of the test_dot kernels; they are all fully unrolled scalar multiplies and adds. This results in very long kernels and out-of-resources errors (mostly running out of registers, we think). To fix this, we can either try to use DPAS on A770 (#991) or try not unrolling the loop to reduce shared memory use and register pressure, though the latter option may be unacceptably slow.
These are likely related to issues running fused multiply-add without DPAS.