mrakgr / The-Spiral-Language

Functional language with intensional polymorphism and first-class staging.

The `warp_sync_load` bug info thread #23

Closed: mrakgr closed this issue 3 months ago.

mrakgr commented 5 months ago

To find out what's going on with `warp_sync_load`, first we need to isolate the calls to it.

[screenshot: isolating the `warp_sync_load` calls]

Once the loops are fully unrolled, the conversions will result in a lot of noise, so we'll comment them out.

[screenshot: the conversions commented out]

To make the issue as easy as possible to detect, I recommend doing this for the `compute_tf32gemm` kernel.

[screenshot: the `compute_tf32gemm` kernel]

Then compile it and run it in Nsight Compute. In the source view you should see:

[screenshots: Nsight Compute source view]

If you look at the SASS assembly for the body of the loop, you'll see a lot of integer arithmetic going on, and none of the shared loads have immediate offsets. My hunch is that some loops aren't being unrolled. All of these poorly generated shared loads come from loading the A and B matrices; that is what you want to be on the lookout for.
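
To make the pattern concrete, here is a minimal standalone sketch (a hypothetical kernel, not code from this repo) of how unrolling decides whether the shared loads get immediate offsets. Compile it and dump the SASS to see both variants:

```cuda
// Minimal sketch, assuming sm_89 and a 256-thread block. Build with
// `nvcc -arch=sm_89 -cubin demo.cu` and inspect via `cuobjdump -sass demo.cubin`.
__global__ void shared_load_demo(const float *gmem, float *out) {
    __shared__ float a[256];
    int tid = threadIdx.x;
    a[tid] = gmem[tid];
    __syncthreads();

    float acc = 0.0f;
    int lane = tid % 32;
    // With the pragma, the 8 iterations unroll and every LDS gets a
    // compile-time immediate offset, e.g. `LDS R4, [R2+0x80]`.
    // Remove it (or make the trip count a runtime value) and the address
    // is recomputed with integer arithmetic on every iteration, which is
    // the bad pattern described above.
    #pragma unroll
    for (int i = 0; i < 8; i++)
        acc += a[lane + i * 32];
    out[tid] = acc;
}
```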

In fact, I've written my own loading functions to work around this issue. Here is how they look in the output of the latest version of the matrix multiply kernel.

[screenshots: shared loads in the latest matrix multiply kernel output]

This is what the main loop's shared loads should look like: they have efficient immediate offsets instead of using integer arithmetic to calculate the load addresses. The integer arithmetic and predicate sets that remain are due to my leaving in the tf32 conversions; otherwise the loop body would be nothing but shared loads and HMMA instructions.
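
For context, here is roughly what a tf32 wmma tile with explicit conversions looks like in plain CUDA. This is a sketch using the standard `nvcuda::wmma` API, not the actual generated code; the 16x16x8 shape, leading dimensions, and names are illustrative:

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch of a single tf32 m16n16k8 tile (requires sm_80+). The
// per-element __float_to_tf32 rounding is the conversion work referred
// to above; mma_sync is what lowers to the HMMA instructions.
__device__ void tf32_mma_tile(const float *a_smem, const float *b_smem,
                              float *c_out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32,
                   wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32,
                   wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a_smem, 8);  // 16x8 tile, ld = 8
    wmma::load_matrix_sync(b_frag, b_smem, 8);  // 8x16 tile, ld = 8
    // Round each fp32 element down to tf32 precision before the MMA.
    for (int i = 0; i < a_frag.num_elements; i++)
        a_frag.x[i] = wmma::__float_to_tf32(a_frag.x[i]);
    for (int i = 0; i < b_frag.num_elements; i++)
        b_frag.x[i] = wmma::__float_to_tf32(b_frag.x[i]);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c_out, c_frag, 16, wmma::mem_row_major);
}
```

Commenting out the two conversion loops is the analogue of what I suggested doing to the generated kernel earlier in this thread.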

mrakgr commented 5 months ago

Also, make sure you select the right kernel before running the example:

[screenshot: selecting the kernel in Nsight Compute]

mrakgr commented 5 months ago

tensor14.zip

Here is the example for the above report. You can run the script yourself; the only dependencies are CuPy and CTK 12.3.

mrakgr commented 5 months ago

tensor14.zip

Here is the example that produced the above report. It's a lot worse than I thought it would be. Good thing I wrote my own loading functions.

Incidentally, while I was writing the matrix multiplication kernel, I made a version that uses the async loading functionality, but I couldn't get it to perform as well as the synchronous version you can see a few commits back on the master branch. If you have any tips on how to improve tensor13, I'd appreciate it.
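
For anyone reading along, the async loading in question is the cp.async path. A minimal single-stage sketch of the primitives (illustrative names, assuming sm_80+ and a 256-thread block; this is not the tensor13 code):

```cuda
#include <cuda_pipeline.h>

// Each thread asynchronously copies one float from global to shared
// memory (lowers to cp.async on sm_80+), then the batch is committed
// and waited on before use.
__global__ void async_load_demo(const float *gmem, float *out) {
    __shared__ float smem[256];
    int tid = threadIdx.x;

    __pipeline_memcpy_async(&smem[tid], &gmem[tid], sizeof(float));
    __pipeline_commit();
    __pipeline_wait_prior(0);  // wait until the copy above has landed
    __syncthreads();

    out[tid] = smem[tid] * 2.0f;
}
```

A single-stage pipeline like this mostly just adds overhead; cp.async generally only pays off when the shared tiles are multi-buffered so that the copy of the next tile overlaps with the MMAs on the current one.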

mrakgr commented 5 months ago

For reference, here is my machine:

```
PS D:\Users\Marko\Source\Repos\The Spiral Language\Spiral Compilation Tests> python -c "import cupy; cupy.show_config()"
OS                           : Windows-10-10.0.22631-SP0
Python Version               : 3.11.6
CuPy Version                 : 13.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.1
SciPy Version                : None
Cython Build Version         : 0.29.36
Cython Runtime Version       : None
CUDA Root                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3
nvcc PATH                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcc.EXE
CUDA Build Version           : 12020
CUDA Driver Version          : 12030
CUDA Runtime Version         : 12020 (linked to CuPy) / 12030 (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11012
cuRAND Version               : 10304
cuSOLVER Version             : (11, 5, 4)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 3)
Thrust Version               : 200200
CUB Build Version            : 200200
Jitify Build Version         : b0269c8
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GeForce RTX 4060
Device 0 Compute Capability  : 89
Device 0 PCI Bus ID          : 0000:01:00.0
```

mrakgr commented 5 months ago

Here is the link to the playlist I was working on, in case you want to see how the matmult kernel was built from the ground up.