Also, make sure you select the right kernel before running the example:
Here is the example for the above report. You can try running the script yourself; the only dependencies are CuPy and CTK 12.3.
Here is the example that produced the above report. It's a lot worse than I thought it would be. Good thing I wrote my own loading functions.
Incidentally, while I was writing the matrix multiplication kernel, I made a version which uses async loading functionality, but I couldn't get it to perform as well as the synchronous version you see a few commits ago on the master branch. If you have any tips for how to improve tensor13 I'd appreciate it.
For reference, here is my machine:
PS D:\Users\Marko\Source\Repos\The Spiral Language\Spiral Compilation Tests> python -c "import cupy; cupy.show_config()"
OS : Windows-10-10.0.22631-SP0
Python Version : 3.11.6
CuPy Version : 13.0.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.26.1
SciPy Version : None
Cython Build Version : 0.29.36
Cython Runtime Version : None
CUDA Root : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3
nvcc PATH : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcc.EXE
CUDA Build Version : 12020
CUDA Driver Version : 12030
CUDA Runtime Version : 12020 (linked to CuPy) / 12030 (locally installed)
cuBLAS Version : (available)
cuFFT Version : 11012
cuRAND Version : 10304
cuSOLVER Version : (11, 5, 4)
cuSPARSE Version : (available)
NVRTC Version : (12, 3)
Thrust Version : 200200
CUB Build Version : 200200
Jitify Build Version : b0269c8
cuDNN Build Version : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version : None
NCCL Runtime Version : None
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : NVIDIA GeForce RTX 4060
Device 0 Compute Capability : 89
Device 0 PCI Bus ID : 0000:01:00.0
To track down the issue with `warp_sync_load`, we first need to isolate those loads. Once the loops are fully unrolled, the conversions will result in a lot of noise, so we'll comment them out.
To make it as easy as possible to detect, I recommend doing that for the `compute_tf32gemm` kernel. Then just compile the thing and run it in Nsight Compute. In the source you should see:
If you look at the SASS assembly for the body of the loop, you should see a lot of integer arithmetic going on, and none of the shared loads have immediate offsets. My hunch is that some loops aren't being unrolled. All of these poorly done shared loads come from loading the A and B matrices. That is what you want to be on the lookout for.
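To make the difference concrete, here is a minimal host-side C++ sketch of the two addressing patterns. It is not the kernel code from the report: `frag_offset`, `load_runtime` and `load_unrolled` are hypothetical names, and the tile shape is made up for illustration. In the first version the element offset depends on runtime loop bounds, so it has to be computed with integer arithmetic on every iteration. In the second version every offset is a compile-time constant, which is the situation where a compiler is able to fold the offset into the load instruction as an immediate.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: offset of fragment element i inside a row-major tile
// whose rows are `Stride` elements apart. Usable at compile time.
template <std::size_t Cols, std::size_t Stride>
constexpr std::size_t frag_offset(std::size_t i) {
    return (i / Cols) * Stride + (i % Cols);
}

// Runtime bounds: the address of each load is recomputed from r, c and stride.
inline void load_runtime(const float* tile, std::size_t rows, std::size_t cols,
                         std::size_t stride, float* out) {
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r * cols + c] = tile[r * stride + c];
}

// Fully unrolled via an index pack: each load is tile[<constant>].
template <std::size_t Cols, std::size_t Stride, std::size_t... I>
inline void load_unrolled_impl(const float* tile, float* out,
                               std::index_sequence<I...>) {
    // Pack expansion inside a braced initializer emits one assignment per
    // element, evaluated in order (works from C++14 on).
    int expand[] = {0, ((out[I] = tile[frag_offset<Cols, Stride>(I)]), 0)...};
    (void)expand;
}

template <std::size_t Rows, std::size_t Cols, std::size_t Stride>
inline void load_unrolled(const float* tile, float* out) {
    load_unrolled_impl<Cols, Stride>(tile, out,
                                     std::make_index_sequence<Rows * Cols>{});
}

// The offsets really are compile-time constants:
// element 5 of a 4-wide fragment sits at row 1, column 1, so 1 * 20 + 1.
static_assert(frag_offset<4, 20>(5) == 21, "offset must fold at compile time");
```

Both functions copy the same elements; only the way the addresses are formed differs. In device code the same effect is usually pursued with constant trip counts and `#pragma unroll`, and whether the offsets actually end up as immediates still has to be confirmed in the SASS.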
As a matter of fact, I've written my own loading functions to work around this issue and they look like this in the output. Here is how it looks in the latest version of the matrix multiply kernel.
This is what the shared loads in the main loop should look like. They have efficient immediate offsets instead of using integer arithmetic to calculate the load addresses. The remaining integer arithmetic and predicate sets are due to me leaving in the tf32 conversions; otherwise the loop body would just be shared loads and HMMA instructions.