mrakgr / The-Spiral-Language

Functional language with intensional polymorphism and first-class staging.
Mozilla Public License 2.0

Adding `--define-macro=NDEBUG` to disable the asserts drastically increases the register occupancy of the matmul kernel #22

Closed mrakgr closed 6 months ago

mrakgr commented 7 months ago

Link.

I am working on matrix multiplication here and studying register occupancy. Adding `options.append('--define-macro=NDEBUG')` drastically increases register usage, from 40 to 56 registers per thread. I'd expect that leaving in side-effecting operations like the asserts would have the opposite effect!

Why is this happening?

mrakgr commented 7 months ago

The linked script has a dependency on this file. If you don't want to clone the whole repo, just put it in the same folder.

You also need to have CuPy 12.3 installed (get it via `pip install cupy-cuda12x`), along with CUDA Toolkit (CTK) 12.3. Unless I missed something, that should enable you to run the script. After you do, you'll see something like:

PowerShell 7.4.1
PS D:\Users\Marko\Source\Repos\The Spiral Language\Spiral Compilation Tests>  & 'C:\Users\mrakg\AppData\Local\Programs\Python\Python311\python.exe' 'c:\Users\mrakg\.vscode\extensions\ms-python.python-2024.0.1\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher' '54527' '--' 'D:\Users\Marko\Source\Repos\The Spiral Language\Spiral Compilation Tests\cuda_experiments\tensor2\matmul.py' 
Maximum number of blocks per multi processor is:                      24
The minimum due to the number of threads per multiprocessor is:       12
The minimum due to the number of registers per multi processor is:    12
The maximum number of registers per thread is:                        42
The amount of registers per thread is:                                40
The minimum due to the amount of shared memory per multiprocessor is: 6
The amount of shared memory per multiprocessor is:                    102400
The amount of shared memory per block used is:                        16384
The true minimum is:                                                  6
0.021868706

The number of registers per thread is key here. I am running it with the line commented out (`# options.append('--define-macro=NDEBUG')`), so the asserts are still enabled and I am getting 40. If I uncommented it, thereby turning off the asserts, I'd get 56 instead.
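For anyone who wants to check the printed limits by hand, here is a minimal sketch of the arithmetic in plain Python. It is not the code from the linked script. The block size of 128 threads and the per-multiprocessor totals of 1536 threads and 65536 registers are assumptions inferred from the printed numbers, and the function name `occupancy_limits` is hypothetical. The model is deliberately simplified: it ignores register allocation granularity and any shared memory the runtime reserves per block.

```python
def occupancy_limits(regs_per_thread,
                     threads_per_block=128,    # assumed: inferred from 1536 // 12
                     smem_per_block=16384,     # from the printed output
                     max_blocks_per_sm=24,     # from the printed output
                     max_threads_per_sm=1536,  # assumed value for compute capability 8.9
                     regs_per_sm=65536,        # assumed value for compute capability 8.9
                     smem_per_sm=102400):      # from the printed output
    """Simplified per-multiprocessor block limits (integer division throughout)."""
    by_threads = max_threads_per_sm // threads_per_block
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    by_smem = smem_per_sm // smem_per_block
    # Largest register count per thread that still allows `by_threads` blocks.
    max_regs_per_thread = regs_per_sm // (by_threads * threads_per_block)
    return {
        "by_threads": by_threads,
        "by_regs": by_regs,
        "by_smem": by_smem,
        "max_regs_per_thread": max_regs_per_thread,
        "true_min": min(max_blocks_per_sm, by_threads, by_regs, by_smem),
    }


if __name__ == "__main__":
    for regs in (40, 56):
        print(regs, occupancy_limits(regs))
```

Under these assumptions, 40 registers per thread gives a register limit of 12 blocks and 56 gives 9, but in both cases shared memory is the binding limit at 6 blocks per multiprocessor.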

mrakgr commented 7 months ago

Here is some system info just in case.

PS D:\Users\Marko\Source\Repos\The Spiral Language\Spiral Compilation Tests> python -c "import cupy; cupy.show_config()" 
OS                           : Windows-10-10.0.22631-SP0
Python Version               : 3.11.6
CuPy Version                 : 13.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.1
SciPy Version                : None
Cython Build Version         : 0.29.36
Cython Runtime Version       : None
CUDA Root                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3
nvcc PATH                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcc.EXE
CUDA Build Version           : 12020
CUDA Driver Version          : 12030
CUDA Runtime Version         : 12020 (linked to CuPy) / 12030 (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11012
cuRAND Version               : 10304
cuSOLVER Version             : (11, 5, 4)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 3)
Thrust Version               : 200200
CUB Build Version            : 200200
Jitify Build Version         : b0269c8
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GeForce RTX 4060
Device 0 Compute Capability  : 89
Device 0 PCI Bus ID          : 0000:01:00.0

mrakgr commented 7 months ago

I've tried it out in WSL, and the results I am getting there are different. Regardless of whether I have the asserts enabled, the register use is 56 in both cases. This is different from Windows, where enabling asserts lowers the register use to 40 per thread.

mrakgr commented 6 months ago

Voting to close this. Now that I have some experience studying SASS, I don't see a lower register count as necessarily being relevant to performance. I have some other issues that will need opening, so there is no point in spending time on this one.

It could just be due to some loops not being unrolled.

By the way, on the CUDA bug report page the submit comment button is missing.

mrakgr commented 5 months ago

image

I have Win 11, so I am not sure why CuPy says Win 10.

mrakgr commented 5 months ago

image

I've been meaning to report this, so let me do it here. The submit comment button is off the screen on this page, making it unusable.