The Cuda compiler is slow to compile the NL Holdem game

I've just finished the HU NL Holdem game that runs on the GPU as a part of the Spiral playlist. The screencast showing the implementation should come out in mid-June 2024, most likely before you investigate this issue.

While working on the game, I found that the NVRTC (and NVCC) compilers are incredibly slow to compile it. It takes as much as 20s for the optimized version and 50s for the non-optimized one, and the primary culprit behind that large discrepancy between the examples is how long the compiler takes to compile the serializers.

Here are a few test cases:

(19s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1.py
(24s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1_bitfield.py

(37s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2.py
(51s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2_bitfield.py

Each of those scripts runs the NL Holdem game on the GPU for a single step. The first two examples do less serialization of cards and pass them back and forth as raw integers between the kernel and main memory, while the last two examples use discriminated union types to represent the cards and need to convert between them and the raw int representation.
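To make the difference between the two representations concrete, here is a minimal Python sketch of the kind of round trip the serializers perform. It is only an illustration: the 0..51 encoding, the rank and suit ordering, and the names `Card`, `Suit`, `card_from_int` and `card_to_int` are my assumptions here, not what the Spiral compiler actually generates. In the generated Cuda code the same conversion is done over union tags rather than Python objects.

```python
from dataclasses import dataclass
from enum import IntEnum


class Suit(IntEnum):
    # Hypothetical suit ordering, chosen only for this sketch.
    CLUBS = 0
    DIAMONDS = 1
    HEARTS = 2
    SPADES = 3


@dataclass(frozen=True)
class Card:
    rank: int  # 0..12; assumed ordering where 0 is a Two and 12 is an Ace
    suit: Suit


def card_from_int(i: int) -> Card:
    """Deserialize a raw integer in 0..51 into the structured card."""
    if not 0 <= i < 52:
        raise ValueError(f"card index out of range: {i}")
    return Card(rank=i % 13, suit=Suit(i // 13))


def card_to_int(card: Card) -> int:
    """Serialize the structured card back into its raw integer form."""
    return int(card.suit) * 13 + card.rank
```

The first two test cases skip most of this kind of conversion and move the raw integers directly, which is why they have fewer serializers to compile.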

I also tested whether using bit fields to specify the size of the union type's tag affects compilation time, and it does, significantly: the bit field variants above take 24s and 51s, versus 19s and 37s without them. As a result, I've modified the Spiral compiler to no longer use bit fields for tags.
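For readers unfamiliar with the distinction, the following Python sketch emits two variants of a tagged union declaration as Cuda source text, one with a bit field tag and one with a plain tag. The struct layout, the `int` payloads and the helper name `union_struct_source` are hypothetical and only meant to show what "using a bit field for the tag" refers to; the code Spiral really generates is in the linked test files.

```python
def union_struct_source(name: str, n_cases: int, use_bitfield: bool) -> str:
    """Return Cuda source for a tagged union with n_cases integer payloads.

    Hypothetical layout for illustration only. Assumes n_cases <= 256 so
    that the tag fits into an unsigned char.
    """
    # Smallest number of bits that can distinguish n_cases different tags.
    tag_bits = max(1, (n_cases - 1).bit_length())
    if use_bitfield:
        tag_decl = f"unsigned char tag : {tag_bits};"
    else:
        tag_decl = "unsigned char tag;"
    cases = "\n".join(f"        int case{i};" for i in range(n_cases))
    return (
        f"struct {name} {{\n"
        f"    union {{\n{cases}\n    }};\n"
        f"    {tag_decl}\n"
        f"}};\n"
    )
```

Calling `union_struct_source("Suit", 4, True)` produces a declaration whose tag is `unsigned char tag : 2;`, while passing `False` produces a plain `unsigned char tag;`.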

I don't know why the Cuda compiler is so slow to compile this game, but it might be that inefficiencies in the compiler are leading to exponential blowups in compilation time. If it takes this long to compile NL Holdem, I cannot imagine how long it would take to compile something bigger. If you ever get around to optimizing the compilation speed of the Cuda compiler, please keep these examples in mind. I bet you don't get many people writing games directly on the GPU, so they might be good targets.

To run the scripts, you are going to have to install CuPy. Here is my system info.

PS D:\Users\Marko\Source\Repos\The Spiral Language\VS Code Plugin> python -c "import cupy; cupy.show_config()"
OS                           : Windows-10-10.0.22631-SP0
Python Version               : 3.11.6
CuPy Version                 : 13.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.1
SciPy Version                : None
Cython Build Version         : 0.29.36
Cython Runtime Version       : None
CUDA Root                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3
nvcc PATH                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcc.EXE
CUDA Build Version           : 12020
CUDA Driver Version          : 12030
CUDA Runtime Version         : 12020 (linked to CuPy) / 12030 (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11012
cuRAND Version               : 10304
cuSOLVER Version             : (11, 5, 4)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 3)
Thrust Version               : 200200
CUB Build Version            : 200200
Jitify Build Version         : b0269c8
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GeForce RTX 4060
Device 0 Compute Capability  : 89
Device 0 PCI Bus ID          : 0000:01:00.0

In the video, I run the game itself by pressing Ctrl+Shift+B, which activates the Terminal -> Run Build Task command. Before that, you'll have to run npm install in the ml3 folder to install the relevant Node packages.
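If you would rather measure the compilation time in isolation, something along these lines should work. This is a rough sketch rather than what the test scripts do: `timed` and `compile_with_cupy` are helper names I made up, and I am assuming the kernel source is available as a Python string. Keep in mind that CuPy caches compiled kernels on disk, so a second run of the same source can look much faster than the first one.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return its result with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compile_with_cupy(source: str, backend: str = "nvrtc"):
    """Force compilation of a Cuda source string through CuPy.

    The backend can be "nvrtc" or "nvcc". CuPy is imported lazily so that
    the timing helper above stays usable on machines without a GPU.
    """
    import cupy as cp

    module = cp.RawModule(code=source, backend=backend)
    module.compile()  # RawModule compiles lazily unless asked to do it now
    return module
```

Usage would look like `module, seconds = timed(lambda: compile_with_cupy(kernel_source))`, where `kernel_source` stands for the kernel string taken from one of the test files.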
