mrakgr / The-Spiral-Language

Functional language with intensional polymorphism and first-class staging.
Mozilla Public License 2.0

The Cuda compiler is slow to compile the NL Holdem game #29

Open mrakgr opened 3 months ago

mrakgr commented 3 months ago

I've just finished the HU NL Holdem game that runs on the GPU as a part of the Spiral playlist. The screencast showing the implementation should come out in mid-June 2024, most likely before you investigate this issue.

What I've found during my work on the game is that the NVRTC (and NVCC) compilers are incredibly slow to compile it. It takes as much as 20s for the optimized version and 50s for the non-optimized one, and the primary culprit behind that huge discrepancy between the examples is how long the serializers take to compile.

Here are a few test cases:

(19s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1.py
(24s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1_bitfield.py

(37s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2.py
(51s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2_bitfield.py
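
Timings like the ones above could be reproduced with a small harness along these lines. This is only a sketch, not how the linked scripts measure anything: `time_call` and `compile_with_nvrtc` are hypothetical helper names, and the CuPy part assumes a working CUDA install. Note that CuPy caches compiled kernels on disk, so a second run of the same source may finish almost instantly unless the cache directory (see `CUPY_CACHE_DIR`) is cleared.

```python
import time


def time_call(fn):
    """Run fn() once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return elapsed, result


def compile_with_nvrtc(source):
    """Compile a CUDA source string through CuPy's NVRTC backend.

    Hypothetical helper: CuPy is imported lazily so that time_call
    remains usable on machines without a GPU.
    """
    import cupy

    module = cupy.RawModule(code=source, backend="nvrtc")
    module.compile()  # force compilation now instead of on first kernel lookup
    return module


# Example use on a machine with CuPy and a CUDA GPU:
#   kernel_source = 'extern "C" __global__ void noop() {}'
#   seconds, _ = time_call(lambda: compile_with_nvrtc(kernel_source))
#   print(f"NVRTC compile took {seconds:.2f}s")
```

Timing a single call keeps the measurement honest for a one-off compile; averaging over repeats would mostly measure the cache.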

What each of those scripts does is run the NL Holdem game on the GPU for a single step. The first two examples do less serialization of cards and pass them to and fro as raw integers between the kernel and the main memory, while the last two use discriminated union types to represent the cards and need to convert between them and the raw int representation.
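
To make the two representations concrete, here is a minimal sketch of the kind of conversion involved. The encoding (`rank * 4 + suit`) and all names here are assumptions for illustration, not the layout Spiral actually generates, and a Python dataclass of enums only stands in for a real discriminated union.

```python
from dataclasses import dataclass
from enum import IntEnum


class Suit(IntEnum):
    CLUBS = 0
    DIAMONDS = 1
    HEARTS = 2
    SPADES = 3


# Ranks TWO..ACE numbered 0..12 (assumed ordering).
Rank = IntEnum(
    "Rank",
    "TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE TEN JACK QUEEN KING ACE",
    start=0,
)


@dataclass(frozen=True)
class Card:
    """Structured card value, standing in for the union-typed representation."""
    rank: Rank
    suit: Suit


def card_to_int(card: Card) -> int:
    """Serialize a structured card to a raw integer in 0..51."""
    return int(card.rank) * 4 + int(card.suit)


def int_to_card(raw: int) -> Card:
    """Deserialize a raw integer in 0..51 back into a structured card."""
    if not 0 <= raw < 52:
        raise ValueError(f"card index out of range: {raw}")
    rank, suit = divmod(raw, 4)
    return Card(Rank(rank), Suit(suit))
```

The raw-int variants skip both functions and pass the integer straight through, which is why they involve less serializer code for the compiler to process.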

I also tested whether using bit fields to specify the size of the union type's tag affects compilation time, and it does significantly: the bit field variants above take 24s and 51s, versus 19s and 37s without them. As a result, I've modified the Spiral compiler not to use bit fields for tags anymore.
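
For readers unfamiliar with the term, the difference between a bit-field tag and a plain tag can be sketched with `ctypes`, which mirrors C struct layout. This is an illustration under assumptions, not the code Spiral emits: the struct names, the 2-bit width, and the single `payload` field are all made up for the example.

```python
import ctypes


class BitfieldTagged(ctypes.Structure):
    """Tag declared as a 2-bit field, like `unsigned int tag : 2;` in C."""
    _fields_ = [
        ("tag", ctypes.c_uint32, 2),   # only 2 bits of storage for the tag
        ("payload", ctypes.c_uint32),  # hypothetical union payload
    ]


class PlainTagged(ctypes.Structure):
    """Tag declared as a full-width integer, like `unsigned int tag;` in C."""
    _fields_ = [
        ("tag", ctypes.c_uint32),
        ("payload", ctypes.c_uint32),
    ]


# A 2-bit tag can only distinguish 4 cases; larger values are truncated.
narrow = BitfieldTagged(tag=5, payload=7)
wide = PlainTagged(tag=5, payload=7)
print(narrow.tag, wide.tag)
```

Reading or writing a bit-field tag requires masking and shifting, whereas a plain tag is an ordinary load or store. Whether that is what slows NVRTC down is not established here.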

I don't know why the Cuda compiler is so slow to compile a game, and it might be the case that you have inefficiencies in the compiler leading to exponential blowups in compilation time. If it takes this long to compile NL Holdem, I cannot imagine how long it would take to compile something bigger. If you ever get around to optimizing the compilation speed in the Cuda compiler, then keep these examples in mind. I bet you don't get many people writing games on the GPU directly, so they might be good targets.

To run the scripts, you are going to have to install CuPy. Here is my system info.

PS D:\Users\Marko\Source\Repos\The Spiral Language\VS Code Plugin> python -c "import cupy; cupy.show_config()"
OS                           : Windows-10-10.0.22631-SP0
Python Version               : 3.11.6
CuPy Version                 : 13.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.26.1
SciPy Version                : None
Cython Build Version         : 0.29.36
Cython Runtime Version       : None
CUDA Root                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3
nvcc PATH                    : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin\nvcc.EXE
CUDA Build Version           : 12020
CUDA Driver Version          : 12030
CUDA Runtime Version         : 12020 (linked to CuPy) / 12030 (locally installed)
cuBLAS Version               : (available)
cuFFT Version                : 11012
cuRAND Version               : 10304
cuSOLVER Version             : (11, 5, 4)
cuSPARSE Version             : (available)
NVRTC Version                : (12, 3)
Thrust Version               : 200200
CUB Build Version            : 200200
Jitify Build Version         : b0269c8
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : None
NCCL Runtime Version         : None
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GeForce RTX 4060
Device 0 Compute Capability  : 89
Device 0 PCI Bus ID          : 0000:01:00.0

The way I am running the game itself in the video is by pressing Ctrl+Shift+B to activate the Terminal -> Run Build Task command. Before that, you'll have to run npm install in the ml3 folder to install the relevant Node packages.

mrakgr commented 2 months ago

I significantly upgraded the Cuda backend for Spiral with C++-style shared pointer reference counting, so it now supports the full range of features, such as recursive union types and closures. I also implemented heap-allocated arrays as part of the expanded library and used them to replace the static array lists used in the script here. The results are disappointing: it only cut the compilation time from 22s to 19s. I was hoping it would bring it down to 2-3s. The video on it will be released on the 28th of this month.