Open mrakgr opened 3 months ago
I significantly upgraded the Cuda backend for Spiral with C++-style shared pointer reference counting, so it now supports the full range of features, such as recursive union types and closures. I also implemented heap-allocated arrays as part of the expanded library and used them to replace the static array lists used in the script here. The results are disappointing: compilation time only dropped from 22s to 19s, whereas I was hoping it would come down to 2-3s. The video on it will be released on the 28th of this month.
I've just finished the HU NL Holdem game that runs on the GPU as a part of the Spiral playlist. The screencast showing the implementation should come out in mid-June 2024, most likely before you investigate this issue.
What I've found during my work on the game is that the NVRTC (and NVCC) compilers are incredibly slow to compile it. It takes as much as 20s for the optimized version and 50s for the non-optimized one, and the primary culprit behind that huge discrepancy between the examples is how long the serializers take to compile.
Here are a few test cases:
(19s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1.py
(24s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test1_bitfield.py
(37s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2.py
(51s) https://github.com/mrakgr/The-Spiral-Language/blob/slow_compilation_of_union_types_by_nvrtc/Spiral%20Compilation%20Tests/cuda_experiments/ml3/game/nl_hu_holdem/compilation_time_test2_bitfield.py
Each of those scripts runs the NL Holdem game on the GPU for a single step. The first two examples do less serialization of cards and pass them back and forth between the kernel and main memory as raw integers, while the last two use discriminated union types to represent the cards and have to convert between them and the raw int representation.
I also tested whether using bit fields to specify the size of the union type's tag affects compilation time, and it does, significantly: the bit field variants above take 24s instead of 19s and 51s instead of 37s. As a result, I've modified the Spiral compiler not to use bit fields for tags anymore.
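To make the bit field comparison concrete, here is a minimal sketch of the two tag layouts. The `Card` struct, the `read_tag` kernel, and the `make_union_source` helper are hypothetical names made up for illustration; this is not the code Spiral actually generates, only the general shape of a tagged union with and without a bit field tag.

```python
def make_union_source(use_bitfield: bool, n_cases: int = 4) -> str:
    """Return CUDA C++ source for a hypothetical tagged union with n_cases payloads.

    With use_bitfield=True the tag is declared as a bit field just wide enough
    to index every case; otherwise it is a plain unsigned char.
    """
    # Smallest number of bits that can hold the tags 0 .. n_cases - 1.
    tag_bits = max(1, (n_cases - 1).bit_length())
    if use_bitfield:
        tag_decl = f"unsigned char tag : {tag_bits};"
    else:
        tag_decl = "unsigned char tag;"
    # One int payload per case; real payloads would differ per case.
    cases = "\n".join(f"        int case{i};" for i in range(n_cases))
    return f"""
struct Card {{
    union {{
{cases}
    }};
    {tag_decl}
}};

extern "C" __global__ void read_tag(const Card *cards, int *out) {{
    out[threadIdx.x] = cards[threadIdx.x].tag;
}}
"""


bitfield_source = make_union_source(use_bitfield=True)
plain_source = make_union_source(use_bitfield=False)
```

The two strings differ only in the tag declaration, so compiling both isolates the effect of the bit field from everything else in the kernel. A kernel this small will not show the multi-second difference seen in the game, so the sketch only illustrates what is being compared.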
I don't know why the Cuda compiler is so slow to compile a game, and it might be the case that you have inefficiencies in the compiler leading to exponential blowups in compilation time. If it takes this long to compile NL Holdem, I cannot imagine how long it would take to compile something bigger. If you ever get around to optimizing the compilation speed in the Cuda compiler, then keep these examples in mind. I bet you don't get many people writing games on the GPU directly, so they might be good targets.
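For anyone who wants to measure NVRTC compilation time in isolation, here is a rough sketch using CuPy's `RawModule`. The helper names `unique_source` and `time_nvrtc_compile` are my own for illustration and are not part of the linked test scripts. It assumes CuPy's kernel cache is keyed on the source text, so appending a unique comment forces a real compilation on every call.

```python
import time
import uuid


def unique_source(source: str) -> str:
    """Append a unique comment so a cache keyed on the source text cannot hit."""
    return f"{source}\n// cache-buster: {uuid.uuid4().hex}\n"


def time_nvrtc_compile(source: str, options: tuple = ()) -> float:
    """Return the seconds NVRTC takes to compile `source` through CuPy.

    Requires CuPy and a CUDA-capable device. The import is done lazily so
    that unique_source stays usable on machines without a GPU.
    """
    import cupy as cp

    start = time.perf_counter()
    module = cp.RawModule(
        code=unique_source(source), backend="nvrtc", options=options
    )
    # RawModule compiles lazily; compile() forces it to happen here.
    module.compile()
    return time.perf_counter() - start
```

Calling `time_nvrtc_compile(open(path).read())` on an extracted kernel source, once with and once without extra `options`, gives a timing that excludes the Python-side setup and the kernel launch.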
To run the scripts, you are going to have to install CuPy. Here is my system info.
The way I am running the game itself in the video is by pressing Ctrl+Shift+B to activate the Terminal -> Run Build Task command. Before that, you'll have to run `npm install` in the `ml3` folder to install the relevant Node packages.