Profile Guided Optimization

Profile Guided Optimization (PGO)
- Multiple versions of Megakernels compiled within a single CUDA module
- Eliminate the usage of multiple Macros
Critical enhancement and bug fix for tracing

Comments are appreciated!

This doesn't include the scripts that collect traces for different configurations and generate config recommendations for specific nodes, since they are still ad hoc, though straightforward. To summarize, we repeat the same run for each candidate config (for now, only #blocks ranging from 1 to 6), and based on the per-node time breakdown within the Megakernels, we pick the right config (#blocks_per_SM) for specific nodes and split them out of the Megakernel. The current implementation supports generating an offline configuration file in JSON format, which the library then loads dynamically to build a CUDA graph tailored to the task at hand. To reproduce the speedup for hide&seek, we can simply use a JSON file containing the following (the key is the node ID and the value is the right #blocks_per_SM, obtained with the scripts mentioned above):

{"144": 2, "148": 2, "159": 2, "173": 2, "187": 2, "201": 2, "218": 4, "220": 4}

Then pass it at runtime, e.g.:

$ MADRONA_MWGPU_PROFILE_CONFIG_FILE=/path/to/node_config.json MADRONA_RENDER_NOOP=1 PYTHONPATH=. python gpu_hideseek/scripts/benchmark.py 16384 940 1 0
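Since the selection scripts are not part of this PR, here is a minimal sketch of what the per-node selection step could look like. It is not the actual script: the function name `pick_blocks_per_sm`, the shape of the input (per-node timings keyed by #blocks_per_SM), the `default_blocks` baseline, and the `min_gain` threshold are all assumptions made for illustration, and the timings are made-up numbers rather than measured results.

```python
import json


def pick_blocks_per_sm(node_times, default_blocks=1, min_gain=0.05):
    """Pick a #blocks_per_SM override for each node that benefits from one.

    node_times: {node_id: {blocks_per_sm: measured_time}}, one entry per
        traced config (hypothetical layout; the real trace format may differ).
    default_blocks: config the node would otherwise run with (assumption).
    min_gain: minimum relative speedup required before a node is listed
        (assumption, used to ignore differences that are likely noise).

    Returns {str(node_id): blocks_per_sm}, matching the JSON layout above.
    """
    config = {}
    for node_id, times in node_times.items():
        if default_blocks not in times:
            continue  # no baseline measurement, so nothing to compare against
        baseline = times[default_blocks]
        # Fastest config for this node; ties go to the smaller block count.
        best = min(times, key=lambda blocks: (times[blocks], blocks))
        if best != default_blocks and times[best] < baseline * (1.0 - min_gain):
            config[str(node_id)] = best
    return config


if __name__ == "__main__":
    # Illustrative timings only (arbitrary units), not real measurements.
    example_times = {
        144: {1: 10.0, 2: 6.0, 3: 7.0},
        150: {1: 5.0, 2: 4.9, 3: 5.2},
        218: {1: 8.0, 2: 7.0, 4: 4.0},
    }
    print(json.dumps(pick_blocks_per_sm(example_times)))
```

Nodes whose best config is the default, or whose gain falls below the threshold, are left out of the output, so only the nodes worth splitting out of the Megakernel appear in the file.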

shacklettbp / madrona

Profile Guided Optimization #9