shacklettbp / madrona

MIT License

Profile Guided Optimization #9

Closed xiezhq-hermann closed 1 year ago

xiezhq-hermann commented 1 year ago

Comments are appreciated!

This doesn't include the scripts that collect traces for different configurations and generate config recommendations for specific nodes, since they are still ad hoc, though straightforward. To summarize: we repeat the same run for each candidate config (for now, only #blocks, ranging from 1 to 6). Based on the per-node time breakdown within the megakernel, we can pick the right config (#blocks_per_SM) for specific nodes and split those nodes out of the megakernel.

The current implementation allows an offline configuration file to be generated in JSON format. The library can then load this file dynamically to build a CUDA graph tailored to the task at hand. To reproduce the speedup for hide&seek, we can simply use a JSON file containing the following (each key is a node ID and each value is the right #blocks_per_sm for that node, obtained with the scripts mentioned above):

{"144": 2, "148": 2, "159": 2, "173": 2, "187": 2, "201": 2, "218": 4, "220": 4}

Then pass it at runtime, e.g.:

$ MADRONA_MWGPU_PROFILE_CONFIG_FILE=/path/to/node_config.json MADRONA_RENDER_NOOP=1 PYTHONPATH=. python gpu_hideseek/scripts/benchmark.py 16384 940 1 0
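
For readers curious about the selection step, here is a rough sketch of how per-node recommendations could be derived from the collected timings. It is not the script used for this PR: the function names, the `baseline_blocks` and `min_speedup` parameters, the timing layout, and the numbers are all hypothetical and only illustrate the idea of picking the fastest config per node and emitting JSON in the format shown above.

```python
import json


def recommend_blocks_per_sm(timings, baseline_blocks, min_speedup=1.1):
    """Pick a blocks-per-SM value for each node that clearly benefits from one.

    timings: {node_id: {blocks_per_sm: elapsed_time}}, one entry per repeated run.
    baseline_blocks: config the megakernel is assumed to use otherwise (assumption).
    min_speedup: hypothetical threshold; nodes below it are left in the megakernel.
    """
    recommendation = {}
    for node_id, per_config in timings.items():
        # Fastest config for this node across all runs.
        best_blocks = min(per_config, key=per_config.get)
        baseline_time = per_config[baseline_blocks]
        best_time = per_config[best_blocks]
        # Only split a node out if another config is meaningfully faster.
        if best_blocks != baseline_blocks and baseline_time >= min_speedup * best_time:
            recommendation[str(node_id)] = best_blocks
    return recommendation


def write_config(recommendation, path):
    """Write the recommendation as a {node_id: blocks_per_sm} JSON object."""
    with open(path, "w") as f:
        json.dump(recommendation, f)


# Made-up timings in milliseconds, for illustration only.
example_timings = {
    144: {1: 5.0, 2: 3.0, 4: 3.5},
    150: {1: 2.0, 2: 2.1, 4: 2.4},
    218: {1: 9.0, 2: 6.0, 4: 4.0},
}
example_config = recommend_blocks_per_sm(example_timings, baseline_blocks=1)
```

Calling `write_config(example_config, "node_config.json")` would then produce a file that can be passed via `MADRONA_MWGPU_PROFILE_CONFIG_FILE` as in the command above. The threshold keeps nodes with only marginal gains inside the megakernel; whether that trade-off is right depends on the cost of splitting a node out, which this sketch does not model.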