We currently have two different paths for PTX compilation:
 1. One-shot, by giving the PTX to cuModuleLoadDataEx.
 2. Multi-stage, by first compiling the PTX to cubin using
    cuLinkCreate / cuLinkAddData / cuLinkComplete, then giving the
    cubin to cuModuleLoadData.
The first method appears to be the older one, and is used in the final
benchmarks at the end of the search. The second method is used during
the search: by separating the PTX -> cubin step from the running step,
it gives us more control and frees the thread responsible for running
kernels from having to do JIT compilation of PTX assembly.
Unfortunately, in some cases (at least on our local Maxwell and
Pascal machines using CUDA 10.1) the two paths do not generate the
same cubin, even though we pass the same compilation options in both
cases. The different cubins can have wildly different performance
characteristics, and the final benchmarks can hence be significantly
worse than what was seen during the search.
This patch removes the first path, so that all PTX compilations go
through the multi-stage path. This ensures that the final benchmarking
(whether run at the end of the search or post hoc using tlcli bench)
actually uses the same device code that was used during the search.