ulysseB / telamon

A framework to find good combinations of optimizations for computational kernels on GPUs.
https://ulysseb.github.io/telamon/telamon
Apache License 2.0

Use a single PTX compilation path #302

Closed Elarnon closed 4 years ago

Elarnon commented 4 years ago

We currently have two different paths for PTX compilation:

The first method appears to be the oldest, and is used in the final benchmarks at the end of the search. The second method is used during the search: by separating the PTX -> cubin step from the running step, it gives us more control and frees the thread responsible for running kernels from having to do JIT compilation of PTX assembly.
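To illustrate the separation described above, here is a minimal sketch of a two-stage pipeline in which a worker thread handles the PTX -> cubin step and the running thread only ever receives finished cubins. It is not Telamon's code and makes no CUDA calls: `compile_ptx_to_cubin`, `run_cubin`, `compile_worker` and `run_pipeline` are hypothetical stand-ins chosen for this example.

```python
import queue
import threading


def compile_ptx_to_cubin(ptx: str) -> bytes:
    """Hypothetical stand-in for the PTX -> cubin step (no CUDA involved)."""
    return b"CUBIN:" + ptx.encode("utf-8")


def run_cubin(cubin: bytes) -> int:
    """Hypothetical stand-in for loading and running a precompiled cubin."""
    return len(cubin)


def compile_worker(ptx_queue, cubin_queue):
    # Compilation stage: runs on its own thread, away from the kernel runner.
    while True:
        item = ptx_queue.get()
        if item is None:  # sentinel: no more candidates to compile
            cubin_queue.put(None)
            return
        name, ptx = item
        cubin_queue.put((name, compile_ptx_to_cubin(ptx)))


def run_pipeline(candidates):
    ptx_queue, cubin_queue = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=compile_worker, args=(ptx_queue, cubin_queue))
    worker.start()

    for candidate in candidates:
        ptx_queue.put(candidate)
    ptx_queue.put(None)

    # Running stage: consumes ready-made cubins and never compiles PTX itself.
    results = {}
    while True:
        item = cubin_queue.get()
        if item is None:
            break
        name, cubin = item
        results[name] = run_cubin(cubin)

    worker.join()
    return results
```

In this sketch the running stage never sees PTX, so it cannot end up with device code different from what the compilation stage produced.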

Unfortunately, in some cases (at least on our local Maxwell and Pascal machines using CUDA 10.1) those two paths do not generate the same cubin, even though we pass the same compilation options in both cases. The different cubins can have wildly different performance characteristics, so the final benchmarks can be significantly worse than what was measured during the search.

This patch removes the first path, so that all PTX compilations go through the multi-stage path. This ensures that the final benchmarking (whether run at the end of the search or post hoc using tlcli bench) actually uses the same device code that was used during the search.