Autotuning takes a while, and for us most of that time is spent JIT-compiling the kernel for each configuration rather than running it. Since compilation happens on the host CPU and should not affect the measured timings, it would be nice if the compilations could run in parallel, with all the configurations then benchmarked on the GPU one after another once they are done. Is this something that might be worth supporting?
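To make the request concrete, here is a rough sketch of the two-phase flow in plain Python. It is only an illustration of the idea, not how the autotuner is actually implemented: `compile_config`, `run_benchmark`, and `autotune` are hypothetical stand-ins that I made up for this example, and the "kernel" is just a Python function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compile_config(config):
    """Hypothetical stand-in for JIT-compiling one configuration (host CPU work)."""
    time.sleep(0.01)  # pretend compilation takes a while
    block = config["BLOCK"]
    return lambda x: x * block  # pretend this is the compiled kernel


def run_benchmark(kernel):
    """Hypothetical stand-in for timing one compiled kernel on the GPU."""
    start = time.perf_counter()
    kernel(3)
    return time.perf_counter() - start


def autotune(configs, max_workers=4):
    # Phase 1: compile every configuration concurrently on the host.
    # pool.map returns results in the same order as `configs`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        kernels = list(pool.map(compile_config, configs))

    # Phase 2: benchmark one configuration at a time, so that the
    # measurements do not compete with each other or with compilation.
    timings = [run_benchmark(kernel) for kernel in kernels]

    best = min(range(len(configs)), key=timings.__getitem__)
    return configs[best], kernels, timings


if __name__ == "__main__":
    configs = [{"BLOCK": 32}, {"BLOCK": 64}, {"BLOCK": 128}]
    best_config, _, timings = autotune(configs)
    print("best config:", best_config)
```

A thread pool only helps here if the real compile step releases the GIL, for example by shelling out to an external compiler. If it does not, a process pool would probably be needed instead, and the compiled artifacts would have to be passed back through a cache or as serialized binaries.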