Open ezhulenev opened 1 year ago
Adding @benvanik to review
This feels like something that should be handled a level or two above IREE. IREE already has support for defining multiple functions and switching between code paths. A framework could choose to run through warmup / benchmark / tuning iterations and then decide which to use for the rest of it's invocations. That decision could be saved so future runs remember the results until the tuning results are invalidated (on system restart / device migration / etc.).
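The framework-level flow described above (warmup/benchmark iterations, then pick a path and cache the decision) can be sketched roughly like this. This is an illustrative Python sketch, not an IREE API; all names (`pick_best`, `candidates`) are hypothetical.

```python
import time

def pick_best(candidates, benchmark_iters=3):
    """Run each candidate code path a few times and return the
    name of the fastest one. A framework could persist this
    choice and reuse it until the tuning results are invalidated
    (system restart, device migration, etc.)."""
    timings = {}
    for name, fn in candidates.items():
        fn()  # warmup iteration, excluded from timing
        start = time.perf_counter()
        for _ in range(benchmark_iters):
            fn()
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

The point is that nothing here needs compiler support: the decision is just data that the framework stores and feeds back into which function it invokes.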
Do you have an example of switching between code paths? Is it a regular `scf.if` in the function body that switches between dispatches? Any examples of querying execution timing programmatically?
I still like the idea of a `hal.executable.autotuning.export` operation as "syntactic sugar" that gets expanded into a `util.initializer` region + timing + exported-function selection, because I don't want to emit this low-level IR directly. I can do it in my custom dialect initially.
A high-level idea of the expansion I have in mind:

```
util.global @selected_ordinal : i32
util.initializer {
  for (func : exported) {
    run & measure
  }
  util.global.store @selected_ordinal
}
```
And then I can use `scf.switch` to dispatch the correct exported function? However, wouldn't that break command buffer construction? Can I have a global of `SymbolRef` and do something like:

```
%best_func_ref = util.global.load @selected_func : !hal.symbol.ref
flow.dispatch %best_func_ref
```
> Do you have an example of switching between code paths?
`hal.device.switch` is somewhat specialized; `vm.cond_br` is general-purpose control flow.
we should have a meeting - hard to work through on an issue :)
`DumpExecutableBenchmarks` shows how to manually build the HAL stuff - that's what'd be easiest here. For switching, the easiest way would be to have multiple functions in an executable and then a function that takes an arg and switches between them - that way the whole pipeline sees a single dispatch but the implementation changes based on what the global arg is. The arg will be uniform and 100% predictable and shouldn't have any perf impact (a common technique in ubershaders).
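The ubershader-style pattern above can be sketched in Python: one entry point takes a variant index and branches to one of several implementations, so callers only ever see a single dispatch. The names (`dispatch_entry`, `variant_a`, `variant_b`) are hypothetical illustrations, not IREE APIs.

```python
def variant_a(x):
    return x * 2       # e.g. kernel tuned for one tile size

def variant_b(x):
    return x * x       # e.g. kernel tuned for another tile size

VARIANTS = [variant_a, variant_b]

def dispatch_entry(selected, x):
    # `selected` comes from a global written once after tuning.
    # It is uniform across the whole dispatch, so the branch is
    # perfectly predictable and essentially free.
    return VARIANTS[selected](x)
```

Because the selector is uniform, every invocation in a dispatch takes the same branch, which is why this technique has negligible performance cost.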
Request description
Triton heavily relies on runtime auto-tuning to select the best kernel. Tunable parameters are typically tile/block sizes, and they also impact the grid size.
Currently it's possible to compile Triton IR to a single custom dispatch, but there is no clear way to do auto-tuning today.
Proposal: let's add an explicit operation for auto-tunable groups of exports.
Example:
At run time we'll have two options:

1. `hal.executable.autotuning.export` will start with round-robin kernel selection to collect statistics; once the numbers are stable it will always run the best kernel.
2. A `@__autotune` function (similar to globals initialization) that will do auto-tuning with fake data, so that we get reproducible runs of the "main" computation.

Questions:
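Option 1 (round-robin selection until the statistics stabilize, then always running the best kernel) could look roughly like this. A hypothetical Python sketch, not an IREE runtime API; `RoundRobinTuner` and its methods are made up for illustration.

```python
class RoundRobinTuner:
    """Cycle through kernels collecting timing samples; once every
    kernel has enough samples, lock in the one with the lowest mean."""

    def __init__(self, num_kernels, samples_needed=10):
        self.samples = [[] for _ in range(num_kernels)]
        self.samples_needed = samples_needed
        self.best = None
        self._next = 0

    def select(self):
        if self.best is not None:
            return self.best          # tuning finished: always best
        k = self._next                # still tuning: round-robin
        self._next = (self._next + 1) % len(self.samples)
        return k

    def report(self, kernel, elapsed):
        if self.best is not None:
            return
        self.samples[kernel].append(elapsed)
        if all(len(s) >= self.samples_needed for s in self.samples):
            means = [sum(s) / len(s) for s in self.samples]
            self.best = means.index(min(means))
```

Whether "stable" should mean a fixed sample count, as here, or a variance threshold is one of the design questions this proposal leaves open.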
What component(s) does this issue relate to?
Compiler, Runtime
Additional context
No response