[RFC] Export multiple executables as a part of auto-tune group?

ezhulenev commented 1 year ago

Request description

Triton relies heavily on run-time auto-tuning to select the best kernel. The tunable parameters are typically tile/block sizes, which also affect the grid size.

Currently it's possible to compile Triton IR to a single custom dispatch, but there is no clear way to do auto-tuning.

Proposal: let's add an explicit operation for auto-tunable groups of exports.

Example:

  hal.executable.source private @executable attributes {
    objects = #hal.executable.objects<{
      #nvptx_sm_80_target = [
        #hal.executable.object<{
          path = "path/to/a/ptx/with/all/triton/kernels"
        }>
      ]
    }>
  } {

    // All variants must have the same layout
    hal.executable.autotuning.export @matmul
      variants([@matmul_tile_32x32, @matmul_tile_64x64])

    hal.executable.export public @matmul_tile_32x32 ordinal(0)
        layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
    ^bb0(%device: !hal.device, %workload: index):
      hal.return %grid_for_32x32_tiling 
    }

    hal.executable.export public @matmul_tile_64x64 ordinal(1)
        layout(...) attributes { workgroup_size = [64 : index, 1 : index, 1 : index] } {
    ^bb0(%device: !hal.device, %workload: index):
      hal.return %grid_for_64x64_tiling 
    }

  }  // hal.executable.source

At run time we'll have two options:

- hal.executable.autotuning.export starts with round-robin kernel selection to collect statistics; once the numbers are stable, it always runs the best kernel.
- Add an @__autotune function (similar to globals initialization) that does auto-tuning with fake data, so that we get reproducible runs of the "main" computation.
- Automatically do auto-tuning at module initialization time?
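
To make the round-robin option concrete, here is a minimal host-side sketch in Python. It is not IREE code: the class name `RoundRobinAutotuner`, the `samples_per_variant` cutoff, and the injectable `clock` are all assumptions made for illustration, and "numbers are stable" is simplified to "every variant has been sampled a fixed number of times".

```python
import time
from typing import Callable, Dict, List, Optional


class RoundRobinAutotuner:
    """Cycles through variants while sampling, then pins the fastest one.

    Hypothetical helper: stands in for what an autotuning export group
    might do at run time. "Stable" is approximated by a fixed sample count.
    """

    def __init__(
        self,
        variants: Dict[str, Callable[..., object]],
        samples_per_variant: int = 3,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._variants = variants
        self._names: List[str] = list(variants)
        self._samples_per_variant = samples_per_variant
        self._clock = clock
        self._timings: Dict[str, List[float]] = {n: [] for n in self._names}
        self._calls = 0
        self.selected: Optional[str] = None

    def __call__(self, *args, **kwargs):
        # After tuning, always run the pinned variant.
        if self.selected is not None:
            return self._variants[self.selected](*args, **kwargs)

        # Tuning phase: pick the next variant in round-robin order and time it.
        name = self._names[self._calls % len(self._names)]
        self._calls += 1
        start = self._clock()
        result = self._variants[name](*args, **kwargs)
        self._timings[name].append(self._clock() - start)

        # Once every variant has enough samples, pin the one with the
        # smallest best-case time.
        if all(len(t) >= self._samples_per_variant for t in self._timings.values()):
            self.selected = min(self._names, key=lambda n: min(self._timings[n]))
        return result
```

Every call during the tuning phase still returns a real result, which is why all variants must share the same layout and semantics. The trade-off is that early calls of the "main" computation run with different kernels, which is the reproducibility concern behind the @__autotune alternative.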

Questions:

Should all exported variants be a part of a single executable (PTX)?

What component(s) does this issue relate to?

Compiler, Runtime

Additional context

No response

allieculp commented 1 year ago

Adding @benvanik to review

ScottTodd commented 1 year ago

This feels like something that should be handled a level or two above IREE. IREE already has support for defining multiple functions and switching between code paths. A framework could choose to run through warmup / benchmark / tuning iterations and then decide which to use for the rest of its invocations. That decision could be saved so future runs remember the results until the tuning results are invalidated (on system restart / device migration / etc.).
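
A rough sketch of what that framework-level flow could look like, in Python. Everything here is hypothetical: `pick_variant`, the JSON cache file, and the `device_key` string used for invalidation are illustrative names, not part of IREE or any framework.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Tuple


def pick_variant(
    variants: Dict[str, Callable[..., object]],
    args: Tuple,
    cache_path: Path,
    device_key: str,
    warmup: int = 1,
    iters: int = 5,
    timer: Callable[[], float] = time.perf_counter,
) -> str:
    """Returns the name of the fastest variant, reusing a saved decision.

    The saved decision is only trusted when it was recorded for the same
    device_key (an assumed stand-in for "same device / same boot").
    """
    if cache_path.exists():
        cache = json.loads(cache_path.read_text())
        if cache.get("device_key") == device_key and cache.get("best") in variants:
            return cache["best"]

    timings: Dict[str, float] = {}
    for name, fn in variants.items():
        for _ in range(warmup):  # Warmup runs are not timed.
            fn(*args)
        start = timer()
        for _ in range(iters):
            fn(*args)
        timings[name] = (timer() - start) / iters

    best = min(timings, key=timings.get)
    # Persist the decision so later runs can skip tuning.
    cache_path.write_text(json.dumps({"device_key": device_key, "best": best}))
    return best
```

Changing `device_key` (for example after a device migration) makes the cached entry stale, so the next call re-runs the benchmark and overwrites the file.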

ezhulenev commented 1 year ago

Do you have an example of switching between code paths? Is it a regular scf.if in the function body that switches between dispatches? Any examples querying execution timing programmatically?

I still like the idea of hal.executable.autotuning.export operation as a "syntactic sugar" that gets expanded into util.initializer region + timing + exported function selection, because I don't want to emit this low level IR directly. I can do it in my custom dialect initially.

High level idea of the expansion I think of now:

util.global @selected_ordinal : i32

util.initializer {

for (func : exported) {
  run & measure
}

util.global.store @selected_ordinal
}

And then I can use scf.switch to dispatch the correct exported function? However, wouldn't that break command buffer construction? Can I have a global of SymbolRef and do something like:

%best_func_ref = util.global.load @selected_func : !hal.symbol.ref
flow.dispatch %best_func_ref

ScottTodd commented 1 year ago

> Do you have an example of switching between code paths?

hal.device.switch is somewhat specialized, vm.cond_br is general purpose control flow.

https://github.com/openxla/iree/blob/6a61d9f5e0418fd8322306471cee613c2b31efe7/compiler/src/iree/compiler/Dialect/HAL/IR/HALOps.td#L1298-L1376

benvanik commented 1 year ago

we should have a meeting - hard to work through on an issue :)

DumpExecutableBenchmarks shows how to manually build HAL stuff - that's what'd be easiest here. For switching between them, the easiest way would be to have multiple functions in an executable and then a function that takes an arg and switches between them - that way the whole pipeline sees a single dispatch but the implementation changes based on what the global arg is. The arg will be uniform and 100% predictable and shouldn't have any perf impact (common technique in ubershaders).
