iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Combine hal.interfaces to reduce executable layout count #3502

Open benvanik opened 4 years ago

benvanik commented 4 years ago

We currently have a unique interface for each used permutation of I/O buffers. Long-term we want to be able to pack these (using a hal.binding_constraints attribute similar to the new hal.buffer_constraints attribute). Short-term we could just always pad out to the max used; since the implicit binding constraints must be met for anything to work, there's no harm in making everything need the same set/binding count.

As we add hal.binding_constraints and packing in the future we can revert this specialization; having it now will allow us to optimize dispatch overhead and perform command buffer canonicalizations like those in #1155 to reduce runtime overhead.

  hal.executable @linked_vmla attributes {sym_visibility = "private"} {
    hal.interface @legacy_io_0 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=1, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_1 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
      hal.interface.binding @arg2, set=0, binding=2, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=3, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_2 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_3 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
      hal.interface.binding @ret1, set=0, binding=3, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_4 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @arg1, set=0, binding=1, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
      hal.interface.binding @ret1, set=0, binding=3, type="StorageBuffer", access="Write|Discard"
      hal.interface.binding @ret2, set=0, binding=4, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_5 {
      hal.interface.binding @arg0, set=0, binding=0, type="StorageBuffer", access="Read"
      hal.interface.binding @ret0, set=0, binding=1, type="StorageBuffer", access="Write|Discard"
      hal.interface.binding @ret1, set=0, binding=2, type="StorageBuffer", access="Write|Discard"
    }
    hal.interface @legacy_io_6 {
      hal.interface.binding @ret0, set=0, binding=0, type="StorageBuffer", access="Write|Discard"
    }

The immediate benefit is that we will need only one !hal.executable_layout and one !hal.descriptor_set_layout per module, which makes it even clearer how well we can elide allocations. Early work on #1155 could do the canonicalization checks to elide redundant descriptor bindings.
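As a rough illustration of the "pad out to the max used" idea, the sketch below takes the binding counts of the seven interfaces in the listing above and derives a single layout wide enough for all of them. The `Binding` class and `padded_layout` function are hypothetical names invented for this example and are not part of IREE. The sketch also ignores access flags (Read vs. Write|Discard), which a real merge would have to reconcile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Hypothetical stand-in for one hal.interface.binding entry."""
    set: int
    binding: int
    type: str = "StorageBuffer"


# Binding counts per interface, transcribed from the listing in this issue.
INTERFACE_BINDING_COUNTS = {
    "legacy_io_0": 2,
    "legacy_io_1": 4,
    "legacy_io_2": 3,
    "legacy_io_3": 4,
    "legacy_io_4": 5,
    "legacy_io_5": 3,
    "legacy_io_6": 1,
}


def padded_layout(binding_counts):
    """Return one layout with as many slots as the widest interface needs."""
    max_count = max(binding_counts.values())
    return [Binding(set=0, binding=i) for i in range(max_count)]


layout = padded_layout(INTERFACE_BINDING_COUNTS)

# Slots each interface would leave unused under the shared layout.
unused = {name: len(layout) - n for name, n in INTERFACE_BINDING_COUNTS.items()}

print(len(layout))            # 5 slots, driven by legacy_io_4
print(unused["legacy_io_6"])  # 4 slots left unused by the narrowest interface
```

The trade-off is visible in `unused`: narrow interfaces carry padding slots, in exchange for every dispatch sharing one layout.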

benvanik commented 4 years ago

(@antiagainst this is something you mentioned in the past - I'm thinking of throwing it in the common code path today so all vulkan/metal/llvm/etc get the same optimization for the moment)

benvanik commented 3 years ago

Blocked by #1519.

benvanik commented 3 years ago

This will be important in the GPU perf burndown to reduce our runtime overhead as it'll result in fewer calls into Metal/Vulkan, fewer chances to flush GPU pipelines on layout change, and fewer redundant GPU API resources created and managed.
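To make the "fewer layout changes, fewer resources" point concrete, here is a toy model that counts layout switches between consecutive dispatches and the number of distinct layouts created, first with one layout per interface and then with a single unified layout. The dispatch order and the `count_layout_switches` helper are assumptions made up for illustration; nothing here measures real Vulkan or Metal behavior.

```python
def count_layout_switches(dispatch_interfaces, layout_of):
    """Count how often consecutive dispatches use a different layout."""
    switches = 0
    previous = None
    for interface in dispatch_interfaces:
        layout = layout_of(interface)
        if previous is not None and layout != previous:
            switches += 1
        previous = layout
    return switches


# Hypothetical dispatch order within one command buffer.
dispatches = [
    "legacy_io_0",
    "legacy_io_2",
    "legacy_io_2",
    "legacy_io_1",
    "legacy_io_0",
]

# Today: each interface maps to its own layout.
per_interface_switches = count_layout_switches(dispatches, lambda name: name)
per_interface_layouts = len(set(dispatches))

# Proposed: every interface maps to the same padded layout.
unified_switches = count_layout_switches(dispatches, lambda name: "unified")
unified_layouts = 1

print(per_interface_switches, per_interface_layouts)  # 3 3
print(unified_switches, unified_layouts)              # 0 1
```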

benvanik commented 3 years ago

In the new streams path there is a dedicated place where this can happen in HAL/Analysis/BindingLayout.cpp. Here we analyze all dispatch sites for each export and can decide how to lay out the interfaces. The initial implementation will still be designed for push descriptor sets to minimize changes but should be extended to split resource types/static vs dynamic/max counts/etc.

The goal would be to move the slowest-varying bindings earlier in the set(s) and keep dynamic bindings separate from static bindings (even if in the same set). This would allow us to perform a majority of binding operations once per command buffer and then only rebind to update dynamic offsets or for one-off uses (externals and such). Since the interface is per executable export and that export may be dispatched many times from different execution regions, the algorithm needs to rank, across all dispatch sites, which bindings would benefit from being reused and which would not.
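One possible shape for that ranking, sketched under heavy assumptions: count how often each binding's resource changes between consecutive dispatch sites, then order bindings with static ones first and, within each group, the slowest-varying first. The function name, the input format, and the change-count heuristic are all hypothetical; this is not a description of what HAL/Analysis/BindingLayout.cpp does.

```python
from collections import defaultdict


def rank_bindings(dispatch_sites, dynamic):
    """Order bindings: static before dynamic, slowest-varying first.

    dispatch_sites: list of dicts mapping binding name -> resource id,
                    one dict per dispatch site, in dispatch order.
    dynamic:        set of binding names that use dynamic offsets.
    """
    changes = defaultdict(int)
    last = {}
    for site in dispatch_sites:
        for name, resource in site.items():
            # A change is a different resource than at the previous site.
            if name in last and last[name] != resource:
                changes[name] += 1
            last[name] = resource
    # Sort key: static first, then fewest changes, then name for stability.
    order = sorted(last, key=lambda n: (n in dynamic, changes[n], n))
    return order, {n: changes[n] for n in order}


# Hypothetical dispatch sites for one export: weights stay fixed while the
# input and output buffers ping-pong between dispatches.
sites = [
    {"weights": "buf_w", "input": "buf_a", "output": "buf_b"},
    {"weights": "buf_w", "input": "buf_b", "output": "buf_c"},
    {"weights": "buf_w", "input": "buf_c", "output": "buf_d"},
]

order, change_counts = rank_bindings(sites, dynamic={"input"})
print(order)  # ['weights', 'output', 'input']
```

Here `weights` never changes, so it lands first and could be bound once per command buffer, while the dynamic `input` binding is pushed to the end where it can be rebound on its own.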

benvanik commented 3 years ago

The meta goal is that we could eliminate push descriptor sets from the HAL API. WebGPU has no support for them and Vulkan may never get core support (and our emulation is really bad today), and it would keep the API surface area smaller. Push descriptor sets are also not compatible with command buffer reuse.

allieculp commented 1 year ago

Sending to backlog due to the age of this issue - please reinstate as needed.