I'm hitting an issue that I think is related here that actually results in a compilation failure (as opposed to performance issues) because IREE doesn't handle i1 tensors well at all: https://github.com/google/iree/issues/3102#issuecomment-963409877
Does `-iree-flow-enable-linalg-detensorize` help there? It still isn't enabled by default. (There are many things overlapping here and we should try to keep the issues focused...)
It does not, unfortunately
Also, unrelated, the only mention I see of that flag in the codebase is in its definition. It should probably have some tests...
Well yeah :P I spent about three weeks trying to generate/find representative test cases and didn't get very far. Any models I tried to compile hit frontend issues, either in their own frameworks or in TF->MLIR. The MLIR code has unit test coverage in MLIR core and is mostly an optimization in IREE itself. I have been wanting to flip the flag (https://github.com/google/iree/pull/6863) to enable coverage in the few tests (if, while, collatz, etc.) that are affected.
@GMNGeoffrey @ScottTodd Stale P1 item here, please take a look when you get a chance and update or deprioritize!
I think @rsuderman was going to look at getting this work over the finish line.
HLO can only represent tensors, meaning that values that should not be tensors are still wrapped in them and operated on just as if they were real dense data. This is most easily visible with loops and conditionals, where the loop iterator initialization, increment, and condition are all in tensors. This results in host readbacks and a bunch of other extraneous work when really these should just be modeled as simple primitive values (i32/index, etc).
For example, this input loop:
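A minimal sketch of the kind of loop in question, assuming mhlo with a rank-0 `tensor<i32>` iterator (illustrative only, not the exact repro; op spellings vary across mhlo versions):

```mlir
// Illustrative sketch: the iterator, step, and condition all live in
// rank-0 tensors, because HLO has no scalar types.
func.func @count_to_ten(%start: tensor<i32>) -> tensor<i32> {
  %bound = mhlo.constant dense<10> : tensor<i32>
  %step = mhlo.constant dense<1> : tensor<i32>
  %result = "mhlo.while"(%start) ({
  ^bb0(%iter: tensor<i32>):
    // Loop condition: produced as a tensor<i1>, not an i1.
    %cmp = "mhlo.compare"(%iter, %bound) {comparison_direction = "LT"}
        : (tensor<i32>, tensor<i32>) -> tensor<i1>
    "mhlo.return"(%cmp) : (tensor<i1>) -> ()
  }, {
  ^bb0(%iter: tensor<i32>):
    // Loop increment: a full tensor op for what is really scalar math.
    %next = mhlo.add %iter, %step : tensor<i32>
    "mhlo.return"(%next) : (tensor<i32>) -> ()
  }) : (tensor<i32>) -> tensor<i32>
  return %result : tensor<i32>
}
```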
is turned into the following CFG:
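Sketched here with the upstream `cf` branch ops (older builds spell these `br`/`cond_br` in std): the loop-carried values are still tensors, and the branch condition has to be pulled out of a `tensor<i1>`:

```mlir
// The same loop as an explicit CFG: block arguments carry tensor<i32>
// values and the branch needs an i1 extracted from a rank-0 tensor.
func.func @count_to_ten(%start: tensor<i32>) -> tensor<i32> {
  cf.br ^loop(%start : tensor<i32>)
^loop(%iter: tensor<i32>):
  %bound = mhlo.constant dense<10> : tensor<i32>
  %cmp = "mhlo.compare"(%iter, %bound) {comparison_direction = "LT"}
      : (tensor<i32>, tensor<i32>) -> tensor<i1>
  %cond = tensor.extract %cmp[] : tensor<i1>
  cf.cond_br %cond, ^body(%iter : tensor<i32>), ^exit(%iter : tensor<i32>)
^body(%iter_body: tensor<i32>):
  %step = mhlo.constant dense<1> : tensor<i32>
  %next = mhlo.add %iter_body, %step : tensor<i32>
  cf.br ^loop(%next : tensor<i32>)
^exit(%result: tensor<i32>):
  return %result : tensor<i32>
}
```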
Which then after lowering through flow has the condition dispatched to the device and the condition read back via `flow.tensor.load`.
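Roughly, with a hypothetical `@loop_cond` executable and abbreviated dispatch syntax:

```mlir
// Sketch of the flow-level loop header: the comparison becomes a device
// dispatch and flow.tensor.load synchronizes device->host just to branch.
^loop(%iter: tensor<i32>):
  %c1 = arith.constant 1 : index
  %cmp = flow.dispatch @loop_cond::@main[%c1](%iter) : (tensor<i32>) -> tensor<i1>
  %cond = flow.tensor.load %cmp : tensor<i1>  // host readback
  cf.cond_br %cond, ^body(%iter : tensor<i32>), ^exit(%iter : tensor<i32>)
```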
If instead we found these host-only values (even if only scalar tensors to start), we could run the whole loop in the host VM and avoid the readback.
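In that world the loop is pure host code, something like:

```mlir
// Detensorized sketch: loop state is a plain i32, so the whole loop runs
// in the host VM with no dispatches or readbacks.
^loop(%iter: i32):
  %c10 = arith.constant 10 : i32
  %cond = arith.cmpi slt, %iter, %c10 : i32
  cf.cond_br %cond, ^body(%iter : i32), ^exit(%iter : i32)
^body(%iter_body: i32):
  %c1 = arith.constant 1 : i32
  %next = arith.addi %iter_body, %c1 : i32
  cf.br ^loop(%next : i32)
```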
This is also visible in inputs that have dynamic update slices (translated through to `flow.tensor.update`), where the current loop iterator value is needed to know where to update the tensor (mapping back to which timestep is being processed, etc). These updates need to be recorded into the command buffer on the host, which means that we perform a readback effectively just to compute an offset and then throw it back to the device.

Other enhancements around indirect dispatch and dynamic `flow.tensor.update` (#1160) will make some of these cases not so bad when device->host data dependencies really do exist; however, if we can remove all of the trivial cases without relying on that, we'll have much more readable IR and much lower overhead at runtime.
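To make the update-slice case concrete (shapes and `flow.tensor.update` syntax here are illustrative), a detensorized iterator turns the readback into plain host arithmetic:

```mlir
// Sketch: with %iter as a plain i32 the update offset is computed on the
// host, so recording the command buffer needs no device->host readback.
%offset = arith.index_cast %iter : i32 to index
%c0 = arith.constant 0 : index
%updated = flow.tensor.update %slice, %target[%offset, %c0]
    : tensor<1x128xf32> -> tensor<16x128xf32>
```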