Open stbaione opened 1 week ago
Good idea - we should make the HAL module query the error by calling iree_hal_fence_query if the result of the wait is ABORTED.
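To make the suggestion concrete, here is a minimal sketch of the "wait, then query on ABORTED" pattern. It is a toy Python model, not IREE code: `ToySemaphore`, `ToyFence`, `Status`, and `await_with_cause` are hypothetical names invented for illustration, and the only behavior taken from this thread is that the wait reports ABORTED when an underlying semaphore fails and that a separate fence query can surface the real error.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Code(Enum):
    OK = 0
    ABORTED = 1
    RESOURCE_EXHAUSTED = 2


@dataclass
class Status:
    code: Code
    message: str = ""

    @property
    def ok(self) -> bool:
        return self.code is Code.OK


@dataclass
class ToySemaphore:
    # Hypothetical stand-in for a HAL semaphore; `failure` is set once it fails.
    failure: Optional[Status] = None

    def query(self) -> Status:
        return self.failure if self.failure is not None else Status(Code.OK)


@dataclass
class ToyFence:
    # Hypothetical stand-in for a HAL fence over several semaphores.
    semaphores: List[ToySemaphore] = field(default_factory=list)

    def wait(self) -> Status:
        # As described in the issue: any failed semaphore makes the wait
        # report ABORTED, with no detail about the underlying failure.
        if any(not s.query().ok for s in self.semaphores):
            return Status(Code.ABORTED, "fence wait aborted")
        return Status(Code.OK)

    def query(self) -> Status:
        # Return the first underlying failure, if there is one.
        for s in self.semaphores:
            status = s.query()
            if not status.ok:
                return status
        return Status(Code.OK)


def await_with_cause(fence: ToyFence) -> Status:
    """Wait on the fence; on ABORTED, query it to recover the real error."""
    status = fence.wait()
    if status.code is Code.ABORTED:
        cause = fence.query()
        if not cause.ok:
            return Status(cause.code, f"{status.message}; caused by: {cause.message}")
    return status


fence = ToyFence([
    ToySemaphore(),
    ToySemaphore(Status(Code.RESOURCE_EXHAUSTED, "out of device memory")),
])
result = await_with_cause(fence)
print(result.code.name, "-", result.message)
```

In this model the caller sees the semaphore's own code and message rather than a bare ABORTED, which is the behavior the issue asks for.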
Actually, doing it in the HAL module fence await is tricky due to coroutine yielding. There's no status type in the VM yet, and maybe it's time to add one so we can do the logic in compiler-produced code - that's better for being able to handle retries/backoffs/etc anyway. Native status support would be useful for other things too. The boxing will be a bit ugly and I'll have to think about how best to do it (maybe by having statuses always be split into a code and an optional !vm.buffer containing the status payload). vm.fail and other opcodes today take the status code and a message, and I think making that message a !vm.buffer will let us do both pass-through statuses that originate from native code and dynamically generated messages from the compiled program itself (if anyone builds a strfmt or whatnot).
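As a rough illustration of the "split code plus optional buffer" idea, the sketch below models a status as an integer code and an optional byte payload. Everything here is an assumption made for illustration: `VmStatus`, `VmFailure`, `vm_fail`, `from_native`, `format_message`, and `run_with_retries` are hypothetical names, the numeric codes are illustrative, and `bytes` merely stands in for a !vm.buffer. It is not a proposal for the actual VM design.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative numeric codes (assumed, not taken from this thread).
OK = 0
ABORTED = 10
UNAVAILABLE = 14


@dataclass(frozen=True)
class VmStatus:
    code: int
    payload: Optional[bytes] = None  # stands in for an optional !vm.buffer


class VmFailure(Exception):
    """Raised by the toy fail op; carries the full status."""

    def __init__(self, status: VmStatus):
        super().__init__(f"code={status.code} payload={status.payload!r}")
        self.status = status


def vm_fail(code: int, message: Optional[bytes]) -> None:
    # Toy model of a fail op taking a code and a message buffer:
    # a non-OK code aborts execution, carrying the buffer along.
    if code != OK:
        raise VmFailure(VmStatus(code, message))


def from_native(code: int, text: str) -> VmStatus:
    # Pass-through case: a status that originated in native code.
    return VmStatus(code, text.encode("utf-8") if text else None)


def format_message(template: str, *args: object) -> bytes:
    # Dynamically generated case: the program builds its own message bytes.
    return (template % args).encode("utf-8")


def run_with_retries(op: Callable[[], VmStatus], max_attempts: int = 3) -> VmStatus:
    # With statuses visible as values, retry policy can live in ordinary
    # program logic instead of inside the native await.
    status = op()
    attempts = 1
    while status.code == UNAVAILABLE and attempts < max_attempts:
        status = op()
        attempts += 1
    return status
```

Because the payload is just bytes, the same fail path handles a message forwarded from native code (`from_native`) and one built by the program (`format_message`) without distinguishing between them.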
Request description
Recently I began attempting to run shortfin with an IREE-compiled llama_8b_f16 decomposed model, and hit an error on startup. We're in the process of troubleshooting the specific error, but it did raise the possibility of making the IREE VM error reporting a bit more useful.
Below is the error output, which includes a failure in hal.fence.await due to a failure of a semaphore it was awaiting:
(This is followed by a large MLIR dump, which I'll omit to save space since it seems to be unrelated to this request.)
As you can see, the hal.fence.await error is reported, but not the actual semaphore error that caused the issue. As I understand it, when any underlying semaphore fails, the await has to return ABORTED.
We believe the error collection behavior may need to be changed, for example to have hal.fence.await chain the error.
What component(s) does this issue relate to?
Runtime
Additional context
No response