Improve IREE VM Error Reporting

stbaione commented 3 weeks ago

Request description

I recently tried to run shortfin with an IREE-compiled llama_8b_f16 decomposed model and hit an error on startup. We're still troubleshooting the specific error, but it pointed to an opportunity to make the IREE VM error reporting more useful.

Below is the error output, which shows hal.fence.await failing because a semaphore it was awaiting had failed:

[2024-10-31 15:56:26.400] [error] [on.py:121] Traceback (most recent call last):
  File "/home/stbaione/repos/SHARK-Platform/.venv/lib/python3.12/site-packages/starlette/routing.py", line 693, in lifespan
    async with self.lifespan_context(app) as maybe_state:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stbaione/.pyenv/versions/3.12.7/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stbaione/repos/SHARK-Platform/shortfin/python/shortfin_apps/llm/server.py", line 42, in lifespan
    service.start()
  File "/home/stbaione/repos/SHARK-Platform/shortfin/python/shortfin_apps/llm/components/service.py", line 69, in start
    self.inference_program = sf.Program(
                             ^^^^^^^^^^^
ValueError: shortfin_iree-src/runtime/src/iree/hal/drivers/hip/event_semaphore.c:350: ABORTED; while calling import; while invoking native function hal.fence.await;
[ 0] bytecode module.__init:67870 [
    llama8b_f16.mlir:1055:12,

(A large MLIR dump follows, which I've omitted to save space since it seems unrelated to this request.)

As you can see, the hal.fence.await error is reported, but not the underlying semaphore error that caused it. As I understand it, when any underlying semaphore fails, the wait has to return ABORTED.

We believe the error collection behavior may need to change to do one of the following:

- Make the VM query the underlying semaphores for the actual error.
- Make the implementation of hal.fence.await chain the error.

What component(s) does this issue relate to?

Runtime

Additional context

No response

benvanik commented 3 weeks ago

Good idea - we should make the HAL module query the error by calling iree_hal_fence_query if the result of the wait is ABORTED.
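
To make the idea concrete, here is a toy Python model of the "query on ABORTED" flow. It is only a sketch and not the IREE API: `ToySemaphore`, `ToyFence`, `fence_await`, and the "device lost" message used in testing are all hypothetical, and it assumes that querying the fence surfaces the first underlying semaphore failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Code(Enum):
    # A few canonical status codes, for illustration only.
    OK = 0
    ABORTED = 10
    INTERNAL = 13


@dataclass
class Status:
    code: Code
    message: str = ""


@dataclass
class ToySemaphore:
    # Hypothetical stand-in for a HAL semaphore; `failure` is set once it fails.
    name: str
    failure: Optional[Status] = None


@dataclass
class ToyFence:
    semaphores: List[ToySemaphore]

    def wait(self) -> Status:
        # Mirrors the behavior described in the issue: any failed semaphore
        # makes the wait report a bare ABORTED with no detail.
        if any(s.failure is not None for s in self.semaphores):
            return Status(Code.ABORTED)
        return Status(Code.OK)

    def query(self) -> Status:
        # Assumption: the query returns the first underlying failure, if any.
        for s in self.semaphores:
            if s.failure is not None:
                return s.failure
        return Status(Code.OK)


def fence_await(fence: ToyFence) -> Status:
    """Waits on the fence and, on ABORTED, chains the underlying cause."""
    status = fence.wait()
    if status.code is Code.ABORTED:
        cause = fence.query()
        if cause.code is not Code.OK:
            return Status(cause.code, f"{cause.message}; while awaiting fence")
    return status
```

With this shape, a caller sees the semaphore's own code and message instead of only ABORTED, which is the information missing from the traceback above.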

benvanik commented 3 weeks ago

Actually, doing it in the HAL module fence await is tricky due to coroutine yielding. There's no status type in the VM yet, and maybe it's time to add one so we could do the logic in compiler-produced code. That's better for being able to handle retries/backoffs/etc. anyway, and native status support would be useful for other things too. The boxing will be a bit ugly and I'll have to think about how best to do it (maybe by having statuses always be split into a code and an optional !vm.buffer containing the status payload). vm.fail and other opcodes today take the status code and a message, and I think making that message a !vm.buffer will let us support both pass-through statuses that originate from native code and dynamically generated messages from the compiled program itself (if anyone builds a strfmt or whatnot).
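
A rough Python sketch of the split representation described above, purely to illustrate the shape and not an actual VM design: `SplitStatus`, `VmFailure`, `vm_fail`, `propagate`, and `fail_with_generated_message` are hypothetical names, and a `bytes` value stands in for the optional !vm.buffer payload.

```python
from dataclasses import dataclass
from typing import Optional


class VmFailure(Exception):
    """Hypothetical stand-in for a VM failure carrying a code and a message buffer."""

    def __init__(self, code: int, message: Optional[bytes]):
        self.code = code
        self.message = message
        text = message.decode("utf-8", errors="replace") if message else ""
        super().__init__(f"code={code}: {text}")


@dataclass(frozen=True)
class SplitStatus:
    # The status is always a plain integer code plus an optional payload.
    code: int
    payload: Optional[bytes] = None  # stands in for an optional !vm.buffer

    def ok(self) -> bool:
        return self.code == 0


def vm_fail(code: int, message: Optional[bytes] = None) -> None:
    # Models a fail op whose message operand is a buffer rather than a constant.
    raise VmFailure(code, message)


def propagate(status: SplitStatus) -> None:
    # Pass-through case: a status that originated in native code is forwarded
    # with its payload untouched.
    if not status.ok():
        vm_fail(status.code, status.payload)


def fail_with_generated_message(code: int, attempts: int) -> None:
    # Dynamic case: the program formats its own message at runtime.
    vm_fail(code, f"giving up after {attempts} attempts".encode("utf-8"))
```

Because the code stays a separate integer, compiled code could branch on it for retries or backoff without ever touching the payload, and the payload is only materialized when there is something to report.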
