Closed stellaraccident closed 1 week ago
If I manually add a retain before calling iree_hal_device_queue_dealloca (so that no buffer is ever truly freed), then it still fails in the same way, so I think this is more likely to be an enqueue race of some kind?
Disregard. I haven't found the root cause yet, but I isolated the code path and it is a unique case in the application code. Probably an issue there.
I have a rather specific scenario where I can trigger an ASAN fault reliably here (between the two prints):
Additional app level logging shows this:
In this case, every line except the last is printed from an application thread, and the last is printed from the deferred work queue thread. My verbose app level logging shows that the buffer is deleted as part of the ref count decrement when exiting the scope that holds the iree_hal_device_queue_dealloca call:
Note that the allocation logging is at the application level, so it is only tracking that it thinks it has dropped its last reference.
It would appear that the buffer is not being retained as it is transported to the deferred work queue? There are 14 prior async deallocations which work fine. Or there is some other subtle race. FWIW, this specific case seems to be the only one where the deferred work item is processed truly concurrently. All of the prior ones only get processed after some long delay. So maybe an enqueue race.
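To make the retention hypothesis concrete, here is a minimal single-threaded sketch of the invariant I would expect to hold: the enqueue takes its own reference, so the application dropping its last reference cannot free the buffer before the worker has processed the item. Everything in it (`buffer_t`, `work_item_t`, `enqueue_dealloca`, `process_work_item`, `demo_retained_enqueue`) is hypothetical and made up for illustration. It is not IREE's implementation and says nothing about whether IREE does or does not retain here. A real version would also need atomic ref counts and a thread-safe queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical ref-counted buffer; NOT iree_hal_buffer_t. */
typedef struct {
  int ref_count;   /* plain int: this sketch is single-threaded */
  int *destroyed;  /* set to 1 when the last reference is dropped */
} buffer_t;

static buffer_t *buffer_create(int *destroyed) {
  buffer_t *b = malloc(sizeof(*b));
  if (!b) abort();
  b->ref_count = 1;
  b->destroyed = destroyed;
  *destroyed = 0;
  return b;
}

static void buffer_retain(buffer_t *b) { ++b->ref_count; }

static void buffer_release(buffer_t *b) {
  if (--b->ref_count == 0) {
    *b->destroyed = 1;
    free(b);
  }
}

/* Hypothetical deferred work item that owns one reference. */
typedef struct {
  buffer_t *buffer;
} work_item_t;

/* Enqueue side: retain BEFORE the item becomes visible to the worker. */
static work_item_t enqueue_dealloca(buffer_t *b) {
  buffer_retain(b);
  work_item_t item = {b};
  return item;
}

/* Worker side: use the buffer, then drop the queue's reference. */
static void process_work_item(work_item_t *item) {
  assert(item->buffer->ref_count > 0); /* still alive while queued */
  buffer_release(item->buffer);
  item->buffer = NULL;
}

/* Returns 1 if the buffer outlived the app's release and was freed
 * only after the worker processed the item. */
int demo_retained_enqueue(void) {
  int destroyed = 0;
  buffer_t *b = buffer_create(&destroyed);
  work_item_t item = enqueue_dealloca(b);
  buffer_release(b); /* app scope exits, drops its last reference */
  int alive_while_queued = !destroyed;
  process_work_item(&item);
  return alive_while_queued && destroyed;
}
```

If the `buffer_retain` in `enqueue_dealloca` were missing, or happened after the item was already visible to the worker, the application's release could free the buffer while the item is still pending. That would match a fault that only shows up when the work item is processed truly concurrently.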