iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.85k stars 620 forks source link

Suspected lifetime issue in HIP asycn dealloc #19150

Closed stellaraccident closed 1 week ago

stellaraccident commented 1 week ago

I have a rather specific scenario where I can trigger an ASAN fault reliably here (between the two prints):

iree_status_t iree_hal_hip_allocator_free_async(
    iree_hal_allocator_t* base_allocator, hipStream_t stream,
    iree_hal_buffer_t* buffer) {
  iree_hal_hip_allocator_t* allocator =
      iree_hal_hip_allocator_cast(base_allocator);
  IREE_RETURN_IF_ERROR(
      iree_hal_hip_set_context(allocator->symbols, allocator->hip_context));

  printf("*** iree_hal_hip_allocator_free_async: %p\n", buffer);
  hipDeviceptr_t device_ptr = iree_hal_hip_buffer_device_pointer(buffer);
  printf("*** </iree_hal_hip_allocator_free_async>\n");
  if (!device_ptr) {
    return iree_ok_status();
  }

  IREE_RETURN_IF_ERROR(IREE_HIP_RESULT_TO_STATUS(
      allocator->symbols, hipFreeAsync(device_ptr, stream), "hipFreeAsync"));
  iree_hal_hip_buffer_set_allocation_empty(buffer);

  IREE_TRACE_FREE_NAMED(IREE_HAL_HIP_ALLOCATOR_ID, (void*)device_ptr);
  IREE_STATISTICS(iree_hal_allocator_statistics_record_free(
      &allocator->statistics, iree_hal_buffer_memory_type(buffer),
      iree_hal_buffer_allocation_size(buffer)));
  return iree_ok_status();
}

Additional app level logging shows this:

[2024-11-13 22:04:40.134] [info] +++ RUN DTOR: 0x60800009ab20
[2024-11-13 22:04:40.134] [info] SCHED: async dealloca(device=0x613000afad80, affinity=1, buffer=0x60800009ab20):[Wait:(0x60b000d4ef70@26), Signal:()]
[2024-11-13 22:04:40.134] [info] SCHED: </CreateAsyncBufferDestructor DTOR>
[2024-11-13 22:04:40.134] [info] IREE delete hal_buffer_t(0x60800009ab20)
*** iree_hal_hip_allocator_free_async: 0x60800009ab20

In this case, the lines up to the last are run on an application thread and the last is printed from the deferred work queue thread. I have verbose app level logging which shows that the delete of the buffer is happening as part of the ref count decremement when exiting the scope holding the iree_hal_device_queue_dealloca call:

TimelineResourceDestructor TimelineResource::CreateAsyncBufferDestructor(
    ScopedDevice &scoped_device, iree::hal_buffer_ptr buffer) {
  logging::info("+++ CREATE DTOR: {}", static_cast<void*>(buffer.get()));
  //iree::detail::hal_buffer_ptr_helper::retain(buffer.get());
  return [device = iree::hal_device_ptr::borrow_reference(
              scoped_device.raw_device()->hal_device()),
          affinity = scoped_device.affinity().queue_affinity(),
          buffer = std::move(buffer)](TimelineResource &res) {
    logging::info("+++ RUN DTOR: {}", static_cast<void*>(buffer.get()));
    SHORTFIN_TRACE_SCOPE_NAMED("TimelineResource::AsyncBufferDestructor");
    iree_hal_semaphore_list_t wait_semaphore_list = res.use_barrier();
    iree_hal_semaphore_list_t signal_semaphore_list =
        iree_hal_semaphore_list_empty();
    if (SHORTFIN_SCHED_LOG_ENABLED) {
      auto wait_sum = iree::DebugPrintSemaphoreList(wait_semaphore_list);
      auto signal_sum = iree::DebugPrintSemaphoreList(signal_semaphore_list);
      SHORTFIN_SCHED_LOG(
          "async dealloca(device={}, affinity={:x}, buffer={}):[Wait:{}, "
          "Signal:{}]",
          static_cast<void *>(device.get()), affinity,
          static_cast<void *>(buffer.get()), wait_sum, signal_sum);
    }
    SHORTFIN_THROW_IF_ERROR(iree_hal_device_queue_dealloca(
        device, affinity, wait_semaphore_list, signal_semaphore_list, buffer));
    SHORTFIN_SCHED_LOG("</CreateAsyncBufferDestructor DTOR>");
  };
}

Note that the allocation logging is at the application level, so it is only tracking that it thinks it has dropped its last reference.

It would appear that the buffer is not being retained as it is transported to the deferred work queue? There are 14 prior async deallocations which work fine. Or there is some other subtle race. FWIW, this specific case seems to be the only one where the deferred work item is processed truly concurrently. All of the prior ones only get processed after some long delay. So maybe an enqueue race.

stellaraccident commented 1 week ago

If I manually add a retain before calling iree_hal_device_queue_dealloca (so that no buffer is ever truly freed), then it still fails in the same way, so I think this is more likely to be an enqueue race of some kind?

stellaraccident commented 1 week ago

Disregard. I haven't found the root cause yet, but I isolated the code path and it is a unique case in the application code. Probably an issue there.