microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License
9.74k stars 791 forks source link

Concurrency bugs that cause hanging #890

Closed wangjwchn closed 1 month ago

wangjwchn commented 1 month ago

I believe that there are concurrency bugs in function mi_free_block_delayed_mt that cause hanging.

  mi_thread_free_t tfree = mi_atomic_load_relaxed(&page->xthread_free);
  do {
    use_delayed = (mi_tf_delayed(tfree) == MI_USE_DELAYED_FREE);
    if mi_unlikely(use_delayed) {
      // unlikely: this only happens on the first concurrent free in a page that is in the full list
      tfreex = mi_tf_set_delayed(tfree,MI_DELAYED_FREEING);
    }
    else {
      // usual: directly add to page thread_free list
      mi_block_set_next(page, block, mi_tf_block(tfree));
      tfreex = mi_tf_set_block(tfree,block);
    }
  } while (!mi_atomic_cas_weak_release(&page->xthread_free, &tfree, tfreex));

The above code segment attempts to push a node, tfreex, to the beginning of the concurrent linked-list-based stack, referred to as xthread_free.

If xthread_free is modified by other threads during the do-while loop, the program will hang because xthread_free has been changed ( != tfree), causing mi_atomic_cas_weak_release to always fail.

Moving the first line into the do-while loop can fix this issue.

The same issue occurred in another code segment of this function:

    tfree = mi_atomic_load_relaxed(&page->xthread_free);
    do {
      tfreex = tfree;
      mi_assert_internal(mi_tf_delayed(tfree) == MI_DELAYED_FREEING);
      tfreex = mi_tf_set_delayed(tfree,MI_NO_DELAYED_FREE);
    } while (!mi_atomic_cas_weak_release(&page->xthread_free, &tfree, tfreex));

The solution is similar to the one above.

mjp41 commented 1 month ago

Doesn't mi_atomic_cas_weak_release update tfree in the failure case.

E.g. it is based on: https://en.cppreference.com/w/c/atomic/atomic_compare_exchange

Have you actually observed this execution hanging execution? On what platform as it could be the atomics are not implemented correctly in that case?

wangjwchn commented 1 month ago

Just double-check the result. I believe you are correct. I mapped the mi_atomic_xxx instructions to a different atomic library, which doesn't have the failure update feature that causes the issue.