domenicquirl opened this issue 4 years ago
Another thing to look into is whether we can make fewer calls to defer_destroy by batching up all the garbage a given method call accumulates, and then freeing all of it in one defer closure instead.
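As a rough sketch of that batching idea (not flurry's actual API): rather than issuing one defer_destroy per removed entry, a method could collect its garbage locally and hand the whole batch to the collector in a single deferred closure. The `Batch`/`Garbage` names here are illustrative stand-ins.

```rust
// `Garbage` stands in for a type-erased retired value.
type Garbage = Box<dyn Send>;

struct Batch {
    items: Vec<Garbage>,
}

impl Batch {
    fn new() -> Self {
        Batch { items: Vec::new() }
    }

    // Called once per entry unlinked inside the critical section;
    // no interaction with the collector yet.
    fn retire(&mut self, g: Garbage) {
        self.items.push(g);
    }

    // A single deferred call frees the entire batch; with
    // crossbeam_epoch this would be one `defer` closure
    // instead of one `defer_destroy` per entry.
    fn defer_free(self, defer: impl FnOnce(Vec<Garbage>)) {
        defer(self.items);
    }
}
```

The point of the single closure is that crossbeam only pays one (possibly heap-allocated) deferred-function slot per batch instead of one per entry.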
seize: It would be possible to delay allocating some of the memory until a value is actually retired, but that then means that retire has to allocate, and there is some extra pointer indirection for every thread that decrements the reference count of a retired batch. The final thread will also have to go through the pointer of every value in the batch. This should reduce memory usage in general, but will likely have a significant performance impact for delete-heavy workloads.

I ran some tests with Linked<T> removed altogether and saw pretty significant performance boosts (~30%) in read-heavy workloads, with diminishing returns as thread counts rose, eventually meeting at $NUM_CPUS. In a mixed insert/delete workload there was a constant overhead of ~20% for the extra alloc/dealloc when retiring. This could likely be amortized by reusing list nodes.
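A minimal sketch of the reference-counted batch scheme described above, with assumed names (`RetiredBatch`, `leave`) that are not seize's actual API: each retired value gets its own allocated node in a batch, and the batch tracks how many threads may still reference it. Only the last thread to decrement reclaims, paying one pointer chase per retired value.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

struct RetiredBatch {
    nodes: Vec<Box<dyn Send>>, // one allocation per retired value
    active: AtomicUsize,       // threads that still need to decrement
}

impl RetiredBatch {
    fn new(nodes: Vec<Box<dyn Send>>, active_threads: usize) -> Arc<Self> {
        Arc::new(RetiredBatch {
            nodes,
            active: AtomicUsize::new(active_threads),
        })
    }
}

// Returns true only for the final thread, which is then responsible
// for walking and freeing every node in the batch (here, dropping the
// last Arc does that implicitly).
fn leave(batch: &Arc<RetiredBatch>) -> bool {
    batch.active.fetch_sub(1, Ordering::AcqRel) == 1
}
```

This makes the alloc-on-retire and the per-decrement indirection mentioned above concrete: `retire` pushes a fresh `Box`, and every `leave` touches the shared counter.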
All right, so here is what I am thinking.
Batch reuse would be optional. If a read-heavy program cares more about memory usage than allocation overhead, it can opt out. Retirement will then always involve allocating a node, and reclamation will involve an extra dealloc.
Another option would be for the thread that reclaims a batch to first try to take it for itself, before returning it to the collector. However, I'm not sure this optimization is worthwhile as reader threads may end up taking batches that they never end up using, and writer threads are likely to already have taken a batch.
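The reuse idea above could be sketched as a thread-local cache of empty batches (plain `Vec`s here stand in for lists of retirement nodes); a reclaiming thread keeps the emptied batch for its own future retirements instead of returning it to a global pool. All names are hypothetical.

```rust
use std::cell::RefCell;

thread_local! {
    // Small per-thread stash of empty batches available for reuse.
    static BATCH_CACHE: RefCell<Vec<Vec<usize>>> = RefCell::new(Vec::new());
}

// Retiring threads grab a cached batch if one exists, so retirement
// does not always have to allocate.
fn take_batch() -> Vec<usize> {
    BATCH_CACHE.with(|c| c.borrow_mut().pop()).unwrap_or_default()
}

// A thread that just reclaimed a batch clears it and keeps it locally;
// `clear` retains the allocation, so the capacity is reused.
fn return_batch(mut batch: Vec<usize>) {
    batch.clear();
    BATCH_CACHE.with(|c| c.borrow_mut().push(batch));
}
```

The caveat from the comment shows up directly here: a reader thread that calls `take_batch` but never retires anything just pins an allocation in its cache.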
Using this strategy, Linked<T> will only need to reserve an extra usize to record the current epoch value, as opposed to storing the node inline. I would like to make it possible to optionally create nodes on link (as is the current behavior), but I'm not sure this would be possible without some funky generics, and the scheme outlined above seems pretty good :)
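A layout sketch of what that would mean for the wrapper type; the field name `epoch` is illustrative, not the actual definition:

```rust
use std::mem::size_of;

// Under the scheme above, Linked<T> reserves just one extra usize
// (the recorded epoch) rather than embedding a whole reclamation
// node inline next to the value.
struct Linked<T> {
    epoch: usize, // epoch value recorded for this entry
    value: T,
}

fn overhead() -> usize {
    // Per-value overhead versus storing T alone.
    size_of::<Linked<usize>>() - size_of::<usize>()
}
```

So the per-entry memory cost drops to a single word, which is where the memory-usage improvement over an inline node comes from.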
In the discussion about performance improvements (#50), a point that arose repeatedly is the impact of the crossbeam_epoch-based garbage collection we use to track map entries (nodes and values).

There are several angles from which we might look at reducing this impact. This issue is meant to be for discussion and changes around how and when our code interacts with garbage collection. Ideas floated in https://github.com/jonhoo/flurry/pull/72#pullrequestreview-378586231 and previous comments may serve as a starting point. They include:

- Moving calls to crossbeam_epoch, e.g. in the form of defer_destroy, out of critical sections in favour of a simpler collection mechanism for things to be deleted. The collected entities could then be handed off to garbage collection after dropping the lock for the critical section. It is yet unclear whether this would yield a significant improvement over calling defer_destroy directly from within the critical section (as happens currently), as crossbeam essentially also stores a closure for the drop and takes some measures to optimize the storage of these closures (for example, a deferred closure only requires a heap allocation if it exceeds a certain size).
- defer_destroy and its usage itself. See again https://github.com/jonhoo/flurry/pull/72#pullrequestreview-378586231
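The first idea can be sketched as follows, with a `Mutex<Vec<String>>` standing in for a locked bin and `defer_all` standing in for handing the batch to crossbeam_epoch after the guard is gone; none of these names are flurry's actual internals:

```rust
use std::sync::Mutex;

// Inside the critical section, unlinked entries are only pushed onto a
// plain Vec; the deferred-destruction machinery runs once, after the
// lock has been released.
fn remove_entries(table: &Mutex<Vec<String>>, defer_all: impl FnOnce(Vec<String>)) {
    let garbage = {
        let mut guard = table.lock().unwrap();
        // Critical section: just collect, no GC interaction.
        std::mem::take(&mut *guard)
    }; // lock is dropped here
    // Hand the whole batch to the collector outside the lock.
    defer_all(garbage);
}
```

Whether this beats calling defer_destroy under the lock is exactly the open question above: crossbeam's own deferred-closure storage is already fairly cheap, so the win would come mainly from shortening the critical section.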