keithwill / TrieHard

A playground for building an efficient key and key prefix lookup collection.
MIT License
12 stars 1 forks source link

Consider RadixTree - Merging Redundant Keys on Set #6

Open keithwill opened 1 year ago

keithwill commented 1 year ago

Most of the implementations in this library do not remove nodes when values are removed. This simplifies concurrency concerns and would have less impact on performance as values are added and removed in repetition. This could lead to increasing memory use for unused keys over time, but due to the nature of how these Tries are expected to be used, that shouldn't be a concern. I should verify that concern with feedback from the community separately.

The Radix Tree, on the other hand, can take advantage of combining node values as values are added and removed. This can lead to improved performance for a Radix Tree with a long life that has had many keys added and removed from it.

We should investigate if unused nodes can be merged during set operations.

  1. Without negatively impacting the concurrency support of the Radix Tree
  2. Without negatively impacting the performance of reads and searches
  3. Without creating excessive garbage with the operations

There are some calling convention and semantic details that complicate this approach. Because we are using nullable values (with no type constraints) for the payload, and each node does not have an indicator of if a value is stored, it can be complicated to understand if a Node is used or not. A consumer of the API may explicitly set null for a key. When they prefix search for keys, would they expect that key to come back with a null value, or to not be included? If we add an indicator for if a node has a user set value (even if its null) that will cost some memory. At the very least it will cost a byte per node (which may be acceptable).

A risk of implementing this is that the Radix Tree already has complicated logic when setting node values (due to splitting and key sharing logic), and adding an additional layer on top of that may make it more difficult to improve the set speed of the collection or improve its concurrency support in the future.

Current work arounds that would give similar results: For a very long lived RadixTree used for caching and lookups where the developer controls the process that applies updates to the tree, a dev could create pause updates to the tree and make a fresh copy to swap. This would be a periodic maintenance operation.

We could also implement a method that a dev could call that would perform the same type of maintenance without creating a new reference (doing the merging in-place), though without flags indicating what nodes have user-set values (even nulls) we won't know which nodes can be safely merged.