keithwill / TrieHard

A playground for building an efficient key and key prefix lookup collection.
MIT License

Consider Archiving Experimental Collections #3

Closed keithwill closed 1 year ago

keithwill commented 1 year ago

TrieHard contains the following lookup implementations in TrieHard.Collections: FlatTrie, IndirectTrie, Radix Tree, SimpleTrie, and UnsafeTrie.

It may make sense to archive the implementations that do not provide any comparative advantage. This will lead to less code to maintain as changes to the IPrefixLookup interface are considered, and will reduce the size and surface area of the library as a NuGet package.

Here are my thoughts on the state of each.

FlatTrie: Out of the 'safe' tries, this is the second best performing one. Specialization of key searching by child count (an optimization that was applied to the Radix Tree) could also be applied to this trie and might make its performance compelling again. Because children are stored as arrays of primitives, there could also be opportunities to use SIMD/vector operations (though the MemoryExtensions binary search that is already in use performs pretty well); both approaches are sketched below. This trie also has very fast value counts and unordered iteration, as the values are stored in an Array/List. It theoretically supports single writer, multiple reader concurrency...but that support relies on an odd dedicated array swap space mechanic that is hard to understand and likely could never be expanded to a fully thread-safe (multi reader/writer) approach without a lot of rework.
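As a rough illustration of the two search options mentioned above, here is a minimal sketch (hypothetical names, not the FlatTrie source) of locating a child by its key byte in a node's sorted child-key array, first with the MemoryExtensions binary search already in use, then with a Vector128 scan as one possible SIMD-style alternative:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class ChildSearchSketch
{
    // The approach currently in use: binary search over the sorted child keys.
    public static int FindChildBinary(ReadOnlySpan<byte> childKeys, byte target)
    {
        int index = childKeys.BinarySearch(target);
        return index < 0 ? -1 : index; // negative result means "not found"
    }

    // A vectorized linear scan; for small child counts this can beat the
    // branchy binary search. Falls back to a scalar loop when SIMD is not
    // available or only a short tail remains.
    public static int FindChildVectorized(ReadOnlySpan<byte> childKeys, byte target)
    {
        int i = 0;
        if (Vector128.IsHardwareAccelerated)
        {
            var needle = Vector128.Create(target);
            ref byte start = ref MemoryMarshal.GetReference(childKeys);
            for (; i + Vector128<byte>.Count <= childKeys.Length; i += Vector128<byte>.Count)
            {
                var chunk = Vector128.LoadUnsafe(ref start, (nuint)i);
                uint matches = Vector128.ExtractMostSignificantBits(Vector128.Equals(chunk, needle));
                if (matches != 0)
                    return i + BitOperations.TrailingZeroCount(matches);
            }
        }
        for (; i < childKeys.Length; i++)
            if (childKeys[i] == target) return i;
        return -1;
    }
}
```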

As for downsides, the FlatTrie uses a substantial amount of managed memory (14gb for 5 million sequential records). I think some of this extra memory bloat might come from using ArrayPoolList, which doubles its internal buffer every time it needs to resize (meaning 1-5gb is likely allocated and unused). Those resizes are pretty expensive, and the GC pause durations are also rough. I don't think the ArrayPool.Shared implementation will hold onto arrays over around 100,000 entries, which means we're paying the cost to rent and return arrays that are likely being dropped as the trie grows in size. These are fundamental issues with using 'big array' backing for the nodes.
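For reference, this is the kind of doubling growth pattern being described (a minimal sketch, not the actual ArrayPoolList<T> implementation): right after a resize, up to half of the rented buffer is allocated but unused, and buffers the pool declines to retain are simply dropped.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch of a pooled, doubling list (not ArrayPoolList<T> itself).
public sealed class DoublingPooledList<T> : IDisposable
{
    private T[] _buffer = ArrayPool<T>.Shared.Rent(16);
    public int Count { get; private set; }

    public void Add(T item)
    {
        if (Count == _buffer.Length)
        {
            // Rent a buffer at least twice as large and copy everything over.
            // The old buffer is returned, but very large arrays may not be
            // retained by the shared pool, so the rent/return cost buys little.
            T[] larger = ArrayPool<T>.Shared.Rent(_buffer.Length * 2);
            Array.Copy(_buffer, larger, Count);
            ArrayPool<T>.Shared.Return(_buffer);
            _buffer = larger;
        }
        _buffer[Count++] = item;
    }

    public T this[int index] => _buffer[index];

    public void Dispose() => ArrayPool<T>.Shared.Return(_buffer);
}
```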

Tentatively I'd like to keep the Flat Trie around for a short time to see if there are ways to make its existence compelling again. With SIMD searches and improvements to its memory usage, it might serve a purpose as a 'safe' trie for fixed or short keys (such as for storing IDs) or for smaller collections...but overall it has a lot of disadvantages to overcome and I think it's likely this trie will be archived.

If this trie is archived, the 'SearchPage' method should be reviewed to see if it can be added to other tries or even to the IPrefixLookup interface. It's a form of skip/take.
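Conceptually (the real SearchPage signature may differ, and the shape of IPrefixLookup is assumed here), a paged prefix search is just a bounded window over the ordered results of a prefix search:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SearchPageSketch
{
    // Hypothetical helper: page through prefix search results. Assumes the
    // lookup can expose its prefix matches as an ordered sequence of
    // key/value pairs; the actual SearchPage method may work differently.
    public static List<KeyValuePair<string, TValue>> SearchPage<TValue>(
        IEnumerable<KeyValuePair<string, TValue>> prefixResults, int skip, int take)
    {
        return prefixResults.Skip(skip).Take(take).ToList();
    }
}
```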

IndirectTrie: This trie is an odd one. It has most of the same negatives as the FlatTrie, but mitigates the memory usage issues a bit by using jagged arrays to store the struct nodes instead of large arrays, and each node only stores indirect locations for its parent, next sibling, and first child.

This 'indirection' has a cost, though. Conceptually, it's harder to investigate issues in this trie at runtime, as all relationships require lookups into a jagged array and nodes don't contain references to all of their children (which makes investigating ordering issues painful). It also takes a performance toll: this trie only performs better than the simple dictionary-based tries.
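To make the indirection concrete, here is a hypothetical sketch (invented names, not the IndirectTrie source) of a struct node that stores only indices, with child enumeration resolved through the jagged node storage:

```csharp
using System.Collections.Generic;

public struct IndirectNodeSketch
{
    public byte KeyByte;
    public int Parent;       // index of the parent node, -1 for the root
    public int FirstChild;   // index of the first child, -1 if this is a leaf
    public int NextSibling;  // index of the next sibling, -1 if last child
}

public static class IndirectTraversalSketch
{
    // Nodes are stored in fixed-size pages (a jagged array) and addressed by
    // a single integer index, so every relationship is resolved indirectly.
    private const int PageSize = 4096;

    private static IndirectNodeSketch Get(IndirectNodeSketch[][] pages, int index) =>
        pages[index / PageSize][index % PageSize];

    // Enumerating a node's children means chasing sibling links, one indexed
    // lookup per hop, instead of following direct object references.
    public static IEnumerable<int> Children(IndirectNodeSketch[][] pages, int nodeIndex)
    {
        int child = Get(pages, nodeIndex).FirstChild;
        while (child != -1)
        {
            yield return child;
            child = Get(pages, child).NextSibling;
        }
    }
}
```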

The only idea worth saving from this trie is the 'CacheKey' concept. For tries that store a key's value at a stable index into a value array (like the Flat and Unsafe tries), a token can be returned from a call that gives a readonly reference to that array index, allowing an API caller to poll for changes to the key's value in a way that is essentially free.
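A minimal sketch of that idea (the names here are invented, and it assumes the backing value array itself is not replaced while the token is held):

```csharp
// Hypothetical token for a trie whose values sit at a stable index in a
// backing array: re-reading the current value is just an array access,
// with no further key lookup or trie traversal required.
public readonly struct ValueTokenSketch<TValue>
{
    private readonly TValue[] _values;
    private readonly int _index;

    public ValueTokenSketch(TValue[] values, int index)
    {
        _values = values;
        _index = index;
    }

    // A readonly reference into the value array; callers can poll this to
    // observe updates made by the trie's writer.
    public ref readonly TValue Value => ref _values[_index];
}
```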

Radix Tree: Merging runs of key characters into single nodes (rather than storing one node per key character) has a lot of benefits for memory usage and performance.

It currently has the best performance, likely the best concurrency support (single writer, multiple reader, based on a single atomic reference assignment), and is reasonably easy to diagnose and understand as code (though creating new nodes is slightly more complicated due to the splitting logic). It does create slightly more garbage than some of the other tries as new nodes are added, but that is a necessary evil of coordinating with the GC for safe concurrent updates.
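The publication pattern being described looks roughly like this (a minimal copy-on-write sketch under assumed names, not the actual RadixTree code): the single writer builds replacement nodes off to the side and swaps in the new tree with one reference assignment, while readers traverse whichever immutable snapshot they grabbed.

```csharp
using System;
using System.Threading;

public sealed class CowNodeSketch
{
    public readonly string KeySegment;
    public readonly object Value;
    public readonly CowNodeSketch[] Children;

    public CowNodeSketch(string keySegment, object value, CowNodeSketch[] children)
    {
        KeySegment = keySegment;
        Value = value;
        Children = children;
    }
}

public sealed class CowRootSketch
{
    private CowNodeSketch _root = new("", null, Array.Empty<CowNodeSketch>());

    // Readers take one volatile read of the root and then walk an immutable
    // snapshot; no locks are required.
    public CowNodeSketch ReadRoot() => Volatile.Read(ref _root);

    // The single writer clones the path it needs to change (creating garbage,
    // as noted above) and publishes the new tree with one reference write.
    public void Publish(CowNodeSketch newRoot) => Volatile.Write(ref _root, newRoot);
}
```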

This tree is the primary focus of the project currently for a safe and general purpose prefix lookup.

SimpleTrie: This trie is bland. It was implemented specifically to be unsurprising and uninteresting, and to match the recommendations that can easily be found in blog posts about prefix search data structures.

It's slow to search, uses excessive memory, and has no purpose except to serve as an example of how a naive/simple structure would compare. Its performance characteristics are almost exactly the same as the rmTrie that was added as an external library.
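For context, the blog-post style structure in question is essentially a dictionary per node (a generic sketch of that approach, not the SimpleTrie source), which is what drives the allocation count and the slow prefix enumeration:

```csharp
using System.Collections.Generic;

public sealed class NaiveTrieSketch<TValue>
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new();
        public TValue Value;
        public bool HasValue;
    }

    private readonly Node _root = new();

    public void Add(string key, TValue value)
    {
        Node current = _root;
        foreach (char c in key)
        {
            // One dictionary lookup (and potentially one allocation) per character.
            if (!current.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                current.Children[c] = next;
            }
            current = next;
        }
        current.Value = value;
        current.HasValue = true;
    }

    // Prefix search walks to the prefix node, then enumerates the whole subtree.
    public IEnumerable<KeyValuePair<string, TValue>> Search(string prefix)
    {
        Node current = _root;
        foreach (char c in prefix)
        {
            if (!current.Children.TryGetValue(c, out Node next)) yield break;
            current = next;
        }
        var stack = new Stack<(Node Node, string Key)>();
        stack.Push((current, prefix));
        while (stack.Count > 0)
        {
            var (node, key) = stack.Pop();
            if (node.HasValue) yield return new KeyValuePair<string, TValue>(key, node.Value);
            foreach (var (c, child) in node.Children)
                stack.Push((child, key + c)); // unordered, like a plain Dictionary
        }
    }
}
```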

I think I will move the SimpleTrie into the 'Alternatives' project, and I might drop rmTrie. I did not add rmTrie in an attempt to make it look bad, but its comparison with the SimpleTrie is not doing it any favors.

UnsafeTrie: A trie that stores its data in unmanaged memory in a compact way. This has some compelling advantages for systems that want to avoid growing the number of objects on the heap (to keep GC time shorter) and to keep the allocation rate and total working set low.
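As a bare-bones illustration of the idea (a hypothetical layout, not the UnsafeTrie's actual node format), node records can be allocated with NativeMemory so they never appear on the GC heap, at the cost of explicit lifetime management:

```csharp
using System;
using System.Runtime.InteropServices;

public sealed unsafe class UnmanagedNodeBlock : IDisposable
{
    // Hypothetical node layout; real node formats will differ.
    private struct Node
    {
        public byte KeyByte;
        public int FirstChild;   // index of first child node, -1 if none
        public int NextSibling;  // index of next sibling node, -1 if none
        public int ValueIndex;   // index into a managed value array, -1 if none
    }

    private Node* _nodes;
    private readonly int _capacity;
    public int Count { get; private set; }

    public UnmanagedNodeBlock(int capacity)
    {
        _capacity = capacity;
        // Lives outside the GC heap: no object count growth, no GC scanning.
        _nodes = (Node*)NativeMemory.Alloc((nuint)capacity, (nuint)sizeof(Node));
    }

    public int Allocate(byte keyByte)
    {
        if (Count == _capacity)
            throw new InvalidOperationException("Block full (a real implementation would grow).");
        _nodes[Count] = new Node { KeyByte = keyByte, FirstChild = -1, NextSibling = -1, ValueIndex = -1 };
        return Count++;
    }

    // Forgetting this is exactly the leak risk mentioned above.
    public void Dispose()
    {
        if (_nodes != null)
        {
            NativeMemory.Free(_nodes);
            _nodes = null;
        }
    }
}
```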

The trie performs fairly well and has room for a lot of optimizations. The code is difficult to diagnose and debug, and the potential for memory leaks is high. It's also a 'Trie' and not a Radix Tree, so its memory usage can still be quite high overall for longer and random keys (using twice the memory of the Radix Tree for relatively short keys with repeated path data).

While it passes the unit tests for single writer, multi reader concurrency, I have concerns about its concurrency approach. This trie should be the second focus of the project. Being able to utilize unmanaged memory while remaining cross-platform compatible might make it attractive for unusual targets, like game development. Its approach makes it less compelling for server usage, as it's harder to provide guarantees about long-term system stability and memory leaks.

keithwill commented 1 year ago

959b9b0039f86864c212e0b543a2263e3a84746b