keithwill / TrieHard

A playground for building an efficient key and key prefix lookup collection.
MIT License

Performance Suggestions and Improvements #5

Closed keithwill closed 11 months ago

keithwill commented 1 year ago

This is an open issue for discussing approaches to improve the performance of the library and for documenting details from offline conversations.

keithwill commented 1 year ago

User DBMZA, in a Reddit post discussing TrieHard performance, made several suggestions for general performance improvements.

Out of these suggestions, I think understanding unchecked could be the most helpful, if it can reduce the cost of array bounds checking. The advice about registers and inlining is helpful, but might be difficult to take advantage of. I've used SkipLocalsInit in the unsafe trie, but I should probably experiment with it in other areas as well.
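As a concrete illustration (a hypothetical helper, not code from TrieHard), the unchecked and bounds-check points look like this. When a loop's bound is the array's own Length, the JIT can already elide the per-element bounds check; unchecked additionally guarantees no overflow checks on the arithmetic even if the project compiles with overflow checking enabled:

```csharp
using System;
using System.Runtime.CompilerServices;

static class BoundsCheckSketch
{
    // Hypothetical example: summing key bytes.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int SumBytes(byte[] keyBytes)
    {
        int sum = 0;
        unchecked // no overflow checks on the additions, even under /checked
        {
            // Bounds check elided by the JIT: i is provably < keyBytes.Length
            for (int i = 0; i < keyBytes.Length; i++)
            {
                sum += keyBytes[i];
            }
        }
        return sum;
    }
}
```

Patterns that index with a separately stored count (rather than the array's Length) defeat this elision, which is where unchecked indexing tricks or spans tend to matter more.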

keithwill commented 1 year ago

Currently the library makes heavy use of ArrayPoolList.cs to pool search results. It might be worth investigating whether an approach using sequences of memory could be used instead of a list backed by (large) pooled arrays. For very large search results the pooled arrays require a lot of copying, and the shared array pool instance we are using might not even accept the large arrays back into the pool. Even if they are allowed to return to the pool, they would reside on the LOH, which could cause memory fragmentation issues, and would backload an expensive GC the first time the runtime decides it needs to do a compacting Gen2 collection.
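For reference, the rent/copy/return pattern the ArrayPoolList relies on boils down to the sketch below (a hypothetical helper, not the TrieHard implementation). Arrays of roughly 85,000 bytes or more land on the Large Object Heap, and a Rent larger than the shared pool's biggest bucket is freshly allocated and simply dropped when returned, which is the failure mode described above:

```csharp
using System;
using System.Buffers;

static class PoolSketch
{
    // Hypothetical: copy a result set through a pooled scratch array.
    public static int[] CopyResults(ReadOnlySpan<int> results)
    {
        // May return an array larger than requested, from a size bucket.
        int[] rented = ArrayPool<int>.Shared.Rent(results.Length);
        try
        {
            results.CopyTo(rented);
            // The range operator on an array allocates an exact-size copy
            // for the caller; the oversized rental goes back to the pool.
            return rented[..results.Length];
        }
        finally
        {
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```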

The unsafe trie can avoid some of these issues by using custom enumerators that walk the graph as search results are returned, but that approach is fairly complicated, and does seem to impact enumeration performance.

A hybrid approach that uses ArrayPoolList instances up to some number of records and then swaps to using multiple array pool instances might make sense. But if we are willing to eat the cost of indirection when indexing the search results, it might make sense to implement a custom buffer, or to investigate whether class library conventions already exist for this (perhaps in System.Buffers, or an approach similar to what Pipelines does).
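One existing System.Buffers convention for the "sequences of memory" idea is ReadOnlySequence<T> built over chained segments: results accumulate in fixed-size arrays linked together, and the chain is exposed as one logical sequence with no copying on growth. A minimal sketch (hypothetical types, not from TrieHard):

```csharp
using System;
using System.Buffers;

// Hypothetical segment type: each segment wraps one array and links
// to the next, tracking its running offset within the whole sequence.
sealed class ResultSegment<T> : ReadOnlySequenceSegment<T>
{
    public ResultSegment(ReadOnlyMemory<T> memory) => Memory = memory;

    public ResultSegment<T> Append(ReadOnlyMemory<T> memory)
    {
        var next = new ResultSegment<T>(memory)
        {
            RunningIndex = RunningIndex + Memory.Length
        };
        Next = next;
        return next;
    }
}

static class SequenceSketch
{
    // Chain two result arrays into one ReadOnlySequence without copying.
    public static ReadOnlySequence<int> Chain(int[] a, int[] b)
    {
        var first = new ResultSegment<int>(a);
        var last = first.Append(b);
        return new ReadOnlySequence<int>(first, 0, last, b.Length);
    }
}
```

The trade-off is exactly the indirection cost mentioned above: random indexing into a ReadOnlySequence is slower than into a flat array, though forward enumeration over the segments stays cheap.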

keithwill commented 1 year ago

I had a brief chat with Tanner Gooding on the C# Discord, and two of his comments seemed relevant to this project. Many memory allocators use linked lists of arrays, where each node holds an array that is a multiple of the size of the previous one (e.g. starting at 4 and then doubling). That jogged my memory about a slideshow by Yoshifumi Kawai (author of a number of high-performance serializers), who mentioned using inline array structs instead of a linked list to contain the arrays, and pooling the struct. That might be a better approach for the Flat Trie or the Unsafe Trie for storing values, and it would definitely help the SearchResult.cs implementation (which currently uses array pools that have to be rented and returned for each resize and reset of the wrapping ArrayPoolList).
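The shape being described looks roughly like the following (a hypothetical sketch, not Kawai's or TrieHard's implementation): a fixed set of slots where slot i holds an array of length InitialSize << i, so capacity doubles per slot, growth never copies existing elements, and locating a flat index only walks a few size boundaries rather than chasing linked-list pointers:

```csharp
using System;

// Hypothetical doubling segmented buffer: slot 0 holds 4 elements,
// slot 1 holds 8, slot 2 holds 16, and so on. 27 slots is ample.
struct SegmentedBuffer<T>
{
    private const int InitialSize = 4;
    private T[][] _segments;
    private int _count;

    public int Count => _count;

    public void Add(T item)
    {
        _segments ??= new T[27][];
        (int segment, int offset, int size) = Locate(_count);
        _segments[segment] ??= new T[size]; // allocate slot lazily, no copying
        _segments[segment][offset] = item;
        _count++;
    }

    public T this[int index]
    {
        get
        {
            (int segment, int offset, _) = Locate(index);
            return _segments[segment][offset];
        }
    }

    // Map a flat index to (slot, offset within slot, slot length).
    private static (int, int, int) Locate(int index)
    {
        int segment = 0, offset = index, size = InitialSize;
        while (offset >= size)
        {
            offset -= size;
            size <<= 1;
            segment++;
        }
        return (segment, offset, size);
    }
}
```

The Locate loop could be replaced with bit tricks (e.g. System.Numerics.BitOperations.Log2) to make indexing branch-free, which is where this layout can beat a linked list.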

Tanner also mentioned that storing Vectors in a class directly is a bad idea, and that they should be loaded from an array when needed; he described them as an abstraction over enregistered variables. I had asked him casually whether keeping my search bytes packed in a Vector field was sensible, and he was clear that it was not. He also mentioned that loading Vectors from an array should perform the same as loading from a pointer, which is nice to know if I attempt this approach with the Flat Trie.
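The "load at the point of use" pattern looks like this (a hypothetical helper, not TrieHard code): the bytes live in an ordinary span, and the Vector128 values exist only as locals inside the method, so the JIT can keep them in SIMD registers instead of spilling a class field:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class VectorSketch
{
    // Hypothetical: find the first occurrence of a byte, vectorized.
    public static int IndexOfByte(ReadOnlySpan<byte> bytes, byte target)
    {
        int i = 0;
        // Vectors are created from memory right where they are used,
        // never cached in a field.
        while (i + Vector128<byte>.Count <= bytes.Length)
        {
            Vector128<byte> data =
                Vector128.Create(bytes.Slice(i, Vector128<byte>.Count));
            Vector128<byte> matches =
                Vector128.Equals(data, Vector128.Create(target));
            uint mask = Vector128.ExtractMostSignificantBits(matches);
            if (mask != 0)
                return i + BitOperations.TrailingZeroCount(mask);
            i += Vector128<byte>.Count;
        }
        for (; i < bytes.Length; i++) // scalar tail
            if (bytes[i] == target)
                return i;
        return -1;
    }
}
```

(The Vector128.Create-from-span and ExtractMostSignificantBits helpers require .NET 7 or later.)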

keithwill commented 1 year ago

I implemented a ReusableBuffer to experiment with as a replacement for the ArrayPoolList backing search results. I tried many variations of it, but ultimately did not find the concept compelling for TrieHard. With doubling buffers that don't redundantly copy to the next-size buffer, array indexing and enumeration were slower; handling the jagged array indexing is not free. When I modified the ReusableBuffer to copy every time the array doubles (so that only the most recently used array needs to be enumerated), the performance only matched the ArrayPoolList, maybe a few nanoseconds faster when used as a ref struct (which prevents boxing, so LINQ can't be used on it).

I'm sure there are ways to improve the buffering of search results, but this particular approach didn't yield useful benefits.

keithwill commented 1 year ago

SkipLocalsInitAttribute did not improve performance on the RadixTree where stackalloc was being used (to convert strings to UTF8). In a few cases its application (at the class level) even seemed to reduce performance in benchmarks, which is a bit confusing; based on the description of SkipLocalsInit, that should not happen. I'm wondering if the JIT already has optimizations to know when it's safe to skip stackalloc initialization (zeroing).
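The pattern in question is roughly this (a hypothetical sketch of the string-to-UTF8 conversion, not the RadixTree code). SkipLocalsInit asks the compiler not to emit the .locals init flag, so the stackalloc'd bytes are not zeroed before use; the flat benchmark results suggest, as speculated above, that the JIT may already be avoiding that zeroing in some of these cases:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

static class Utf8Sketch
{
    // Without this attribute, the stackalloc buffer below is zeroed
    // on every call; with it, the bytes start uninitialized.
    [SkipLocalsInit]
    public static int Utf8Length(string key)
    {
        // Small keys use the stack; large keys fall back to the heap.
        Span<byte> buffer = key.Length <= 64
            ? stackalloc byte[256]
            : new byte[Encoding.UTF8.GetMaxByteCount(key.Length)];
        return Encoding.UTF8.GetBytes(key, buffer);
    }
}
```

Since GetBytes overwrites exactly the bytes it reports, the uninitialized tail is never read, which is the precondition that makes SkipLocalsInit safe here.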

keithwill commented 1 year ago

With the RadixTree, I tried out various approaches for dealing with the costs of the dynamic storage related to search operations.

These approaches still did not perform very well, and I was unsatisfied with them. While they did not require buffering the entire result set at once, the normal pool-backed search using recursive collection still had better performance.

Changing the RadixTree nodes to also store their index within their parent offered the ability to enumerate without any intermediate stacks or buffers (each MoveNext grabbing the next key-value pair or value). I expected method call overhead to make this approach perform poorly, but it outperformed all the other approaches, including the array-pool-backed recursive one.
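The idea can be sketched as follows (hypothetical node shape and names, not the RadixTree implementation): because each node knows its parent and its own slot in the parent's child array, a depth-first successor function can descend to the first child or climb to the next sibling, with no stack or buffer between calls:

```csharp
using System;

// Hypothetical node: stores its parent and its index within the
// parent's child array, enabling stack-free traversal.
sealed class Node
{
    public string Key = "";
    public Node? Parent;
    public int IndexInParent;
    public Node[] Children = Array.Empty<Node>();
}

static class StackFreeWalk
{
    // Depth-first successor of 'current' within the subtree at 'root':
    // descend to the first child if any; otherwise climb until a node
    // with an unvisited next sibling is found.
    public static Node? Next(Node current, Node root)
    {
        if (current.Children.Length > 0)
            return current.Children[0];
        while (current != root)
        {
            Node parent = current.Parent!;
            int next = current.IndexInParent + 1;
            if (next < parent.Children.Length)
                return parent.Children[next];
            current = parent;
        }
        return null; // traversal complete
    }
}
```

Each MoveNext is then just one call to Next, which matches the observation that per-step method call overhead was cheaper than maintaining pooled stacks or result buffers.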