private-octopus / picoquic

Minimal implementation of the QUIC protocol
MIT License
527 stars 156 forks source link

Hash table implementation requires lots of tiny malloc #1531

Closed huitema closed 1 year ago

huitema commented 1 year ago

The current implementation of picohash follows the "separate chaining" method. The table is defined as an array of "bins". For each bin, there is a chain of "hash item" structures:

typedef struct _picohash_item {
    uint64_t hash;
    struct _picohash_item* next_in_bin;
    const void* key;
} picohash_item;

This method itself has pros and cons. The pro is that performance degrade only slowly when the table becomes more subscribed, eventually falling to O(N) behavior. This is somewhat safer than the "open addressing" methods such as linear probing, which degrades drastically when the table becomes nearly completely subscribed. However, we have evolved usage of hash tables in picoquic to require a "target number of items", which means that tables will not be oversubscribed. The drawback is that tables using separate chaining have worse memory locality than with open addressing method -- at this point, moving to open addressing would be beneficial.

The implementation itself however has an issue: the items are allocated through a malloc call when the entry is created. This may not be a performance item, but it is a failure point: this malloc call may fail, and then the application has to implement mitigations.

We should consider one of two improvements:

huitema commented 1 year ago

The current code uses the hash table in several locations:

The structure "hash_item" is only used inside picohash functions, in picohash.c or picohasttest.c, except for the function cidset_iterate used in log, which performs a specified callback for all elements of the hash table. That function could be rewritten, or replaced by a generic iterator.

huitema commented 1 year ago

After doing some experimentation, it seems that there are two big issues with a "linear probing" implementation of open chaining: setting the size of the hash table, and reacting to hash collisions.

The open chaining implementation requires either that the application correctly predicts the size of the hash table and its load factor, or that the hash table implementation supports dynamic resizing. Implementing resizing would defeat the original concern of reducing the impact of malloc: the resizing malloc will happen as the usage grows beyond expectation, and is thus likely to fail and introduce new failure modes in the application. But experience shows that predicting the right size can be hard.

The hash collision issue is subtle. When an entry is deleted, the algorithm requires "compacting holes" by moving back the next entry with a hash lower than the current one to the current one. However, the hash space is circular, not linear. To assess that a hash is "lower" than the previous one, nodes must perform a modulo operation. The entry is "lower" if the difference is smaller than half the modulo. This means that when doing insertion, the code should not look for holes further than half a modulo from the current value. The probability of failure is thus F**(N/2), where F is the load factor. If N is small, like maybe 4 or 8, and the load factor is 50%, this yields a failure probability of 25% or 6.25%. This means random failure, again creating a new failure mode for the application.

Overall, the current implementation appears more reliable. If we want to reduce the frequency of malloc, we should consider the other option, "insert the "hash_item" structure in the object being hashed", so that there is just one memory allocation per object, not one per object insertion in a hash table.

huitema commented 1 year ago

Analyzing the idea of inserting the "hash item" into the data structure. There seems to be impact on:

The potential for confusion is large, so it would make sense to have a picohash_create_ex that specifies whether the item memory is part of key or not. Maybe pass a function that provides the item as part of the key. If the function is present:

huitema commented 1 year ago

The embedding idea works, as shown in PR #1534. The hash table of CID was redefined to use the embedded approach, and this saves 2 malloc calls per CID, while making the code simpler. The table of ICID got the same treatment: the registration of ICID and addresss has the same lifetime as the connection context, thus can be embedded in the connection context. It is tempting to follow the same approach for the other tables, but each one has its one particularities:

table follow up
remember SSL resume tickets This table is used once per ticket, which means write often, read rarely. Updated, but consider using a splay
track connection by network address Added code to only populate the table if the CID length is null. Updated, register at most one address per path. May consider replacing by splay.
track connection by reset secret write many, use rarely. Updated. Consider a rewrite using a splay
table of CID for logging multipath packets only used to track CIDs in the picolog program. Can stay as is.
huitema commented 1 year ago

PR #1534 solved the issue, by removing all the tiny malloc allocations tied to hash tables. The code is thus smaller, somewhat faster, and certainly more robust.