Open zuiderkwast opened 7 months ago
@hpatro in another issue:
@madolson and I were discussing internally to explore open addressing further with embedding to get rid of the next pointers. @zuiderkwast The part which we were concerned about was how to support the SCAN command appropriately. Could you share some more thoughts on "scanning block-by-block until a block with the "ever full" bit zero"?
Yes I can.
When we scan, we scan one bucket at a time and the cursor corresponds to the bucket number.
The "ever full" bit is like a shared tombstone and it is the condition to whether probing the next block is necessary. So if it's zero, it means that no objects that hash to bucket N are instead inserted in bucket N+1 or later, due to probing. In this case, we can stop if we want. But if the "ever full" bit is one, some keys that should belong to N may have been inserted in N+1, N+2, etc. so to cover these when we scan bucket N, we must to continue scanning N+1, N+2, etc. until we reach a bucket with "ever full" zero.
Statistically, most of the buckets will have "ever full" set to zero. Otherwise the table is probably too full and we'll rehash it anyway. (Fill factor to be explored a bit.)
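A minimal sketch of what that scan step could look like, assuming a hypothetical 7-slot block with per-slot presence bits and an "ever full" bit; all names and the exact layout here are illustrative, not the actual implementation:

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SLOTS 7

/* Hypothetical block: 7 element slots, per-slot presence bits and the
 * shared "ever full" tombstone bit. */
typedef struct scanBlock {
    uint8_t hashes[BLOCK_SLOTS]; /* 8 secondary hash bits per slot */
    uint8_t presence : 7;        /* which of the 7 slots are occupied */
    uint8_t everfull : 1;        /* the block has been full at some point */
    void *elements[BLOCK_SLOTS]; /* pointers to the stored objects */
} scanBlock;

/* Emit everything that may belong to the block at `cursor`: the block itself
 * and, as long as the "ever full" bit is set, the following blocks that
 * probing may have spilled keys into. Returns the next cursor. */
size_t scanStep(scanBlock *table, size_t nblocks, size_t cursor,
                void (*emit)(void *element)) {
    size_t i = cursor;
    do {
        scanBlock *b = &table[i % nblocks];
        for (int slot = 0; slot < BLOCK_SLOTS; slot++) {
            if (b->presence & (1 << slot)) emit(b->elements[slot]);
        }
        if (!b->everfull) break; /* nothing hashing to earlier blocks spilled further */
        i++;
    } while (i % nblocks != cursor);
    return cursor + 1;
}

This can report a spilled key more than once (once for its home block and once for the block it landed in), which SCAN's at-least-once guarantee allows; a smarter version could filter on each element's home block.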
I want to implement this as soon as I get the chance...
The video linked in the issue is a good one. I recommend it to get a feel for the power of this and similar designs. (It doesn't cover scan though.)
I just watched the video (it's been on my watch list forever now) and that implementation does look very promising. I also pretty clearly see the path forward for the odd cases (SCAN and random/eviction). The implementation does have some overlap with key/value embedding, as we want to save an extra 8 bytes and avoid a memory lookup on the write path. I know @hpatro was doing some work to investigate
For the TTLs, we can keep pointers to objects with similar TTLs in segments in a RAX. The actual TTL is in the robj itself.
🏓 The TTL topic and this idea moved to #990. 🏓
I'm not convinced we want a RAX; I was thinking about something more like a B-tree that is purpose-built for TTLs. My thought was each node in the tree would have four pointers to different sub-elements. The sub-elements are either TTL segments (255 pointers to Valkey objects + metadata) or more nodes. We split nodes as they fill up. Once the nodes get to their max granularity (48 bits of TTL precision) we start chaining segments together.
Segments are limited to 255 items, so that we can keep a reference to them in the Valkey object (48-bit pointer + 8 bits to index into the segment) and find the element. We have 8 extra bits for future use. When we insert more items into the same segment, we may eventually need to split it, so this bounds the overhead spent rehashing these segments.
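A rough sketch of how such a segment reference could be packed into 64 bits, under the assumption that pointers fit in 48 bits; the struct and helper names here are illustrative, not an agreed-on design:

#include <stdint.h>

#define SEGMENT_CAPACITY 255

/* A TTL segment: up to 255 pointers to objects with similar TTLs, plus
 * whatever metadata the segment needs (TTL bounds, count, next segment...). */
typedef struct ttlSegment {
    uint8_t count;                /* occupied slots, 0..255 */
    void *objs[SEGMENT_CAPACITY]; /* pointers to the Valkey objects */
} ttlSegment;

/* An interior node: four children, each either another node or, at the
 * maximum granularity, a segment; a real implementation needs a way to
 * tell the two apart (e.g. a tag bit). */
typedef struct ttlNode {
    void *children[4];
} ttlNode;

/* Packed back-reference stored in the object: 48-bit segment pointer,
 * 8-bit slot index, 8 spare bits for future use. */
typedef uint64_t segmentRef;

static inline segmentRef makeSegmentRef(ttlSegment *seg, uint8_t slot) {
    return ((uint64_t)(uintptr_t)seg & 0xFFFFFFFFFFFFULL) | ((uint64_t)slot << 48);
}

static inline ttlSegment *refSegment(segmentRef ref) {
    return (ttlSegment *)(uintptr_t)(ref & 0xFFFFFFFFFFFFULL);
}

static inline uint8_t refSlot(segmentRef ref) {
    return (uint8_t)(ref >> 48); /* bits 56..63 remain spare */
}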
We always keep track of the node and segment with the nearest expiration, so we can semi-efficiently scan for the items in the nearest segment. The more items in Valkey, the tighter the bound on the TTL, as the segments get more precise.
Volatile TTL will work well, as we will have the nearest bucket to evict from. Volatile LRU and volatile random will work less well. We might consider just picking a random segment and evicting a random item, or the item with the worst LRU.
My slightly more detailed idea:
Interesting. Do you have a B-tree implementation we can pick?
48-bit pointers can work on amd64 and arm64, but z/Architecture uses full 64-bit pointers IIUC. And what about 32-bit mode?
This sounds like a very good idea. So each TTL segment is a range of min-TTL and max-TTL, and the keys in the segment are unordered?
I don't think we need incremental rehashing of the segments, do you? It's limited to 255 items and should be fast.
Storing both TTL and TTL-segment for each volatile key is a bit much. I'm wondering if we can live with only storing the TTL in the key-struct. We'd save 8 bytes per (volatile) key. We can find it using the B-tree instead. We don't need to find it for GET if it's not expired. We only need to find it to delete or set the TTL.
The different structs "Keyval", "Embedded key" and "TTL key" can be combined in all combinations? Even embedded value? Here's an idea:
struct valkey {
    unsigned type : 4;
    unsigned encoding : 4;
    unsigned lru : 24;
    unsigned hasttl : 1;
    unsigned embkey : 1;
    unsigned embvalue : 1;
    int refcount : 29;
    char data[];
    /* Data:
     * - long long expire, if hasttl
     * - embedded key if embkey, otherwise pointer to key
     * - embedded value if embvalue, otherwise pointer to value */
};
To avoid reallocating the struct when the value changes, we can use embedded value only if it fits in the zmalloc-usable-size, otherwise a pointer. In this way, we never need to reallocate the struct (apart from during defrag).
Example: If an embedded string is stored as an sds string plus one byte (explained in some PR about embedded key and value), then an embedded short string has 2 bytes of overhead. The struct valkey has 8 bytes of overhead. Assuming a key name of length 16, we'll be in a 48-byte jemalloc bin, meaning we can store up to 20 bytes of embedded value (8 + 2 + 16 + 2 + 20 = 48). If the value grows beyond that, we move it to its own allocation and replace it with a pointer.
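A tiny sketch of that capacity check; the usable allocation size would come from the allocator (a zmalloc-usable-size style query), the 8-byte header and 2-byte embedded-string overhead are the figures from the example above, and the function name is just illustrative:

#include <stddef.h>

/* Illustrative only: can a value of `vallen` bytes stay embedded in the same
 * allocation, given the usable size the allocator actually gave us? */
static int valueFitsEmbedded(size_t usable_size, size_t keylen, size_t vallen) {
    const size_t header_overhead = 8; /* struct valkey header, per the example */
    const size_t string_overhead = 2; /* per embedded sds-style string */
    size_t needed = header_overhead +
                    (string_overhead + keylen) +
                    (string_overhead + vallen);
    return needed <= usable_size;     /* e.g. 8 + (2+16) + (2+20) = 48 fits a 48-byte bin */
}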
Storing both TTL and TTL-segment for each volatile key is a bit much. I'm wondering if we can live with only storing the TTL in the key-struct. We'd save 8 bytes per (volatile) key. We can find it using the B-tree instead. We don't need to find it for GET if it's not expired. We only need to find it to delete or set the TTL.
A B-tree search is just expensive and will thrash the L3 cache with all the random pages we're loading on normal lazy expires. Maybe we need an implementation and a performance test to rule that out. I guess we have an open question of whether caching workloads primarily hit the lazy expiration deletion path or the active deletion path. We also need to compare that against random TTL updates: how often are folks arbitrarily updating the TTL? If we optimize just for active expiry, our problem is much easier.
On a side note, maybe keeping order is just not worth it. We could save two pointers today (the next pointer and the key pointer) in the expire dictionary if its hash table pointed directly to the robj as well. At that point we're just paying 8 bytes (TTL in the robj) + ~16 bytes (overhead from the hash table at a 50% load factor, which will be lower with the swiss table).
struct valkey {
Hehe, I like calling this a valkey.
Interesting. Do you have a B-tree implementation we can pick?
I did some brief research and didn't see a clear winner. I was maybe going to try initially with just an off-the-shelf C++ red-black tree to check performance first. I feel like if that doesn't work, the whole endeavor is moot. I'll look into this a bit more after rc1 launches.
This is an idea of the hash table approach I'd like to use for the main key space.
The picture is taken from https://www.youtube.com/watch?v=ncHmEUmJZf4 at 48:57.
It's an open addressing hash table consisting of blocks of 64 bytes (one cache line). Each block has space for 7 keys (no values, so it's semantically a set), as shown in the picture. For each key there are 8 hash bits, to quickly eliminate mismatching keys. In addition, there's one "presence" bit per slot to indicate whether it holds an element or is empty. Finally, one "ever full" bit indicates whether the block has ever been full. This is like a shared tombstone and determines whether probing the next block is necessary. There is no internal order among the 7 keys; a new element can be inserted in the first free slot within the block.
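For concreteness, a block along those lines could be laid out roughly like this; a sketch assuming 8-byte element slots on a 64-bit build, not the actual implementation:

#include <assert.h>
#include <stdint.h>

#define BLOCK_SLOTS 7

typedef struct hashBlock {
    uint8_t hashes[BLOCK_SLOTS]; /* 8 hash bits per key to reject mismatches early */
    uint8_t presence : 7;        /* one bit per slot: element present or empty */
    uint8_t everfull : 1;        /* shared tombstone: block has ever been full */
    void *elements[BLOCK_SLOTS]; /* the 7 element slots */
} hashBlock;

/* 7 hash bytes + 1 metadata byte + 7 * 8-byte slots = 64 bytes, one cache line. */
static_assert(sizeof(hashBlock) == 64, "block should be exactly one cache line");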
SCAN can work by scanning block-by-block until a block with the "ever full" bit zero.
Incremental rehashing can work the same way as the classic dict, i.e. have two tables allocated during rehash.
The element I want to put in the hash table is a pointer to an object (robj) which should contain all fields like key, value, TTL and LRU in a compact way, with or without an embedded key.
Splitting this into tasks:
Regarding changing how TTLs are stored, it doesn't need to be done in the scope of replacing the hash table implementation. I'm moving it to a separate issue: #990.