valkey-io / valkey

A flexible distributed key-value datastore that is optimized for caching and other realtime workloads.
https://valkey.io

[NEW] New data structure for TTL #990

Open zuiderkwast opened 1 month ago

zuiderkwast commented 1 month ago

Originally posted by @madolson in https://github.com/valkey-io/valkey/issues/169#issuecomment-2106368024

For the TTLs, we can keep pointers to objects with similar TTLs in segments in a RAX. The actual TTL is in the robj itself.

I'm not convinced we want a RAX; I was thinking about something more like a B-tree that is purpose-built for TTLs. My thought was that each node in the tree would have four pointers to different sub-elements. The sub-elements are either TTL segments (255 pointers to Valkey objects + metadata) or more nodes. We split nodes as they fill up. Once the nodes reach their maximum granularity (48 bits of TTL precision), we start chaining segments together.
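A rough sketch of what these two structures could look like in C (names, layout and field sizes are illustrative, not taken from the Valkey source):

```c
/* Sketch only; not from the Valkey source. */
typedef struct robj robj;                 /* Valkey's object type (server.h) */

#define SEGMENT_MAX_ITEMS 255
#define NODE_FANOUT 4

typedef struct ttlSegment {
    long long minTtl;                     /* smallest absolute expire time in this segment */
    unsigned char numItems;               /* up to 255 entries */
    struct ttlSegment *next;              /* chaining once max granularity is reached */
    robj *items[SEGMENT_MAX_ITEMS];       /* pointers to Valkey objects; the TTL lives in the object */
} ttlSegment;

typedef struct ttlNode {
    int isLeaf;                           /* children are segments (leaf) or more nodes (inner) */
    long long pivots[NODE_FANOUT - 1];    /* TTL boundaries between the four children */
    union {
        struct ttlNode *nodes[NODE_FANOUT];
        ttlSegment *segments[NODE_FANOUT];
    } child;
} ttlNode;
```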

Segments are limited to 255 items so that we can keep a reference to them in the valkey object (a 48-bit pointer plus 8 bits to index into the segment) and find the element from the object. We have 8 extra bits for future use. When we insert more items into the same segment, we may eventually need to split it, so this bounds the overhead spent rehashing these segments.
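The back-reference could be packed into a single 64-bit word; a minimal sketch, assuming user-space pointers fit in 48 bits (as on today's x86-64/ARM64):

```c
#include <stdint.h>

/* Low 48 bits: segment pointer, next 8 bits: index into the segment (0..254),
 * top 8 bits: spare for future use. Names are illustrative. */
#define SEG_PTR_MASK  ((1ULL << 48) - 1)
#define SEG_IDX_SHIFT 48

static inline uint64_t packSegmentRef(void *segment, uint8_t index) {
    return ((uint64_t)(uintptr_t)segment & SEG_PTR_MASK) |
           ((uint64_t)index << SEG_IDX_SHIFT);
}

static inline void *refSegment(uint64_t ref) {
    return (void *)(uintptr_t)(ref & SEG_PTR_MASK);
}

static inline uint8_t refIndex(uint64_t ref) {
    return (uint8_t)(ref >> SEG_IDX_SHIFT);   /* truncates away the spare bits */
}
```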

We are always keeping track of the node and segment with the nearest expiration, so we can semi-efficiently scan for items that are in the nearest segment. The more items in Valkey, the tighter the bound on the TTL, as the segments get more precise.
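An active-expiry pass over the tracked nearest segment might then look roughly like this, reusing the hypothetical ttlSegment above; nearestSegment, objectExpireTime() and dbDeleteExpired() are made-up names for illustration:

```c
extern ttlSegment *nearestSegment;   /* segment with the smallest expire time, maintained on insert/split */

void activeExpireNearestSegment(long long now) {
    ttlSegment *seg = nearestSegment;
    if (seg == NULL || seg->minTtl > now) return;
    for (int i = 0; i < SEGMENT_MAX_ITEMS; i++) {
        robj *o = seg->items[i];
        if (o != NULL && objectExpireTime(o) <= now)
            dbDeleteExpired(o);      /* deletes the key and unlinks it from the segment */
    }
    /* Once the segment is empty, advance nearestSegment to the next-nearest one. */
}
```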

Volatile TTL will work well, as we will have the nearest bucket to evict from. Volatile LRU and volatile random will work less well. We might consider just picking a random segment and evicting a random item or the item with the worst LRU.

My slightly more detailed idea:

[attached diagram: Untitled-2024-05-12-1052]

madolson commented 1 month ago

I also think it's worth comparing this against just trying to simplify the expire dictionary. If we embed the TTL and key directly into the robj, we can have the expire dictionary become a hashset of all the items with an expiration. The hash would still be computed from the key.

It's the same number of pointer dereferences to get to the TTL.
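The shape of that idea, as a hedged sketch (this is not the actual robj layout, and keyHash() is a hypothetical helper):

```c
#include <stdint.h>

typedef struct valkeyObject {
    unsigned type : 4;
    unsigned encoding : 4;
    long long expire;           /* absolute expire time in ms, or -1 if none */
    void *ptr;                  /* the value */
    char key[];                 /* the key embedded inline (sds in practice) */
} valkeyObject;

uint64_t keyHash(const char *key);  /* hypothetical key-hashing helper */

/* The expire structure becomes a hashset whose entries are the objects
 * themselves: hashing still goes through the key, and the TTL is one
 * dereference away, the same as with a separate expires dict. */
uint64_t expireSetHash(const void *entry) {
    const valkeyObject *o = entry;
    return keyHash(o->key);
}
```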

zuiderkwast commented 1 month ago

Exactly

Expire is a kvstore just like keys, so we change kvstore to use the swede table (sweet table?) in one go, so that we don't have to duplicate kvstore.

ranshid commented 1 month ago

I would like to raise an alternative option to embedding the TTL in robj. This is mainly in order to help solve https://github.com/valkey-io/valkey/issues/640.

At a high level: since items are generally sds strings, we can benefit from keeping an optional TTL embedded inside the sds instead of the robj.
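One hypothetical way this could look (this is not a real header from sds.h, just the shape of the idea; like the existing sds headers, the flags byte sits right before buf and the sds pointer points at buf):

```c
#include <stdint.h>

struct __attribute__((__packed__)) sdshdr8ttl {
    uint8_t len;         /* used length */
    uint8_t alloc;       /* allocated length, excluding header and null terminator */
    int64_t ttl;         /* absolute expire time in ms; only present in this variant */
    unsigned char flags; /* 3 low bits: type; a spare bit could mean "has TTL" */
    char buf[];          /* string data; the sds pointer points here */
};
```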

I also think it's worth comparing this against just trying to simplify the expire dictionary. If we embed the TTL and key directly into the robj, we can have the expire dictionary become a hashset of all the items with an expiration. The hash would still be computed from the key.

@madolson I agree that optimizing the expiry algorithm to iterate items in sorted order might not be that important for the active-expiry implementation. I was thinking of using some compressed bitmap implementation as a potential candidate for marking hash buckets/chunks that contain items holding a TTL, but this is still at a very early stage of investigation.
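A minimal sketch of that marking idea, assuming one bit per hash bucket (a real candidate would presumably be a compressed, roaring-style bitmap rather than this flat array; all names are made up):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t *words;     /* nbuckets / 64 words, zero-initialized */
    size_t nbuckets;
} volatileBucketMap;

/* Set the bit when a key with a TTL lands in the bucket. */
static inline void markVolatile(volatileBucketMap *m, size_t bucket) {
    m->words[bucket >> 6] |= 1ULL << (bucket & 63);
}

/* Active expiry can jump straight to buckets whose bit is set
 * instead of sampling the whole keyspace. */
static inline int hasVolatile(const volatileBucketMap *m, size_t bucket) {
    return (m->words[bucket >> 6] >> (bucket & 63)) & 1;
}
```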

zuiderkwast commented 1 month ago

Yes Ran, this is possible. I'll make sure the abstractions are made in a way that allows changing this part.

The first big thing is to replace the dict with a hashset of "valkey" objects. (A valkey object is an robj with a key, either embedded or at least with a key pointer embedded.) I would prefer that we don't do this TTL-in-sds with the current dict, because it would give me a mega-conflict. :)

ranshid commented 1 month ago

When we insert more items into the same segment, we may eventually need to split it, so this bounds the overhead spent rehashing these segments.

@madolson / @zuiderkwast just to circle back to the top suggestion and clarify your point here: you meant that when items are deleted OR added, we will also have to update all the items which share the segment, right? So that would mean updating at least d keys in the main dictionary to point to a new segment (given d = 127, i.e. degree 128 minus 1)?

The sub-elements are either TTL segments (255 pointers to Valkey objects + metadata) or more nodes.

I am not sure if you meant that the inner nodes in the B-tree are different from the leaf nodes? I think that, to minimize main dictionary updates, we could use a B+ tree, so that the inner nodes only keep 8-byte TTL keys, and splitting or merging internal nodes will not require main dictionary updates.
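The shape of the inner nodes in that B+ tree variant, as an illustrative sketch (names are not from the Valkey source): because inner nodes hold only TTL separator keys, splitting or merging them never moves an item between segments and never touches the back-references stored with the keys; only leaf (segment) splits do.

```c
#include <stdint.h>

#define BPLUS_DEGREE 128

typedef struct ttlSegment ttlSegment;     /* leaf payload: up to 255 object pointers + metadata */

typedef struct bplusInner {
    int numKeys;
    int64_t keys[BPLUS_DEGREE - 1];       /* TTL separators, 8 bytes each */
    void *children[BPLUS_DEGREE];         /* bplusInner* or, at the lowest level, ttlSegment* */
} bplusInner;
```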

zuiderkwast commented 1 month ago

Interesting. I'm not that familiar with B+ trees and where you got 42 bits from. I'd like to know more about it.

ranshid commented 1 month ago

Interesting. I'm not that familiar with B+ trees and where you got 42 bits from. I'd like to know more about it.

Yeah, at some point I thought we might use fewer bits (like we did for LRU) and keep some reference time from when the server starts. But that can be discussed at another time (I have since changed the comment above to say 8 bytes).