
Key/Value Embedding in Main Dictionary #394

Open hpatro opened 2 months ago

hpatro commented 2 months ago

Ref: https://github.com/redis/redis/pull/12498. I would like to restart the conversation in Valkey around achieving better memory efficiency for the main dictionary. @vitarb and I had proposed a change in the Redis project which encapsulates the key and value within the dictEntry. Please find the details below.

The problem/use-case that the feature addresses

The Redis main dictionary is a hash table that uses separate chaining on collision. An entry in the dictionary looks like the following:

struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next; 
};

Each member here is pointer-sized, so the entry consumes 24 bytes, which is a high overhead for actual data of small size. For the main dictionary, our observation is that the key is always an sds string and the value is a redisObject that either holds a pointer to the actual value or embeds the sds value directly if it's short (44 bytes or less). The overhead caused by the entry can be significantly reduced by changing the struct layout, which lets more key/value pairs be stored in the same amount of memory.

Description of the feature

Introduce a new entry structure which embeds both the key (sds) and the value (redisObject wrapper):

struct embeddedDictEntry {
    struct redisObject robj;
    struct dictEntry *next;
    unsigned char data[];
};

Embedded Dict Entry Layout

Key points

  1. Use a data array, similar to the Redis RAX design, to store the key.
  2. Promote the Redis object to be the dictEntry (saves a pointer).
  3. Save up to 15 bytes per entry, allowing 8-45% more key/value pairs to be packed into the same amount of space depending on the scenario (jemalloc bucketing).
  4. Achieve the above memory savings while maintaining the same latency/throughput compared to the unstable branch.
  5. Use the existing hash table design with a chaining-based approach for collisions.

Benchmarking

Server configuration

src/redis-server --daemonize yes --save "" --maxmemory 104857600

Benchmark configuration

src/redis-benchmark -t set -n 100000000 -r 1000000000 -d <value-size>

Results

| Key size (bytes) | Value size (bytes) | Actual memory used for embeddedDictEntry (bytes) | jemalloc bin used for embedded (bytes) | No. of keys (unstable) | No. of keys (unstable, human) | No. of keys (embedded) | No. of keys (embedded, human) | % gain (additional key/value stored) |
|---|---|---|---|---|---|---|---|---|
| 16 | 3 | 53 | 56 | 1078257 | 1.07 M | 1536113 | 1.53 M | 42.4 |
| 16 | 32 | 81 | 96 | 907804 | 907 K | 983446 | 983 K | 8.3 |
| 16 | 40 | 89 | 96 | 842961 | 842 K | 983446 | 983 K | 16.6 |
| 16 | 44 | 93 | 96 | 842961 | 842 K | 983446 | 983 K | 16.6 |
| 16 | 64 | 45 | 48 | 626508 | 626 K | 737181 | 737 K | 17.6 |

Drawbacks

Additional information

Previous discussion(s):

lipzhu commented 2 months ago

I am also working on the evaluation for short keys from a performance perspective. The idea is simple: just change the layout of dictEntry when zmalloc'ing it to benefit from the cache line. In my local POC implementation, I observe a 4% QPS gain because of the reduced memory latency of dictSdsKeyCompare.

taskset -c 0-3 ~/valkey/src/valkey-server /tmp/valkey.conf

port 9001
bind * -::*
daemonize no
protected-mode no
save ""

taskset -c 6-9 ~/valkey/src/valkey-benchmark -p 9001 -t set -d 100 -r 10000000 -n 10000000 -c 50 --threads 4 -P 20


zuiderkwast commented 2 months ago

I like the idea but I have a solution that avoids the drawbacks. It's very much like @madolson's "Rethinking the main dict" idea.

hpatro commented 2 months ago

> I am also working on the evaluation for short keys from performance perspective, the idea is simple, just change the layout of dictEntry when zmalloc dictEntry to benefit from cache line. In my local POC implementation, I can observe 4% QPS gain because of the reduced memory latency of dictSdsKeyCompare.

So the PR https://github.com/redis/redis/pull/12498 raised in the Redis project has two commits:

  1. Key embedding
  2. Key + Value embedding

@lipzhu This is exactly what we have done as part of key embedding.

hpatro commented 2 months ago

> I like the idea but I have a solution that avoids the drawbacks. It's very much like @madolson's "Rethinking the main dict" idea.
>
> • Replace 'dict' with a better hash table for the keyspace. I described it in Re-thinking the main hash table #169. No more dict entry.
> • Embed key and value in robj as a data array, like you do here.

@madolson and I had been discussing internally exploring open addressing further, combined with embedding, to get rid of the next pointers. @zuiderkwast The part we were concerned about was how to support the SCAN command appropriately. Could you share some more thoughts on "scanning block-by-block until a block with the 'ever full' bit zero"?

The key point, I feel, about this problem space is not the embedding itself but rather how we handle all the layers of code primarily affected by embedding. https://github.com/redis/redis/pull/12498#issuecomment-1696225457 I would like to re-highlight the Lifecycle/Ownership part here.

zuiderkwast commented 2 months ago

Regarding the hash table, I replied in the other issue.

For embedding, I think it would be very good if we can make robj opaque. Then we can completely reorganize it without the risk of breaking things.

lipzhu commented 1 month ago

> So the PR redis/redis#12498 raised in Redis project has two commits
>
> 1. Key embedding
> 2. Key + Value embedding
>
> @lipzhu This is exactly what we have done as part of key embedding.

@hpatro I took a look at https://github.com/redis/redis/pull/12498; my optimization is a little different from yours. I only embed the sds together with the dictEntry when they fit into a cache line. The optimization does not break the existing structure; it only aims to improve the CPU cache hit rate to reduce memory latency. From the profiling data I can observe that the cycle ratio of dictSdsKeyCompare decreased as expected.

The pseudocode is below; I can send a normalized patch for you to review if possible.

#define DEFAULT_CACHE_LINE_SIZE 64
static inline dictEntry *createEntryEmbKey(void *key, dictEntry *next) {
    size_t keylen = sdslen((sds)key);
    size_t total_size = keylen + sizeof(dictEntry) + sizeof(struct sdshdr8) + 1;
    /* Embed the key with the dictEntry when both fit into a cache line. */
    if (total_size <= DEFAULT_CACHE_LINE_SIZE) {
        dictEntry *entry = zmalloc(total_size);
        /* Lay out the sds header immediately after the entry. */
        struct sdshdr8 *sh = (void *)(entry + 1);
        sh->len = keylen;
        sh->alloc = keylen;
        sh->flags = SDS_TYPE_8;
        memcpy(sh->buf, key, keylen);
        sh->buf[keylen] = '\0';
        entry->key = sh->buf;
        entry->next = next;
        /* Tag the pointer so readers know the key is embedded. */
        return (dictEntry *)(void *)((uintptr_t)(void *)entry | ENTRY_PTR_EMB_KEY);
    }
    /* Key too long to embed: fall back to the regular out-of-line layout. */
    dictEntry *entry = zmalloc(sizeof(*entry));
    entry->key = key;
    entry->next = next;
    return entry;
}

madolson commented 1 month ago

@hpatro @zuiderkwast Do we really need to do key + value embedding together? I feel like we keep grouping them together, but IIRC the key embedding part is actually much more straightforward. I'm feeling a lot better about the pathway forward of using open addressing + embedding the key into the dict entry, while evaluating how to better support value embedding.

zuiderkwast commented 1 month ago

> open addressing + embed key into dict entry

With open addressing, there is no dictEntry.

I assume we're talking about any other place where we can embed stuff, e.g. robj.

I think embedding only the key in robj is more straightforward than key+value, because the value can be updated so that it no longer fits into the robj allocation. If robj is reallocated, we need to update all pointers to it. The key, on the other hand, is fixed-size and doesn't change until the key is deleted.

Let's take an incremental approach, and let's also consider what more we want to do related to this: the hash table and the expire structure.

madolson commented 1 month ago

> With open addressing, there is no dictEntry.

Yes, I guess I still get a bit confused, since practically we are converting the dictEntry to be the same as the robj.

> Let's take an incremental approach and let's also consider what more we want to do related to this. Hash table and expire structure.

I like this incremental approach so far, though I worry that it might lead us to a place where we need to unwind more code and implement it again. I was documenting my long-term vision in the other issue:

https://github.com/valkey-io/valkey/issues/169#issuecomment-2106368024

hpatro commented 1 month ago

> @hpatro @zuiderkwast Do we really need to do key + value embedding together? I feel like we keep grouping them together, but IIRC the key embedding part is actually much more straightforward. I'm feeling a lot better about the pathway forward of using open addressing + embed key into dict entry, while evaluating how to better support value embedding.

Yeah, I had discussed the same with @vitarb. We can easily gain around 7 bytes per key (IIRC) with the approach we proposed, and the code is pretty straightforward. Embedding the key into dictEntry doesn't introduce much complexity. It can be done for 8.0.

Getting rid of dictEntry needs more prototyping and thought, which might arrive in Valkey 9, maybe?

zuiderkwast commented 1 month ago

Adding this complexity and coupling between dict and robj, if we later want to undo it, is not worth 7 bytes IMO. There are quite a lot of changes in that Redis PR. It's better that we focus on what we want in the longer term.

Let's consider also some ongoing work where an robj is pinned to an I/O thread so that it can be freed by the same thread that allocated it, which maybe makes this embedding of key and value more complex.

To avoid the problem that we need to reallocate an robj when the value changes, we can embed the value only if it fits in the allocation's usable size. If it grows (e.g. by the APPEND command), we move it to a separate allocation.

hpatro commented 1 month ago

@zuiderkwast We've made changes in the past that benefit users in the short term and then redone the implementation in a following major/minor version. Key embedding in dictEntry is a pretty small change, and we can easily get rid of it with dictEntry removal. I feel the small gain (7 bytes) is worth this minimal change for Valkey 8.0, and we can invest in the kvpair object structure with open addressing in 9.0. I've submitted the PR over here; we can discuss further.