
[NEW] Introduce slot-level memory metrics #852

Open kyle-yh-kim opened 3 months ago

kyle-yh-kim commented 3 months ago

This is a continuation of https://github.com/valkey-io/valkey/issues/20, for memory metrics.

The problem/use-case that the feature addresses

With per-slot memory metrics, Valkey cluster users will be able to re-shard their cluster to balance memory usage. Specifically, when scaling in, per-slot memory metrics can be consulted to ensure that the outgoing slots will "fit" in the target shard.

Description of the feature

Tracks per-slot memory metrics.


Alternatives you've considered

The per-slot memory problem can be broken down into two high-level design candidates: 1) a cron job, and 2) amortized calculation per mutation. Each has its low-level design candidates listed below.

1. Cron job

Upon a user's request (e.g. through a new Valkey command, SCAN-MEMORY), start a cron job that scans over Valkey's per-slot dictionary and calculates per-slot memory consumption.
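A minimal standalone sketch of what such a scan could look like; the `slotEntry` list, `mem_bytes` field, and `scanSlotMemory()` are illustrative stand-ins for Valkey's per-slot dictionary and per-key accounting, not the actual internals:

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Illustrative stand-in for Valkey's per-slot dictionary: each slot holds a
 * list of keys, each with a precomputed memory footprint. */
typedef struct slotEntry {
    struct slotEntry *next;
    size_t mem_bytes; /* memory footprint of this key */
} slotEntry;

static slotEntry *slot_keys[CLUSTER_SLOTS]; /* one key list per slot */
static size_t slot_memory[CLUSTER_SLOTS];   /* result of the scan */

/* One full pass over the keyspace: O(total keys). The result is a
 * point-in-time snapshot and may be stale by the time the scan finishes. */
static void scanSlotMemory(void) {
    for (int slot = 0; slot < CLUSTER_SLOTS; slot++) {
        size_t sum = 0;
        for (slotEntry *e = slot_keys[slot]; e != NULL; e = e->next)
            sum += e->mem_bytes;
        slot_memory[slot] = sum;
    }
}
```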

Pros

Cons

Contextually, a cron job does not fit the use cases for per-slot memory statistics well. Consider a user who needs to scale in or out immediately: they would have to first request a cron job, wait for it to finish (with no upper bound on latency), and the result would be partially stale, and thus no longer accurate, due to the scan's iterative nature.

The following are the cron job's low-level candidates:

1A. [Cron-job] Main thread.

Similar to how serverCron() and many other periodic tasks work. Since serverCron() by default runs only 10 times per second (hz=10), fast completion cannot be expected on the main thread; the scan would have to be time-sliced across cron ticks.
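To make that constraint concrete, here is a hedged sketch of a time-sliced scan budgeted per cron tick; `slotMemoryScanCron`, `computeSlotMemory`, and `SLOTS_PER_TICK` are hypothetical names, and the arithmetic assumes the default hz=10:

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384
#define SLOTS_PER_TICK 128 /* work budget per cron invocation */

static size_t slot_memory[CLUSTER_SLOTS];
static int scan_cursor = -1; /* -1 = no scan in progress */

/* Stub: a real version would walk the slot's dictionary and sum key sizes. */
static size_t computeSlotMemory(int slot) { (void)slot; return 0; }

/* Called once per cron tick. At hz=10 this runs ten times per second, so a
 * full pass over 16384 slots takes 16384 / (128 * 10) ≈ 13 seconds even
 * before per-key costs are counted. */
void slotMemoryScanCron(void) {
    if (scan_cursor < 0) return; /* no scan requested */
    int budget = SLOTS_PER_TICK;
    while (budget-- > 0 && scan_cursor < CLUSTER_SLOTS) {
        slot_memory[scan_cursor] = computeSlotMemory(scan_cursor);
        scan_cursor++;
    }
    if (scan_cursor >= CLUSTER_SLOTS) scan_cursor = -1; /* scan complete */
}
```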

1B. [Cron-job] Background thread.

Spawns a background thread instead. This would require multi-threaded support in the existing scan infrastructure, which comes at high implementation complexity and a performance overhead due to locking.

1C. [Cron-job] Fork process.

Simply fork the process; this is the fastest and easiest implementation. However, it can incur up to 2x memory overhead from copy-on-write, which is not ideal for users seeking to scale in or out immediately.

2. Amortized

Calculates the difference in memory usage on every memory mutation, amortizing the cost across operations. CLUSTER SLOT-STATS then simply returns values from a per-slot array in which the calculation has already been performed.
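The core of the amortized approach fits in a few lines; `slotMemoryAddDelta` and `slotMemoryGet` are hypothetical helpers, assuming some hook computes a signed byte delta for each mutation:

```c
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Continuously maintained counters: every mutation applies a signed delta,
 * so reads are O(1) and always current. */
static int64_t slot_memory[CLUSTER_SLOTS];

/* Hypothetical hook, called wherever a key in `slot` grows or shrinks. */
void slotMemoryAddDelta(int slot, int64_t delta_bytes) {
    slot_memory[slot] += delta_bytes;
}

/* A CLUSTER SLOT-STATS handler would just report the precomputed value. */
int64_t slotMemoryGet(int slot) {
    return slot_memory[slot];
}
```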

Pros

Cons

The following are the amortized approach's low-level candidates:

2A. [Amortized] Track each key-space size per-mutation.

On every key-space mutation, track the key's size in each memory-sparse data structure (e.g. dict, quicklist) via a newly introduced size_t field. The difference is then aggregated at the per-slot level through hooks such as lookupKey() and afterCommand(). The size_t field is not necessary for memory-contiguous data structures (e.g. sds, listpack), as their size is already captured in the header.
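A rough sketch of how the before/after diffing could work, assuming hypothetical `onLookupKey`/`onAfterCommand` hooks mirroring lookupKey() and afterCommand(); `dictLike` and `tracked_bytes` stand in for any memory-sparse structure carrying the new size_t field, and the snapshot is simplified to one key per command:

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Stand-in for any memory-sparse structure (dict, quicklist, ...) that
 * carries the newly introduced size field. */
typedef struct dictLike {
    size_t tracked_bytes; /* updated by the structure's own (re)allocations */
} dictLike;

static int64_t slot_memory[CLUSTER_SLOTS];
static size_t before_bytes; /* simplified: one key per command */

/* lookupKey()-style hook: snapshot the size before the command mutates. */
void onLookupKey(const dictLike *d) {
    before_bytes = d->tracked_bytes;
}

/* afterCommand()-style hook: charge the delta to the key's slot. */
void onAfterCommand(const dictLike *d, int slot) {
    slot_memory[slot] += (int64_t)d->tracked_bytes - (int64_t)before_bytes;
}
```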

2B. [Amortized] Use zmalloc intention and windows.

Separate all existing zmalloc calls into three intentions: 1) transient (temporary buffers), 2) internal (book-keeping of the Valkey server), and 3) user-data (where the user's data is actually stored). Then, for every zmalloc call, specify its intention among the three options. Only user-data allocations are accumulated into the global per-slot counter.
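A sketch of the intention-tagged allocator idea; `zmalloc_tagged`/`zfree_tagged` and the `allocIntent` names are illustrative (plain malloc/free stand in for zmalloc/zfree), and `current_slot` assumes the caller has already resolved the slot of the key being mutated:

```c
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

/* The three allocation intentions described above (names are illustrative). */
typedef enum {
    ALLOC_TRANSIENT, /* temporary buffers */
    ALLOC_INTERNAL,  /* server book-keeping */
    ALLOC_USER_DATA, /* user payloads; the only class charged to a slot */
} allocIntent;

static int64_t slot_memory[CLUSTER_SLOTS];
static int current_slot; /* slot of the key being mutated, set by the caller */

/* Tagged wrapper: only user-data allocations move the per-slot counter. */
void *zmalloc_tagged(size_t size, allocIntent intent) {
    void *p = malloc(size); /* stand-in for zmalloc() */
    if (p != NULL && intent == ALLOC_USER_DATA)
        slot_memory[current_slot] += (int64_t)size;
    return p;
}

void zfree_tagged(void *p, size_t size, allocIntent intent) {
    if (p != NULL && intent == ALLOC_USER_DATA)
        slot_memory[current_slot] -= (int64_t)size;
    free(p); /* stand-in for zfree() */
}
```

The cost of this option is touching every zmalloc call site to classify its intention, which is why 2C below narrows the tagging to the data structures themselves.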

2C. [Amortized] Tagging user-defined data-structure.

An improved version of Option 2B. For data structures, we really only need to differentiate the user-data intention from the rest, by holding a new bit, is_user_data, in the structure's header (e.g. quicklist, sdshdr). If the bit is set, all of the structure's zmalloc calls increment/decrement a global per-slot counter.
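A sketch of the header-bit variant, assuming a hypothetical `taggedHeader` and `structAlloc` wrapper; the point is that only structures with `is_user_data` set touch the per-slot counter, so individual call sites need no intention argument:

```c
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 16384

static int64_t slot_memory[CLUSTER_SLOTS];
static int current_slot; /* slot of the key being mutated, set elsewhere */

/* Illustrative stand-in for a real header such as quicklist's or sdshdr:
 * one bit marks whether the structure holds user data. */
typedef struct taggedHeader {
    unsigned is_user_data : 1;
} taggedHeader;

/* Allocations made on behalf of a tagged structure charge the slot; internal
 * and transient allocations pass NULL or an untagged header and are ignored. */
void *structAlloc(const taggedHeader *h, size_t size) {
    void *p = malloc(size); /* stand-in for zmalloc() */
    if (p != NULL && h != NULL && h->is_user_data)
        slot_memory[current_slot] += (int64_t)size;
    return p;
}
```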

Appendix

Table A1. Memory-sparse and memory-contiguous data structures in Valkey.

| memory-sparse data structures | memory-contiguous data structures |
|---|---|
| dict | listpack |
| rax | ziplist |
| quicklist | intset |
| zskiplist | sds |
madolson commented 1 month ago

@zvi-code I believe you wanted to comment on this.