This is a continuation of https://github.com/valkey-io/valkey/issues/20, for memory metrics.
The problem/use-case that the feature addresses
Through per-slot memory metrics, Valkey cluster users will be able to re-shard their cluster based on balanced memory usage. Specifically when scaling in, per-slot memory metrics can be consulted to ensure that the outgoing slots would "fit" on the target shard.
Description of the feature
Tracks per-slot memory metrics.
Alternatives you've considered
The per-slot memory problem can be broken down into two high-level design candidates: 1) cron-job, and 2) amortized calculation per mutation, each with its low-level design candidates listed below.
1. Cron job
Upon the user’s request (ex: through a new Valkey command SCAN-MEMORY), initialize a cron-job that scans over Valkey’s per-slot dictionary to calculate per-slot memory consumption.
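A minimal sketch of what one scan pass could look like, assuming a simplified per-slot keyspace layout and an object_compute_size() stand-in for the existing objectComputeSize(); the real Valkey types and signatures differ:

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

/* Illustrative stand-ins for Valkey's per-slot dictionary and for
 * objectComputeSize(); not the actual server types. */
typedef struct kv_entry {
    struct kv_entry *next;
    void *key;
    void *val;
} kv_entry;

extern kv_entry *slot_keys[CLUSTER_SLOTS];                       /* per-slot key chains */
extern size_t object_compute_size(const void *key, const void *val);

uint64_t slot_memory[CLUSTER_SLOTS];   /* result of the last scan */

/* Sum the approximate memory footprint of every key hashed to one slot. */
uint64_t compute_slot_memory(int slot) {
    uint64_t bytes = 0;
    for (kv_entry *e = slot_keys[slot]; e != NULL; e = e->next)
        bytes += object_compute_size(e->key, e->val);
    return bytes;
}
```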
Pros
[For Option 1A and Option 1C] Less invasive. The scan infrastructure and objectComputeSize() are already implemented. We’ll just have to latch on top of them.
Cons
Contextually, a cron-job does not fit well with the use-cases for the per-slot memory statistics. Consider a user who needs to scale in and out immediately. They would be required to first request a cron-job, wait for it to finish (no upper bound), and the result will be partially stale and thus no longer accurate due to the scan’s iterative nature. In summary:
The user needs to wait for the scanning to complete (no upper bound).
The data (especially for those slots iterated earlier) will be stale and less accurate.
[For Option 1B] Invasive changes due to multi-threading considerations for the existing scan infrastructure.
The following are the cron job’s low-level candidates:
1A. [Cron-job] Main thread.
Similar to how it’s done with serverCron() and many others. Since serverCron() by default runs 10 iterations per second (hz 10), fast completion cannot be expected on the main thread.
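To make the pacing concrete, the sketch below (building on the hypothetical compute_slot_memory() above) shows how a pass could be chunked across cron ticks; SLOTS_PER_TICK and scan_memory_step() are made-up names, not proposed APIs:

```c
#include <stdint.h>

#define CLUSTER_SLOTS   16384
#define SLOTS_PER_TICK  128   /* per-tick budget; the number is made up */

extern uint64_t slot_memory[CLUSTER_SLOTS];      /* from the scan sketch above */
extern uint64_t compute_slot_memory(int slot);

static int scan_cursor = 0;        /* next slot to visit */
static int scan_in_progress = 0;   /* set to 1 when a scan is requested */

/* Called from the cron handler; with the default hz of 10 this runs roughly
 * ten times per second, so a full pass over 16384 slots takes many seconds. */
void scan_memory_step(void) {
    if (!scan_in_progress) return;
    int end = scan_cursor + SLOTS_PER_TICK;
    if (end > CLUSTER_SLOTS) end = CLUSTER_SLOTS;
    for (; scan_cursor < end; scan_cursor++)
        slot_memory[scan_cursor] = compute_slot_memory(scan_cursor);
    if (scan_cursor == CLUSTER_SLOTS) {
        scan_in_progress = 0;   /* pass complete; earlier slots are already stale */
        scan_cursor = 0;
    }
}
```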
1B. [Cron-job] Background thread.
Spawns a background thread instead. This will require multi-threaded support in the existing scan infrastructure, which will come at high implementation complexity and performance overhead due to locking.
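A rough illustration of where the locking overhead comes from, assuming a single hypothetical keyspace_lock that every main-thread mutation would also have to take (the real synchronization scheme would be finer-grained and more involved):

```c
#include <pthread.h>
#include <stdint.h>

#define CLUSTER_SLOTS 16384

extern uint64_t slot_memory[CLUSTER_SLOTS];      /* from the scan sketch above */
extern uint64_t compute_slot_memory(int slot);
extern pthread_mutex_t keyspace_lock;            /* hypothetical: mutations would also take it */

static void *scan_memory_thread(void *arg) {
    (void)arg;
    for (int slot = 0; slot < CLUSTER_SLOTS; slot++) {
        pthread_mutex_lock(&keyspace_lock);      /* blocks main-thread writes meanwhile */
        slot_memory[slot] = compute_slot_memory(slot);
        pthread_mutex_unlock(&keyspace_lock);
    }
    return NULL;
}

void start_background_scan(void) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, scan_memory_thread, NULL) == 0)
        pthread_detach(tid);
}
```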
1C. [Cron-job] Fork process.
Simply fork the process, which is the fastest and easiest implementation. However, it will incur at worst a 2x copy-on-write memory overhead, which is not ideal for users seeking to scale in / out immediately.
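For comparison, a fork-based pass might look roughly like the sketch below (shipping the child’s result back to the parent, e.g. over a pipe, is omitted); the copy-on-write cost comes from the parent continuing to serve writes while the child holds the snapshot:

```c
#include <stdint.h>
#include <unistd.h>

#define CLUSTER_SLOTS 16384

extern uint64_t slot_memory[CLUSTER_SLOTS];      /* from the scan sketch above */
extern uint64_t compute_slot_memory(int slot);

/* The child scans a copy-on-write snapshot, so no locking is needed; every page
 * the parent writes while the child is alive gets duplicated, hence the
 * worst-case ~2x memory overhead. */
void start_forked_scan(void) {
    pid_t pid = fork();
    if (pid == 0) {
        for (int slot = 0; slot < CLUSTER_SLOTS; slot++)
            slot_memory[slot] = compute_slot_memory(slot);
        _exit(0);   /* result would be shipped back via pipe / shared memory */
    }
    /* parent: keeps serving traffic */
}
```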
2. Amortized
Calculates the difference in memory usage on every memory mutation, thus amortizing the cost. CLUSTER SLOT-STATS will simply return values from an array where the calculation has already been performed.
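In its simplest form this amounts to one counter per slot that every write path adjusts by its net allocation delta; a hypothetical sketch (the names are illustrative, not the proposed implementation):

```c
#include <stdint.h>

#define CLUSTER_SLOTS 16384

static int64_t slot_memory_usage[CLUSTER_SLOTS];

/* Called from the mutation path with the net allocation delta of the command
 * (positive when the keyspace grows, negative when it shrinks). */
void slot_memory_add_delta(int slot, int64_t delta) {
    slot_memory_usage[slot] += delta;
}

/* CLUSTER SLOT-STATS path: no scanning, just read the precomputed value. */
int64_t slot_memory_get(int slot) {
    return slot_memory_usage[slot];
}
```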
Pros
Immediate response. Simply return the already computed values.
Freshness / liveness and accuracy of data.
Cons
More invasive. Requires intricate changes to the memory-sparse data-structures (ex: dict.c, quicklist.c, etc.) [Table A1].
[Only for Option 2A] Incurs an 8-byte memory overhead per key from the size_t field used to track the key-space size.
The following are the amortized approach's low-level candidates:
2A. [Amortized] Track each key-space size per-mutation.
At every key-space mutation, track its key size under each memory-sparse data-structure (ex: dict, quicklist) via a newly introduced size_t field. Then, its difference is aggregated at the per-slot level through hooks such as lookupKey() and afterCommand(). The size_t field is not necessary for memory-contiguous data-structures (ex: sds, listpack), as their size is already captured in the header.
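A hedged sketch of what the hook-based aggregation could look like; sparse_ds and the hook names merely mirror the roles of dict/quicklist and lookupKey()/afterCommand(), and slot_memory_add_delta() is the hypothetical per-slot counter from the sketch above:

```c
#include <stddef.h>
#include <stdint.h>

extern void slot_memory_add_delta(int slot, int64_t delta);

/* Stand-in for a memory-sparse structure (dict, quicklist, ...) carrying the
 * newly introduced 8-byte size field that every mutation keeps up to date. */
typedef struct sparse_ds {
    size_t tracked_bytes;
    /* ... existing fields ... */
} sparse_ds;

static size_t size_before;   /* snapshot taken on key lookup */

/* lookupKey()-style hook: remember the size before the command mutates it. */
void on_lookup_key(const sparse_ds *ds) {
    size_before = ds->tracked_bytes;
}

/* afterCommand()-style hook: charge the net difference to the key's slot. */
void on_after_command(const sparse_ds *ds, int slot) {
    slot_memory_add_delta(slot, (int64_t)ds->tracked_bytes - (int64_t)size_before);
}
```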
2B. [Amortized] Use zmalloc intention and windows.
Separate all existing zmalloc calls into three intentions: 1) transient (temporary buffers), 2) internal (internal book-keeping of the Valkey server), and 3) user-data (where the user’s data is actually stored). Then, for every zmalloc call, specify its intention among the three options. Only user-data allocations are accumulated into the global per-slot counter.
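A minimal sketch of the intention-tagged allocator, assuming a wrapper around zmalloc()-style allocation and a hypothetical current_slot tracked while a keyed command executes; none of these names are existing Valkey APIs:

```c
#include <stdint.h>
#include <stdlib.h>

extern void slot_memory_add_delta(int slot, int64_t delta);
extern int current_slot;   /* slot of the key currently being mutated, or -1 */

typedef enum {
    ALLOC_TRANSIENT,   /* temporary buffers */
    ALLOC_INTERNAL,    /* server book-keeping */
    ALLOC_USER_DATA    /* where the user's data actually lives */
} alloc_intention;

/* Every existing zmalloc call site would pass one of the three intentions;
 * only user-data allocations are charged to the per-slot counter. */
void *zmalloc_with_intention(size_t size, alloc_intention intention) {
    void *ptr = malloc(size);   /* stand-in for the real zmalloc */
    if (ptr != NULL && intention == ALLOC_USER_DATA && current_slot >= 0)
        slot_memory_add_delta(current_slot, (int64_t)size);
    return ptr;
}
```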
2C. [Amortized] Tagging user-defined data-structure.
An improved version of Option 2B. For data-structures, we really just need to differentiate the user-data intention from the rest by holding a new bit is_user_data under their headers (ex: quicklist, sdshdr). If the bit is enabled, all their zmalloc calls will increment / decrement the global per-slot counter.
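A sketch of the header-bit variant, again with hypothetical names; the only per-structure cost is the single is_user_data bit, and the accounting decision is made at allocation / free time:

```c
#include <stdint.h>

extern void slot_memory_add_delta(int slot, int64_t delta);

/* Stand-in header shared by data-structures such as quicklist / sdshdr,
 * extended with a one-bit user-data tag. */
typedef struct tagged_header {
    unsigned is_user_data : 1;
    /* ... existing header fields ... */
} tagged_header;

/* Called wherever the structure allocates or frees memory on its own behalf:
 * pass +bytes on allocation and -bytes on free. */
void account_slot_memory(const tagged_header *hdr, int slot, int64_t bytes) {
    if (hdr->is_user_data)
        slot_memory_add_delta(slot, bytes);
}
```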
Appendix
Table A1. Memory sparse and memory contiguous data-structures in Valkey.