naglera opened this issue 2 months ago
Currently, Valkey does not provide a way to reliably estimate the size of the entire database.
@naglera I think we should carefully define the issues in the current implementation as well as scope the requirements. For example, today we have the dataset size calculation, which is a rough estimation of the dataset size. Can you please explain what the current issues with this calculation are?
Dynamic Replication Buffer Size Allocation: With a reliable estimate of the database size, Valkey could dynamically allocate an appropriate replication buffer size for dual-channel replication sync. This would optimize memory usage and prevent over-allocation or under-allocation of the replication buffer.
I agree that the replication local buffer size estimation can benefit from knowing the dataset size. When the memory footprint is known, we can set the local replication buffer accordingly to make sure we will have enough memory to load the RDB.
Disk Space Verification and Capacity Planning: Knowing the expected database size would allow Valkey to verify if there is enough disk space available for persistence operations (e.g., RDB snapshots). This could prevent potential issues caused by running out of disk space. Additionally, having an accurate estimate of the database size would aid in capacity planning, monitoring, and resource management.
Disk space is impacted by the RDB size, while the dataset size is impacted by the in-memory utilization. IMO there might be a large difference between them. For example, IIRC the key itself will take at least double the memory if it has an expiry time in memory (DB + Expire), but appears only once inside the RDB file.
Slot Management and Migration: Knowing the size of every slot would help manage slots and migrate them accordingly, enabling better load balancing and cluster management.
I agree that per-slot memory size is important, but I think it is already on the roadmap somewhere in CLUSTER SLOT-STATS.
Maintain appropriate counters that track the size of data structures (e.g., key-value pairs, hashes, sets, etc.) and metadata on each write, delete, or update command.
As stated before, this might be a good idea, but it might not be so trivial to implement and maintain. I would ask again: how bad is the current dataset size evaluation, and is it really that problematic? (I can think of a few things like fragmented memory posing issues, but I am not sure they are explicit blockers.)
The estimated database size could be returned to the user or used internally by Valkey for dynamic buffer allocation, disk space verification, or slot management and migration.
I think the requirement should also include introducing a new aux header in the RDB. This way, the dual-channel implementation could adjust the buffer based on the value read from the RDB.
Proactive disk space management and prevention of potential issues during persistence operations.
What did you have in mind here? Warning the user of a predicted full-sync failure? I think this might still be up to an external management sidecar to decide, based on query statistics.
Accuracy: The estimation should aim for a high level of accuracy, potentially considering various factors that could affect the final database size.
Maybe we should be more detailed here? Like: "it is OK to have a reported size larger than the 'real' dataset size, but not smaller..."
I think we should carefully define the issues in the current implementation as well as scope the requirements. For example, today we have the dataset size calculation, which is a rough estimation of the dataset size. Can you please explain what the current issues with this calculation are?
- The main issue with the current dataset size calculation is that it does not take into account the internal fragmentation. The dataset size is the total used memory minus the internal server struct sizes, but it fails to accurately reflect the true memory footprint due to the internal fragmentation factor.
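To make the gap concrete, here is a minimal sketch (not a proposal for the exact formula): `used_memory_dataset` and `mem_fragmentation_ratio` are existing `INFO memory` fields, and the division is only an illustration of the kind of fragmentation-aware correction that is currently missing.

```python
def estimate_fresh_load_size(used_memory_dataset: int,
                             mem_fragmentation_ratio: float) -> int:
    """Illustrative only: approximate how much memory the same dataset would
    occupy on a freshly started instance (minimal fragmentation) by
    discounting the source node's fragmentation factor."""
    ratio = max(mem_fragmentation_ratio, 1.0)  # ratios < 1.0 usually indicate swapping, not savings
    return int(used_memory_dataset / ratio)

# Example: a node reports a 12 GiB dataset with a 1.5 fragmentation ratio;
# a fresh instance would likely need closer to 8 GiB for the same data.
print(estimate_fresh_load_size(12 * 1024**3, 1.5))
```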
I agree that the replication local buffer size estimation can benefit from knowing the dataset size. When the memory footprint is known, we can set the local replication buffer accordingly to make sure we will have enough memory to load the RDB.
- The requirement here is to allow the replica to use all its available memory for buffering replication changes while ensuring there is still enough memory available to store the incoming RDB. For example, if a replica has 10GB and the snapshot size is 9GB, we would like the replica to buffer a maximum of 1GB of replication changes.
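As a minimal sketch of that sizing rule (the inputs are assumptions for illustration, not existing config fields):

```python
def max_replication_buffer(replica_maxmemory: int, expected_rdb_load_size: int) -> int:
    """Cap the replica's local replication buffer so that buffered changes
    plus the incoming RDB still fit within the replica's memory limit."""
    return max(replica_maxmemory - expected_rdb_load_size, 0)

GiB = 1024 ** 3
# The replica has 10 GiB available and the snapshot is expected to need ~9 GiB
# once loaded, so it should buffer at most 1 GiB of replication changes.
print(max_replication_buffer(10 * GiB, 9 * GiB) // GiB)  # -> 1
```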
Disk space is impacted by the RDB size, while the dataset size is impacted by the in-memory utilization. IMO there might be a large difference between them. For example, IIRC the key itself will take at least double the memory if it has an expiry time in memory (DB + Expire), but appears only once inside the RDB file.
- I agree. In order to cover all use cases, we will need a few variations for that metric. We should consider expired hash size as well as pubsub channels, functions, etc.
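To get a feel for how far apart the in-memory and on-disk numbers can be for a single key, one can compare `MEMORY USAGE` with the serialized length reported by `DEBUG OBJECT`. The sketch below assumes a redis-py/valkey-py style client and is only a probe, not the metric itself:

```python
import redis  # the valkey-py fork exposes the same client API

r = redis.Redis()
r.set("user:1", "x" * 100, ex=3600)  # value with a TTL: tracked in the main dict and the expires dict

in_memory = r.memory_usage("user:1")                    # bytes used in RAM (key + value + overhead)
on_disk = r.debug_object("user:1")["serializedlength"]  # rough size of the value's RDB payload
print(in_memory, on_disk)  # the in-memory footprint is typically several times the serialized size
```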
I agree that per-slot memory size is important, but I think it is already on the roadmap somewhere in CLUSTER SLOT-STATS.
- If I recall correctly, implementing the memory-bytes metric per slot was planned and is likely on the roadmap. @kyle-yh-kim, could you please share the current status of that feature?
As stated before, this might be a good idea, but it might not be so trivial to implement and maintain. I would ask again: how bad is the current dataset size evaluation, and is it really that problematic? (I can think of a few things like fragmented memory posing issues, but I am not sure they are explicit blockers.)
- You're correct that fragmentation is the primary issue with the current dataset size metric. When creating a new replica, we generally expect it to have minimal fragmentation. However, in scenarios where the master's memory is highly fragmented, using the current dataset size calculation may result in an undesirably low limit for the replica's buffer size. This situation can exacerbate the problem, as the master will then have to handle a larger portion of the replication buffer load, despite potentially being in a suboptimal state due to the fragmentation.
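To put illustrative numbers on that (all values made up for the example):

```python
GiB = 1024 ** 3

reported_dataset = 12 * GiB   # current metric on a heavily fragmented master
fragmentation_ratio = 1.5     # so the data itself is closer to ~8 GiB
fresh_load_estimate = int(reported_dataset / fragmentation_ratio)

replica_maxmemory = 10 * GiB
naive_buffer = max(replica_maxmemory - reported_dataset, 0)          # 0 GiB: replica buffers nothing
corrected_buffer = max(replica_maxmemory - fresh_load_estimate, 0)   # ~2 GiB of usable headroom
print(naive_buffer // GiB, corrected_buffer // GiB)
```

With the naive limit the replica buffers nothing and the master carries the whole replication backlog, which is exactly the node we would rather not stress further.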
I think the requirement should also include introducing a new aux header in the RDB. This way, the dual-channel implementation could adjust the buffer based on the value read from the RDB.
- While introducing a new aux header in the RDB file could be beneficial for the dual-channel implementation, I believe we should approach this feature request from a broader perspective. The ability to accurately predict the database size has multiple use cases beyond just the dual-channel scenario. Instead of limiting the scope to dual-channel, we should consider the following:
  a. Including an aux header in the RDB file could indeed be useful for the dual-channel implementation, but it should not be the sole focus of this feature.
  b. Providing slot-level statistics, including memory usage, would be invaluable for slot migration and cluster management.
  c. Exposing the predicted database size as an info field would enable disk space verification and capacity planning across various scenarios.
  By taking a more comprehensive approach, we can ensure that this feature addresses multiple requirements and provides a robust solution.
What did you have in mind here? Warning the user of a predicted full-sync failure? I think this might still be up to an external management sidecar to decide, based on query statistics.
- Ideally, we would want to avoid initiating a save process to disk if there is a high probability that it will fail due to insufficient disk space. By accurately predicting the required disk space for the save operation, we could proactively prevent scenarios where the save process starts but fails midway due to running out of disk space.
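A minimal sketch of the kind of pre-flight check I have in mind (the predicted RDB size would be the new estimate this issue asks for; the 1.1 safety margin is arbitrary):

```python
import shutil

def can_fit_rdb(dump_dir: str, predicted_rdb_bytes: int, safety_margin: float = 1.1) -> bool:
    """Refuse to start a save when the predicted RDB (plus some headroom for
    the temporary file and concurrent writes) cannot fit on the target filesystem."""
    free = shutil.disk_usage(dump_dir).free
    return free >= int(predicted_rdb_bytes * safety_margin)

# e.g. warn or skip BGSAVE up front instead of failing midway through the write:
if not can_fit_rdb("/var/lib/valkey", 9 * 1024**3):
    print("not enough disk space for the predicted RDB; skipping save")
```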
Maybe we should be more detailed here? Like: "it is OK to have a reported size larger than the 'real' dataset size, but not smaller..."
- You make a valid point. We should aim to have the upper limit of the reported database size as accurate as possible. While it is acceptable for the reported size to be slightly larger than the actual dataset size, we should strive to minimize the overestimation. Having a tighter upper limit would ensure more precise resource allocation and prevent unnecessary over-provisioning of resources based on an inflated size estimate.
If I recall correctly, implementing the memory-bytes metric per slot was planned and is likely on the roadmap. @kyle-yh-kim, could you please share the current status of that feature?
Hey Amit, sorry for the delayed response. Yes - per-slot memory metrics are scheduled to be merged by the Valkey 8.2 release. As for the current status of the feature, I left the latest design proposal here: https://github.com/valkey-io/valkey/issues/852, and am now waiting for the core team's feedback. Once design alignment is reached, we will begin the implementation.
Description: Currently, Valkey provides a dataset size calculation that estimates the total used memory minus the internal server struct sizes. However, this calculation does not account for internal fragmentation, leading to an inaccurate representation of the true memory footprint. This feature request proposes enhancing the database size estimation to include internal fragmentation and other relevant factors, providing a more accurate prediction of the overall memory and disk space requirements.
Use Cases:
Dynamic Replication Buffer Size Allocation: With a reliable estimate of the database size, including internal fragmentation, Valkey could dynamically allocate an appropriate replication buffer size for dual-channel replication sync. This would optimize memory usage and prevent over-allocation or under-allocation of the replication buffer, ensuring efficient resource utilization.
Disk Space Verification and Capacity Planning: Knowing the expected database size, including expired data structures, pubsub channels, and other relevant factors, would allow Valkey to verify if there is enough disk space available for persistence operations (e.g., RDB snapshots). This could prevent potential issues caused by running out of disk space during save operations. Additionally, having an accurate estimate of the database size would aid in capacity planning, monitoring, and resource management.
Slot Management and Migration: Providing slot-level memory usage statistics would help manage slots and migrate them accordingly, enabling better load balancing and cluster management.
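For context on the slot-level use case: without a built-in metric, the only way to approximate per-slot memory today is to walk a slot's keys from a client and sum `MEMORY USAGE`, which is far too expensive to run routinely. A rough sketch, assuming a redis-py/valkey-py style client connected to the node that owns the slot:

```python
import redis

def approx_slot_memory(node: redis.Redis, slot: int) -> int:
    """Very rough, O(keys-in-slot) client-side approximation of a slot's memory footprint."""
    count = node.execute_command("CLUSTER COUNTKEYSINSLOT", slot)
    keys = node.execute_command("CLUSTER GETKEYSINSLOT", slot, count) if count else []
    return sum(node.memory_usage(key) or 0 for key in keys)
```

A server-side per-slot metric (as discussed above for CLUSTER SLOT-STATS) would avoid this cost entirely.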
Proposed Implementation:
Benefits:
Additional Considerations: