valkey-io / valkey

A flexible distributed key-value datastore that supports both caching and beyond caching workloads.
https://valkey.io

[NEW] Data Tiering - Valkey as general data store not just in-memory data store #83

Open selemuse opened 5 months ago

selemuse commented 5 months ago

Now that Valkey is beginning to set its own course, I wonder if Valkey can be enhanced to work as a general data store, rather than just an in-memory data store? It would widen the use cases for Valkey and remove the memory constraint.

sanel commented 5 months ago

I'm not sure if my vote counts, but why, may I ask? Valkey/Redis is good at what it does, and we have plenty of battle-tested KV stores out there (e.g., HBase) that have stood the test of time. However, if the goal of this is to improve snapshotting to be "less visible" and more streamlined like in real databases, I'm for it :)

mattsta commented 5 months ago

If you truly want One System to Rule them All, just learn clickhouse and be at peace forever. Clickhouse has turned into basically what I wish redis would have been if it had proper management.

if anything, this project should try to have a more narrow focus before trying to grow again. feature deprecation would help more than trying to add more complexity. every new feature will have fewer users and take more time to implement and maintain.

"but what about backwards compatibility!" some people yell. What about it? Versions don't stop existing. Just run old versions if you need old versions for legacy systems. You'd be surprised how much industrial equipment still runs on Windows 3.1 out there.

I think what people actually want is a core "lightweight + high-performance data management platform" they can extend for multiple use cases, but the current architecture mixes all concerns together. The unlimited mixing of concerns has become unmanageable. Every part of the system (networking, server, data structures, protocols, storage, replication, clustering) is mixed together, so you can't do a major refactor of one component without touching almost all the other components too (which then breaks compatibility, so nobody does it).

One nice thing about a non-profit foundation model is there's no push to constantly "grow" or "expand" or "capture market share." The project can just grind in the dark to be the best.

[edit: also random idea after writing this: would it make sense to organize a new responsibility structure where there's a per-feature "leader" single person responsible for directing each logical component of networking, server, data structures, protocols, storage, replication, clustering, etc? The project has always had mainly an "everybody knows everything" organization which is great for micromanaging authority, but not the best for long term detailed feature growth in different areas or running concurrent development cycles over the long term.]

kam193 commented 5 months ago

I might misunderstand the idea of the issue's author, but what I understand by 'not just in-memory' is the ability to avoid data loss on eviction in critical situations when real memory usage is higher than expected. I don't think that would be beyond Valkey's scope ;)

But this is just one example that matters to me. I'd also point to three more general-purpose on-disk features from the world around Valkey:

- https://docs.keydb.dev/docs/flash - KeyDB on Flash Storage
- https://docs.redis.com/6.4/rs/databases/redis-on-flash/ - Redis on Flash, a feature of Redis Enterprise in 6.x
- https://docs.redis.com/latest/rs/databases/auto-tiering/ - Auto Tiering, a feature of Redis Enterprise in 7.x

I'd be happy to see a similar feature (or even just the eviction alternative) in the successor of Redis OSS. But this probably requires more discussion and some support for development :)

mattsta commented 5 months ago

Yeah, it's one of those "would be fun/neat/interesting" ideas that is technically fun to build over time, but it's worth considering whether other projects already do this better, and weighing the effort of building something new against who would actually use it.

These systems have been tried in different ways in the past (keep every key in memory, but note which are live vs. paged out to disk; or, if an in-memory lookup fails, check the disk index before returning a failure; etc.), and it also depends on how much you want to optimize for storage systems like the specific flash offerings you mentioned above.

Another amusing part is hardware risk. This also seemed like a fun project to explore when Optane was really taking off 8-10 years ago, but now Optane is EOL so all that work around programming for a specific single-vendor "persistent-RAM-reboot-safe" storage modality is just wasted. yay tech industry.

zuiderkwast commented 5 months ago

@soloestoy you mentioned there are some services in China that do this as a budget alternative, because disk is cheaper than ram. How do these work? Just curious.

I generally agree with Matt that what we need is not more features. If we do less, we can do it better.

PingXie commented 5 months ago

I see a balance to achieve here between sound engineering and user requirements.

To use the OS analogy, I think we need something similar to a microkernel architecture, i.e., we need new features but we don't need to build all of them into the "kernel". I would imagine that we need a core engine that owns the infrastructure such as networking, replication, (multi)threading, and core data structures (strings/lists/sets/hashes/...). This bare-minimum system would cater to all standalone caching use cases. Features like cluster support, scripting (via Lua or other languages), etc. should be built as part of this project but as modules only. This also includes the data tiering feature that @soloestoy explained before. By moving these features into their own modules, we would significantly reduce the coupling between the core engine and the modules, and among the modules themselves, hence speeding up innovation in both the core engine and the new features (such as "data tiering" support).
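As a rough illustration of that kind of boundary, a storage-engine module could plug into the core through a small vtable. This is only a hypothetical sketch; none of these names exist in Valkey or its module API today:

```c
/* Hypothetical sketch only: these names do not exist in Valkey. They
 * illustrate the kind of boundary a "microkernel" core could expose so that
 * data tiering lives in a module rather than in the core engine. */
#include <stddef.h>

typedef struct tieredStore tieredStore;   /* opaque handle owned by the module */

typedef struct tieredStoreOps {
    tieredStore *(*open)(const char *path);
    /* Return 0 on success and fill *val/*vlen with an engine-owned buffer. */
    int  (*get)(tieredStore *s, const void *key, size_t klen,
                void **val, size_t *vlen);
    int  (*put)(tieredStore *s, const void *key, size_t klen,
                const void *val, size_t vlen);
    int  (*del)(tieredStore *s, const void *key, size_t klen);
    void (*close)(tieredStore *s);
} tieredStoreOps;

/* The core would only ever talk to this vtable; a RocksDB-backed module, an
 * NVMe-optimized module, etc. would each provide their own implementation. */
```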

soloestoy commented 5 months ago

Disks can offer larger capacity and lower cost, but disk-based storage and memory-based storage represent two very different development directions and research areas. Indeed, in China there are many disk storage products compatible with the Redis protocol. They all use RocksDB as the disk storage engine, with an encoding layer on top that maps complex data structures to plain key-value pairs in RocksDB, for example:

[image: diagram of the encoding that maps complex data structures to RocksDB key-value pairs]
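As a rough illustration of how such an encoding layer typically flattens a hash into plain key-value pairs (the key layout below is invented for readability, not taken from any specific product):

```c
#include <stdio.h>
#include <stddef.h>

/* Build the per-field key so that HSET user:1 name alice becomes the plain
 * RocksDB pair "h:user:1:name" -> "alice". Real encoders use length-prefixed
 * binary layouts (plus a per-hash metadata record holding field count,
 * version, and TTL) so that user keys containing the separator cannot
 * collide; ':' is used here only to keep the example readable. */
static int encode_hash_field_key(char *out, size_t cap,
                                 const char *key, const char *field) {
    return snprintf(out, cap, "h:%s:%s", key, field);
}

int main(void) {
    char k[64];
    encode_hash_field_key(k, sizeof(k), "user:1", "name");
    printf("%s -> alice\n", k);   /* stored as an ordinary key-value pair */
    return 0;
}
```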

And to enhance the efficiency of disk access, multi-threading is used to access the disk, which also introduces designs for concurrency control. Overall, this is a complex engineering task. Currently, using disk storage is not our top priority I think.

I also want to share some of my views on disk-based storage. Many people think that Redis/Valkey uses memory to store data, and since memory is volatile, it can lead to data loss and cannot guarantee data reliability. Only disk storage can ensure data reliability. I do not fully agree with this view.

First of all, although Valkey uses memory for storage, it also supports persistence. For example, when appendonly (AOF) is enabled, write commands are appended to the log, and even if the process crashes abnormally, data can be recovered from the AOF file. If there is a high demand for persistence, setting appendfsync to always can ensure that every write command is "immediately" flushed to disk. This is not too different from the Write-Ahead Logging (WAL) mechanism of traditional disk-based databases.
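For reference, a minimal configuration for this mode uses the standard directives:

```
appendonly yes          # log every write command to the AOF
appendfsync always      # fsync before acknowledging each write (most durable, slowest)
# appendfsync everysec  # common middle ground: at most about one second of writes at risk
```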

Furthermore, through the above methods, whether it is memory-based Valkey or a traditional database stored on disk, they can only ensure the reliability of data on a single machine. If the machine crashes or the disk is damaged, data recovery is impossible. I believe data reliability relies more on replication: storing data across multiple replicas to avoid data loss due to a single point of failure.

However, data replication between primary and secondary replicas in a multi-replica setup is a serious topic. Currently, because we use an asynchronous replication mechanism, we cannot fully guarantee data consistency between primary and secondary replicas. There are data discrepancies between them, which may lead to the loss of data not yet replicated to the secondary when the primary crashes. Addressing this consistency issue is a challenge, but I believe it is a problem and a direction we should focus on solving in Valkey in the future.

PingXie commented 5 months ago

@soloestoy I think you touched upon a few points that resonate with me really well. I see two high level requirements in this topic:

  1. Cost efficiency
     a. I too see value in "data tiering", and RocksDB is a great/common engine option, but I am sure there are other options too.
     b. My understanding of the "data tiering" value comes mostly from the Redis ecosystem (such as the clients, tooling, and experience/expertise) and the cost benefits. It is a cost play at the end of the day.
     c. AOF also doesn't help with the "cost" ask. You still need to hold all of your data in RAM.

  2. Data durability
     a. AOF does not provide true durability in an HA environment, even if the always policy is used.
     b. In the non-HA case, the AOF user sacrifices "availability" for "durability", which is also not an ideal situation.
     c. The lack of synchronous replication is indeed the first hurdle, IMO, that needs to be overcome in order to achieve RPO_0-like true durability. This is quite a departure from where Redis started, philosophically speaking, but IMO it can be introduced as an opt-in mode.

BTW, the multi-threading support to improve disk access efficiency should be a separate concern from the core engine, IMO. I can also see that solutions like RocksDB could help in both cases (though not completely), but I feel it is helpful to look at the problems on their own first.

zuiderkwast commented 4 months ago

Looking at docs, I find that there was an early feature "virtual memory" which was deprecated in Redis 2.6 and later removed. There's still a document about it in our doc repo: https://github.com/valkey-io/valkey-doc/blob/main/topics/internals-vm.md

madolson commented 3 months ago

Reposting a previous comment from: https://github.com/valkey-io/valkey/issues/553

Yeah, naively mapping disk to memory doesn't work very well, since you see huge latency spikes when you have a virtual memory miss and need to fetch the memory page from disk. You could theoretically hide that if we were multi-threaded, since other threads would continue to get scheduled, but our single-threaded architecture gets hurt too much by it.

A virtual-memory-like approach could work, though, if we built it ourselves in userland. We could make a pretty minor change to the main hash table to indicate whether a key is "in-memory" vs "on-disk". Before executing a command, we can check if its keys are in memory, and if they are we execute the command normally. If a key is on-disk, we can do one of two things:

Implement logic to go execute the command for an on-disk operation. I think this is similar to what Zhao mentioned with rocksdb in other threads. We fetch the data into memory, and once it's there we execute the command as normal. We would need a way to spill items to disk as well.
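A minimal sketch of that control flow, with hypothetical helper names (nothing here is existing Valkey code):

```c
#include <stdbool.h>

/* Hypothetical sketch: none of these names exist in Valkey; they only
 * illustrate the control flow described above. */
typedef struct db db;
typedef struct kvKey kvKey;

bool  key_is_resident(db *d, const kvKey *k);       /* flag in the main hash table */
void *storage_backend_load(db *d, const kvKey *k);  /* blocking read, e.g. RocksDB-backed */
void  db_add_resident(db *d, const kvKey *k, void *val);
void  maybe_spill_cold_keys(db *d);                 /* keep maxmemory honest */
int   execute_command(db *d, const kvKey *k);

int execute_with_tiering(db *d, const kvKey *k) {
    if (!key_is_resident(d, k)) {
        /* Page the value back in, then run the command exactly as if the key
         * had always been in memory. */
        void *val = storage_backend_load(d, k);
        db_add_resident(d, k, val);
        maybe_spill_cold_keys(d);
    }
    return execute_command(d, k);
}
```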

zuiderkwast commented 3 months ago

Reposting another comment from #553.

With the "on-disk" flag per key, the key's name still consumes memory. I have another idea: We use a probabilistic filter for on-disk keys. If the key is not found in memory (main hash table) and the feature is enabled, then we check the probabilistic filter. If we have a match, we go and fetch the key from disk. This can allow a larger number of small keys on disk that what we even want to store metadata for in memory.

We can use new maxmemory policies for this. Instead of evicting, we move a key to disk.

If we implement some module API for these actions (evict hook, load missing key hook), then the glue to rocksdb or another storage backend can be made pluggable.
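A minimal sketch of how the filter check and the two hooks could fit together; all names are hypothetical, not an existing Valkey module API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the lookup path and pluggable hooks described
 * above; these names are illustrative only. */
typedef struct db db;
typedef struct kvKey kvKey;

typedef struct tieringHooks {
    int   (*spill)(db *d, const kvKey *k);         /* maxmemory: move to disk instead of evicting */
    void *(*load_missing)(db *d, const kvKey *k);  /* fetch a key that lives only on disk */
} tieringHooks;

void *lookup_in_memory(db *d, const kvKey *k);      /* main hash table lookup */
bool  filter_might_contain(db *d, const kvKey *k);  /* Bloom or similar probabilistic filter */
void  db_add_resident(db *d, const kvKey *k, void *val);

void *lookup_with_tiering(db *d, const kvKey *k, const tieringHooks *hooks) {
    void *val = lookup_in_memory(d, k);
    if (val != NULL) return val;

    /* The filter can return false positives but never false negatives, so a
     * negative answer lets us skip the disk read entirely. */
    if (!filter_might_contain(d, k)) return NULL;

    val = hooks->load_missing(d, k);                /* may still be NULL on a false positive */
    if (val != NULL) db_add_resident(d, k, val);
    return val;
}
```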

hwware commented 3 months ago

Memory is expensive, disk is cheap. I still remember one impressive claim about Redis: that it is the fastest database in the world. With a feature that stores data on disk, there is no doubt that Valkey's access speed will decrease.

In Valkey, we have other high-priority features to enhance, including how to achieve better data consistency among nodes (standalone mode and cluster mode), a better cluster architecture, better HA, etc.

Also, if we put the data-consistency code in the core, developers will need to take more time to maintain it. So I think now is not a good time to touch this area, unless it works as an independent mode.

hwware commented 3 months ago

> Reposting another comment from #553.
>
> With the "on-disk" flag per key, the key's name still consumes memory. I have another idea: we use a probabilistic filter for on-disk keys. If the key is not found in memory (the main hash table) and the feature is enabled, then we check the probabilistic filter. If we have a match, we go and fetch the key from disk. This can allow a larger number of small keys on disk than we would even want to store metadata for in memory.
>
> We can use new maxmemory policies for this. Instead of evicting, we move a key to disk.
>
> If we implement some module API for these actions (evict hook, load missing key hook), then the glue to rocksdb or another storage backend can be made pluggable.

You can play with KeyDB by compiling it with `make ENABLE_FLASH=yes`; then data will be stored on disk.

PingXie commented 3 months ago

> Memory is expensive, disk is cheap. I still remember one impressive claim about Redis: that it is the fastest database in the world. With a feature that stores data on disk, there is no doubt that Valkey's access speed will decrease.
>
> In Valkey, we have other high-priority features to enhance, including how to achieve better data consistency among nodes (standalone mode and cluster mode), a better cluster architecture, better HA, etc.
>
> Also, if we put the data-consistency code in the core, developers will need to take more time to maintain it. So I think now is not a good time to touch this area, unless it works as an independent mode.

The way I look at it is that there should ideally be a knob that allows users to trade off between cost and performance (and, at some point, consistency too). The Redis/Valkey ecosystem (think of the clients, tooling, etc.) makes it attractive for (some) users to consolidate their workloads on Valkey. Not everyone needs sub-millisecond latency all the time, but knowing that there is an option to get sub-millisecond latency when needed, without switching to a different storage backend, is very appealing IMO. This would also reduce the need for, and complexity of, running and maintaining two systems (a DB and a caching system).

It is indeed a departure from the project's caching roots, but there seems to be a significant amount of user interest in this area. So I think it at least warrants some deep-dive research. I am hopeful that data-tiering can be introduced in a relatively clean way.

madolson commented 3 months ago

I also took a lot of inspiration from this paper, https://www.vldb.org/pvldb/vol6/p1942-debrabant.pdf, which sort of outlines this structure where "RAM" is the primary medium of storage, and we offload some data to disk as needed.

> It is indeed a departure from the project's caching roots, but there seems to be a significant amount of user interest in this area.

I honestly think this type of disk-based storage is more aligned with caching than many of the other beyond-caching workloads Redis came to be associated with. Caching is just a cost-optimization game. If your working set is small, you're usually bottlenecked on network/CPU; if your working set is massive, the high premium for RAM will eat into your costs. If you can serve 80% of requests from an in-memory cache and 20% from a disk-based cache, you're probably still coming out ahead.

> I am hopeful that data-tiering can be introduced in a relatively clean way.

Me too. We aren't going to make tradeoffs that hurt users. It's a great area to explore.

pizhenwei commented 1 month ago

I'd like to share some performance numbers from 4K IO size testing:

- HDD: 100~200 IOPS, 10ms latency
- SSD: 50K IOPS
- NVMe: read 500K IOPS, 80us latency; write 100K IOPS, 20us latency. (The latest products reach read 800K IOPS (4000x~8000x of HDD) at 60us latency and write 200K IOPS (1000x~2000x of HDD) at 10us latency.)

From my point of view, based on modern backend storage, this approach may perform better than it did in previous tests.

The memory (RAM) latency of modern Intel CPUs is 80ns, and the AMD Zen series has 120ns memory latency (because of its multi-die micro-architecture). There is still a huge gap between RAM and NVMe, so simply mapping disk to memory is still not a good choice.

So I agree with @PingXie: if Valkey provides more features to modules and runs as a microkernel-like engine, then it's possible to build any high-performance storage engine on top of modern storage. (RocksDB may be an option, but I guess it would not be the best one on NVMe.)

raphaelauv commented 1 week ago

another source of possible inspiration is https://github.com/apache/kvrocks