
Evaluate Key Value SSD #17959

Closed steviez closed 1 year ago

steviez commented 3 years ago

Proposition

A posting in Discord recently led to some discussion of the NVMe Key Value (NVMe-KV) Command Set and its potential to speed up / simplify things for our storage needs. The quick synopsis is that the storage device + driver would natively support key-value storage / lookup, instead of running a key-value solution (such as RocksDB) on top of a traditional block storage device. A Samsung press release claims the following for this new technology:

There are numerous benefits of KV storage technology. Rather than operating as a block device, the KV SSD moves resource-draining storage operations from the host CPU to the SSD itself. This results in:

  • Much-improved system-level performance
  • Freeing the CPU from computational work, such as block operations and storage-level garbage collection
  • Substantially greater scalability in the number of linked SSDs by reducing CPU overload
  • Greatly reduced write amplification (WAF)
  • Much less wear on each SSD
  • Greater software efficiency

NVMe Specification

NVMe has a feature specification (1) available that outlines the lower-level interface.

It was called out that this specification is fairly low level; there is an open source library that provides a higher-level API (see the next section).
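
For a concrete feel, the set of I/O commands the spec defines is small. Below is a rough Rust-flavored model of them (Store / Retrieve / Delete / Exist / List); the fields and the key-length note are my reading of the 1.0 spec, not verbatim definitions.

```rust
/// Illustrative model of the NVMe-KV I/O commands; field names and types are
/// simplifications for discussion, not the actual wire format.
enum KvCommand<'a> {
    /// Store a value under a key (keys are length-limited by the spec;
    /// 16 bytes max in 1.0, if I'm reading it right).
    Store { key: &'a [u8], value: &'a [u8] },
    /// Retrieve the value stored under a key.
    Retrieve { key: &'a [u8] },
    /// Delete a key and its value.
    Delete { key: &'a [u8] },
    /// Check whether a key exists without transferring the value.
    Exist { key: &'a [u8] },
    /// List keys, starting at a given key.
    List { start_key: &'a [u8] },
}
```

If that key-length limit holds, it is worth flagging on its own, since our natural keys (pubkeys) are 32 bytes and would need hashing or splitting to fit.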

Supported Hardware / Vendors

As far as I could tell, the IP around this technology is currently in a "prototype phase" and is not yet widely adopted. As noted earlier, Samsung had a press release mentioning a prototype back in September 2019. I believe that press release corresponds to this more technical document (2). The following note from that document makes it seem that any regular SSD could be converted to a KV SSD in software:

Surprisingly, the Samsung Key Value SSD needs only standard SSD hardware, which is augmented by special Flash Translation Layer (FTL) software that provides its processing capabilities.

There is an open source software package for all of this at: https://github.com/OpenMPDK/KVSSD

I didn't actually run through the steps, but it looks like there is a well-documented process for trying out the software and getting a benchmark up and running. Something worth noting is that the driver has both a kernel and a user-space component; it doesn't appear that the kernel driver has been upstreamed yet.

Performance

Sticking with the Samsung prototype (as this was the only thing I could find substantial information on), the same document outlines a case study benchmark versus RocksDB. The Samsung prototype wins out; however, I'm not sure the gains mentioned would extend to us, and some more thought should be given to this. The benchmark setup / results start on page 5 of the same Samsung document.

I also found a paper (3) that pokes some holes in the KV SSD and proposes a new solution. It is an interesting read if you have the time (even just reading sections 1 and 2, as they succinctly capture some good general information), but the paper notes the ratio of SSD controller DRAM capacity to SSD storage capacity, and how this ratio could cause issues depending on key/value sizes.

A primary design issue of hash-based KV-SSD is the management of a huge hash table requiring large amounts of DRAM. Suppose that the SSD capacity is 4 TB and the key and value sizes are on average 32B and 1KB, respectively [5]. If the number of buckets is 2^32 (= 2^42 / 2^10) and the bucket size is 36B (32B for a key and 4B for a pointer), 144GB of DRAM is required to hold the complete hash table. If KVSSDs have large enough DRAM to hold the entire hash table, in addition to the O(1) time complexity for calculating an index, a KV access only takes O(1) flash access to read/write the KV pair [44]. However, as mentioned previously, SSDs do not have as much DRAM.

TODO: Repeat above math for some of our use cases (get reasonable estimates for key/value size)
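
Taking a first stab at that TODO as a sketch: the 32B key is a safe assumption (a pubkey), but the 512B average value size below is purely a guess that would need to be measured against our actual account data.

```rust
/// Rough estimate of the controller DRAM a hash-based KV SSD would need to
/// index a device of `capacity_bytes`, using the same bucket math as the
/// quoted paper. All inputs are assumptions for illustration.
fn kv_ssd_index_dram_bytes(capacity_bytes: u64, avg_value_bytes: u64, key_bytes: u64) -> u64 {
    let buckets = capacity_bytes / avg_value_bytes; // one bucket per stored value
    let bucket_bytes = key_bytes + 4;               // key + 4B pointer, per the paper
    buckets * bucket_bytes
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Assumed workload: 4 TB device, 32B keys (pubkey-sized), guessed 512B average value.
    let dram = kv_ssd_index_dram_bytes(4 << 40, 512, 32);
    println!("~{} GiB of controller DRAM", dram / GIB); // ~288 GiB with these guesses
}
```

Even with these rough numbers, the index alone is far beyond what an SSD controller's DRAM is likely to hold, which lines up with the paper's concern.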

Final Thoughts

If we were to pursue this further, there are several open items that stick out to me:

  1. The software API to use this is currently in C++ and would need to be wrapped for our Rust stack (a rough sketch of what that could look like is below)
  2. Need to do some more analysis / benchmarking on a case that is representative of our workload (this could probably be done in C++)
  3. The kernel IP not being upstreamed ...

Item 3 seems like it'd be a dealbreaker with regard to pushing this out to validators on our network.
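
To make item 1 a bit more concrete, here is the general shape of the wrapper I'd expect we'd need. Note that the `kvs_store` / `kvs_retrieve` declarations are placeholders I made up for illustration; the real OpenMPDK signatures would need to be pulled in (probably via bindgen into a `-sys` crate).

```rust
// Hypothetical FFI sketch for item 1; the extern declarations below are
// placeholders, NOT the actual OpenMPDK KVSSD API.
use std::os::raw::{c_int, c_void};

extern "C" {
    // Assumed signatures, for illustration only.
    fn kvs_store(dev: *mut c_void, key: *const u8, key_len: u32,
                 val: *const u8, val_len: u32) -> c_int;
    fn kvs_retrieve(dev: *mut c_void, key: *const u8, key_len: u32,
                    buf: *mut u8, buf_len: u32, out_len: *mut u32) -> c_int;
}

/// Thin safe wrapper the rest of the Rust stack would program against.
pub struct KvDevice {
    handle: *mut c_void,
}

impl KvDevice {
    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<(), c_int> {
        let rc = unsafe {
            kvs_store(self.handle, key.as_ptr(), key.len() as u32,
                      value.as_ptr(), value.len() as u32)
        };
        if rc == 0 { Ok(()) } else { Err(rc) }
    }

    pub fn get(&self, key: &[u8], buf: &mut [u8]) -> Result<usize, c_int> {
        let mut out_len = 0u32;
        let rc = unsafe {
            kvs_retrieve(self.handle, key.as_ptr(), key.len() as u32,
                         buf.as_mut_ptr(), buf.len() as u32, &mut out_len)
        };
        if rc == 0 { Ok(out_len as usize) } else { Err(rc) }
    }
}
```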

Links / Resources

(1) NVM Express Key Value Command Set Specification 1.0 (May 18, 2021)
(2) Samsung Key Value SSD enables High Performance Scaling
(3) PinK: High-speed In-storage Key-value Store with Bounded Tails
(4) The Key to Value: Understanding the NVMe Key-Value Standard

jon-chuang commented 3 years ago

Some thoughts:

  1. On the comments about hash table size: a BTreeMap with 32-byte keys runs into the same issue, so we'd have to implement our own index either way to scale to 1B keys (rough numbers below).
  2. What in-memory caching solution would we resort to? I seem to recall reading that the KV SSD stack has no block/page cache, so one would have to use a storage manager (something like LeanStore, perhaps).
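
For point 1, a back-of-the-envelope for the raw index footprint, assuming an 8-byte offset per entry and ignoring whatever per-node overhead the map/tree itself adds:

```rust
// Rough index-size estimate for 1B entries; the 8-byte value reference is an
// assumption, and real structures add per-node overhead on top of this.
fn main() {
    let keys: u64 = 1_000_000_000;
    let key_bytes: u64 = 32;      // pubkey-sized key
    let value_ref_bytes: u64 = 8; // assumed offset/pointer per entry
    let total = keys * (key_bytes + value_ref_bytes);
    println!("~{} GiB just for key + pointer", total / (1 << 30)); // ~37 GiB
}
```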

If one is thinking of implementing one's own index anyway, and also needs an application-layer storage manager (which is probably a good idea for random reads from disk), maybe it's better to build from scratch and choose a good storage manager that leverages direct I/O to the block device and kernel capabilities like async I/O and io_uring.

One can keep the easy programming model of mmap without the kernel overhead. Porting AppendVec and implementing a new index could potentially be easy, with the storage layer taking care of the "hard" stuff.
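
To make the "storage layer takes care of the hard stuff" part a little more concrete, a minimal read path through io_uring could look like the sketch below (using the `io-uring` crate's 0.6-era API; O_DIRECT setup, buffer alignment, batching, and real error handling are all elided).

```rust
// Minimal io_uring read sketch; assumes `file` is already opened (with
// O_DIRECT if we want to bypass the page cache) and that `buf` satisfies the
// alignment requirements O_DIRECT imposes.
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn read_at(file: &File, offset: u64, buf: &mut [u8]) -> std::io::Result<usize> {
    let mut ring = IoUring::new(8)?;

    let read_e = opcode::Read::new(types::Fd(file.as_raw_fd()),
                                   buf.as_mut_ptr(),
                                   buf.len() as u32)
        .offset(offset)
        .build()
        .user_data(0x42);

    // Safety: the buffer outlives the submission and completion below.
    unsafe { ring.submission().push(&read_e).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue empty");
    if cqe.result() < 0 {
        return Err(std::io::Error::from_raw_os_error(-cqe.result()));
    }
    Ok(cqe.result() as usize)
}
```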

journaux commented 2 years ago

The NVMe Key Value (NVMe-KV) Command Set would be far more interesting for simplifying a storage cluster (assuming RDMA access) - it could help solve ledger archival w/ significantly better perf