steviez closed this issue 1 year ago
Some thoughts:
If one is thinking of implementing one's own index anyway and also needs a application-layer storage manager (which is probably a good idea for random reads from disk), maybe it's better to do from scratch, and choose a good storage manager which leverages direct IO into the block device, and kernel capabilities like async IO, io_uring.
One can leverage the easy programming model of mmap without the kernel overhead. Port of AppendVec and implementation of new index could be potentially be easy, with the storage layer taking care of the "hard" stuff.
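The AppendVec-port idea above can be sketched as an append-only log plus a custom in-memory index. Everything below is illustrative (the names are invented, not the real AppendVec API), with a `Vec` standing in for the mmap'd or direct-I/O file:

```rust
use std::collections::HashMap;

/// Minimal sketch of an AppendVec-style append-only log with a custom
/// in-memory index (key -> location). Illustrative only; a real storage
/// manager would back `data` with direct I/O into the device.
struct AppendLog {
    data: Vec<u8>,                           // stands in for the mmap'd / O_DIRECT file
    index: HashMap<Vec<u8>, (usize, usize)>, // key -> (offset, len)
}

impl AppendLog {
    fn new() -> Self {
        Self { data: Vec::new(), index: HashMap::new() }
    }

    /// Append a record and remember its location in the index.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        let offset = self.data.len();
        self.data.extend_from_slice(value);
        self.index.insert(key.to_vec(), (offset, value.len()));
    }

    /// Random read through the index; with a real storage manager this
    /// would become an aligned read at `offset` on the device.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.index
            .get(key)
            .map(|&(off, len)| &self.data[off..off + len])
    }
}
```

The point of the sketch is the division of labor: the index stays simple application code, while all the "hard" stuff (alignment, async submission, caching) lives below the `put`/`get` boundary.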
NVMe Key Value (NVMe-KV) Command Set would be far more interesting as a way to simplify a storage cluster (assuming RDMA access), and could help solve ledger archival with significantly better performance
Proposition
A recent Discord post led to some discussion over the NVMe Key Value (NVMe-KV) Command Set and its potential to give us a speed-up / simplify things for our storage needs. The quick synopsis is that a storage device + driver would natively support key-value storage / lookup, instead of running a key-value solution (such as RocksDB) on top of a traditional block storage device. A Samsung press release claims the following from this new technology:
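Conceptually, the host-side surface of NVMe-KV is just a handful of verbs (the 1.0 spec defines commands along the lines of Store, Retrieve, Delete, Exist, and List). A rough Rust sketch of that surface, with an in-memory map standing in for the device; this is an illustration of the programming model, not the actual command-set encoding:

```rust
use std::collections::HashMap;

/// Rough host-side view of the NVMe-KV verb set. The device itself would
/// implement these, so no RocksDB-style engine is needed on the host to
/// map keys onto blocks. Names are illustrative.
trait KvCommandSet {
    fn store(&mut self, key: &[u8], value: &[u8]);
    fn retrieve(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn exist(&self, key: &[u8]) -> bool;
    fn delete(&mut self, key: &[u8]) -> bool;
}

/// In-memory stand-in for a KV SSD.
struct MockKvSsd(HashMap<Vec<u8>, Vec<u8>>);

impl KvCommandSet for MockKvSsd {
    fn store(&mut self, key: &[u8], value: &[u8]) {
        self.0.insert(key.to_vec(), value.to_vec());
    }
    fn retrieve(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
    fn exist(&self, key: &[u8]) -> bool {
        self.0.contains_key(key)
    }
    fn delete(&mut self, key: &[u8]) -> bool {
        self.0.remove(key).is_some()
    }
}
```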
NVMe Specification
NVMe has a feature specification (1) available that outlines the lower-level interface. Some key things:
It was called out that this specification is fairly low level; an open source library that provides a higher-level API is available (see next section).
Supported Hardware / Vendors
As far as I can tell, the IP for this technology is currently in a "prototype phase" and is not yet widely adopted. As noted earlier, Samsung had a press release mentioning a prototype back in September 2019; I believe that press release corresponds to this more technical document (2). The following note from that document makes it seem that any regular SSD could be converted to a KV SSD in software:
There is an open source software package for all of this at: https://github.com/OpenMPDK/KVSSD
I didn't actually run through the steps, but there looks to be a well-documented process for trying out the software and getting a benchmark up and running. Worth noting: the driver has both a kernel and a user-space component, and the kernel driver does not appear to have been upstreamed yet.
Performance
Sticking with the Samsung prototype (as this was the only thing I could find substantial information on), the same document outlines a case study benchmark versus RocksDB. The Samsung prototype wins out; however, I'm not sure the gains mentioned would extend to us, and some more thought should be given to this. The benchmark setup / results start on page 5 of the same Samsung document.
I also found a paper (3) that pokes some holes in the KV SSD and proposes a new solution. It is an interesting read if you have the time (even just sections 1 and 2, which succinctly capture some good general information). In particular, the paper calls out the ratio of SSD-controller DRAM capacity to SSD storage capacity, and how this ratio could cause issues depending on key/value sizes.
TODO: Repeat above math for some of our use cases (get reasonable estimates for key/value size)
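A back-of-envelope starting point for that TODO. Every number here is an assumption for illustration, not a measurement: I take the in-device per-entry index cost (key + pointer) as 32 bytes and the device as 4 TiB:

```rust
/// Bytes of in-device index DRAM needed, assuming one index entry per
/// stored value. All inputs are assumptions, not measurements.
fn index_dram_bytes(capacity_bytes: u64, avg_value_bytes: u64, per_entry_bytes: u64) -> u64 {
    let entries = capacity_bytes / avg_value_bytes;
    entries * per_entry_bytes
}

// Assumed 4 TiB device, 32 B per index entry:
//   1 KiB average values -> 2^32 entries * 32 B = 128 GiB of index,
//     hopeless against a controller with ~1 GiB of DRAM
//   1 MiB average values -> 2^22 entries * 32 B = 128 MiB of index,
//     which fits comfortably
```

If our values are roughly shred-sized (on the order of a KiB), this naive math would put us near the bad end of that range, which is exactly the concern the paper raises; actual per-entry costs on real hardware would need to be checked.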
Final Thoughts
If we were to pursue this further, several open items stick out to me:
1) The software API is currently C++ and would need to be wrapped for our Rust stack
2) More analysis / benchmarking is needed on a case that is representative of our workload (this could probably be done in C++)
3) The kernel IP not being upstreamed ...
Item 3) seems like it'd be a dealbreaker for pushing this out to validators on our network
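For item 1), the usual shape of such a wrapper is a thin safe Rust layer over `extern "C"` bindings. The `kvs_*` names and signatures below are placeholders I invented, NOT the real OpenMPDK API; in practice the `ffi` module would be generated by bindgen from the vendor headers. It is stubbed in pure Rust here only so the shape of the wrapper is visible:

```rust
use std::collections::HashMap;

// Placeholder "C" layer. In a real wrapper these would be `extern "C"`
// declarations generated by bindgen; names/signatures are invented for
// illustration and stubbed in pure Rust so this sketch is self-contained.
mod ffi {
    use std::collections::HashMap;
    pub type KvsResult = i32;
    pub const KVS_SUCCESS: KvsResult = 0;
    pub const KVS_ERR_KEY_NOT_FOUND: KvsResult = -1;

    pub unsafe fn kvs_store(
        dev: *mut HashMap<Vec<u8>, Vec<u8>>,
        key: *const u8, key_len: usize,
        val: *const u8, val_len: usize,
    ) -> KvsResult {
        let dev = &mut *dev;
        let key = std::slice::from_raw_parts(key, key_len).to_vec();
        let val = std::slice::from_raw_parts(val, val_len).to_vec();
        dev.insert(key, val);
        KVS_SUCCESS
    }

    pub unsafe fn kvs_retrieve(
        dev: *mut HashMap<Vec<u8>, Vec<u8>>,
        key: *const u8, key_len: usize,
        out: *mut Vec<u8>,
    ) -> KvsResult {
        let dev = &*dev;
        let key = std::slice::from_raw_parts(key, key_len);
        match dev.get(key) {
            Some(v) => { *out = v.clone(); KVS_SUCCESS }
            None => KVS_ERR_KEY_NOT_FOUND,
        }
    }
}

/// Safe wrapper: owns the device handle, converts raw status codes into
/// idiomatic Result/Option values, and keeps all `unsafe` in one place.
pub struct KvDevice {
    handle: Box<HashMap<Vec<u8>, Vec<u8>>>,
}

impl KvDevice {
    pub fn open() -> Self {
        Self { handle: Box::new(HashMap::new()) }
    }

    pub fn store(&mut self, key: &[u8], value: &[u8]) -> Result<(), i32> {
        let rc = unsafe {
            ffi::kvs_store(&mut *self.handle, key.as_ptr(), key.len(),
                           value.as_ptr(), value.len())
        };
        if rc == ffi::KVS_SUCCESS { Ok(()) } else { Err(rc) }
    }

    pub fn retrieve(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        let rc = unsafe {
            ffi::kvs_retrieve(&mut *self.handle, key.as_ptr(), key.len(), &mut out)
        };
        (rc == ffi::KVS_SUCCESS).then(|| out)
    }
}
```

The value of this layering is that the rest of the codebase never sees raw pointers or status codes, and swapping the stub for real bindgen output only touches the `ffi` module.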
Links / Resources
(1) NVM Express Key Value Command Set Specification 1.0 (May 18, 2021)
(2) Samsung Key Value SSD enables High Performance Scaling
(3) PinK: High-speed In-storage Key-value Store with Bounded Tails
(4) The Key to Value: Understanding the NVMe Key-Value Standard