input-output-hk / cardano-sl

Cryptographic currency implementing Ouroboros PoS protocol
Apache License 2.0

[CSL-265] Key-value database: initial research and development about its integration #638

Closed jagajaga closed 7 years ago

jagajaga commented 7 years ago
@georgeee
jagajaga commented 7 years ago
@gromak

Ok, so some intermediate results. Basically the work consists of two related parts: the first one is making leveldb/rocksdb compile and work, the second one is rewriting cardano-node to use it. Let me start with the first part. In general I think the results are good, but it took more time than I wanted it to take.

When I installed leveldb-haskell and tried to build cardano-sl with this library as a dependency, I was getting an Attempt to free invalid pointer error from tcmalloc. I tried to investigate it, but didn't manage to find the cause. This error didn't happen for very tiny projects, but it happened for any non-tiny project I tried to build (adding the dependency was enough to break the build, I didn't even import it). The same was true for rocksdb-haskell. I decided not to waste time on this error (there were too many places where the cause could be) and decided to use jemalloc instead (which is in general better than tcmalloc, especially when you have fragmentation problems :wink:). Fortunately it's very easy to build rocksdb with jemalloc: nowadays it's enough to install jemalloc and rocksdb will use it automatically. So I decided to use rocksdb right away instead of temporarily using leveldb and migrating a little bit later (which would be almost trivial, but anyway).

Then I switched to research on the second part for a while, but that will be described later. Then I tried to actually use rocksdb, i.e. get at least one put and get working. Some notes about it:

• The bindings are outdated: the names of some functions have changed, as well as their signatures. Some types have changed too. So a fork is inevitable and I have already forked that repository. Well, we could work with a 3-year-old version of RocksDB instead, but that's ridiculous.
• I renamed it from rocksdb-haskell to rocksdb for two reasons: a) it seems more reasonable to me (why would one add a haskell suffix to a haskell package :confused:) and b) if we decide to upload it to hackage at some point, we'll need a different name.
• The interface is very stupid in some places. For example, get (and many other functions) requires MonadResource even though it doesn't use any MonadResource functionality (there is get :: MonadIO m => Args -> m () and then it's re-exported as get :: MonadResource m => Args -> m (), get = Base.get). Very stupid. We will change it.
• Since rocksdb provides a C API as well (c.h), writing bindings is very easy: https://github.com/agrafix/rocksdb-haskell/blob/master/src/Database/RocksDB/C.hsc
• Since the bindings are outdated, I got some linkage problems. I just commented out this functionality for a while: https://github.com/serokell/rocksdb-haskell/commit/9910b9dc5c49c660eb7b5de233c11717e133f0b9
• Fortunately, the main functionality is up-to-date, so eventually I made it work. See the cardano-tmp executable in the feature/csl265-rocksdb branch. It works fine on my computer.

So if we decide to use rocksdb we need to actualize the bindings (which is an easy task, just port types from C/C++ to haskell, mostly routine), improve the interface (at least eliminate the stupid things, see above) and that's all. Maybe we'll need to write something more complex if we see that we need some advanced feature (e.g. the merge operator), but let's add bindings to new features lazily. The most basic functionality works already, see the sketch below. Of course we'll need to tune options later, so we'll need to actualize some bindings anyway.
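
To give an idea of what the cardano-tmp experiment boils down to, here is a minimal put/get sketch. It assumes the leveldb-haskell-style API that these bindings descend from (open/put/get plus default options under runResourceT); exact names and signatures may differ in our fork.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad.IO.Class       (liftIO)
import           Control.Monad.Trans.Resource (runResourceT)
import           Database.RocksDB             (createIfMissing, defaultOptions,
                                               defaultReadOptions,
                                               defaultWriteOptions, get, open,
                                               put)

main :: IO ()
main = runResourceT $ do
    -- open (and create, if missing) a database directory
    db <- open "/tmp/csl265-test" defaultOptions { createIfMissing = True }
    put db defaultWriteOptions "hello" "world"  -- one put
    v  <- get db defaultReadOptions "hello"     -- one get
    liftIO $ print v                            -- Just "world"
```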

jagajaga commented 7 years ago
@gromak

The second part is about rewriting cardano-sl to use rocksdb. This part is still in progress, at least because the first part took more time than expected. Some notes about it:

• Block processing logic is bad now and will be reworked (currently I think we should use an approach similar to Bitcoin's getheaders, where blocks are received from oldest to youngest, but I haven't finished this research). So we shouldn't blindly port the current logic to rocksdb. As I said many times, the issues are tightly related (switching to another DB, CSL-5, improving the block processing algorithm).
• Our approach should be based on an approach that is known to work in production. Most likely it will be Bitcoin's (I mean Bitcoin Core here), as the most popular and well-described one (well, I haven't read about any other cryptocurrency implementation, so I can't really compare on the well-described property). We are not experts, so we shouldn't invent something totally new and different. Of course it's not necessary to precisely copy everything from Bitcoin (we use a very different language with a different set of libraries and concepts, after all) and it's impossible to copy it precisely (because our algorithm is different, lol, we have SSC, for example).

I have read how Bitcoin stores data. Tl;dr if you don't know:

★ Blocks are stored in plain files (serialized in network format as is). There are also plain files with the data necessary to rollback blocks.
★ There are two leveldb databases: the block index and the utxo.
★ The block index lets one find blocks in the plain files quickly. It contains some service information which one can easily imagine, I guess.
★ The structure of the utxo database is not very interesting now either.

Some further thoughts:

✱ Bitcoin stores the block index fully in memory. ~So let's use acid-state.~ It sounds reasonable (for Bitcoin), because I guess the heaviest part is the block index records: each record is about 100 bytes (my rough calculation) and the current height of the blockchain is 4.5 * 10^5, so together these records should take only 45 MB. Other parts are probably smaller, so the whole index takes about 80 MB now, I suppose, and should easily fit in memory for a long time (RAM volumes probably grow faster).
✱ Bitcoin builds some smart in-memory caches to speed up access to the utxo (it is not fully stored in memory). I am not sure we need that. RocksDB supports an LRU cache and Bloom filters, so read access should be fast. Probably it won't be necessary to create our own cache on top of it, but it has to be benchmarked.
✱ Some differences between our algorithm and Bitcoin which can be important: we need to know balances to check the threshold; we need to validate the SSC part of a block before accepting it; we can't check consensus w/o using block bodies (for a sequence of blocks, at least).
✱ Apparently accepting blocks from oldest to youngest simplifies things crucially.
✱ I have some ideas regarding thread synchronization (e.g. how to deal with situations when two threads received two different valid blocks), but not a final understanding. I want to read https://en.bitcoin.it/wiki/Bitcoin_Core_0.11_(ch_6):_The_Blockchain and think more about it. The utxo database stores the hash of the head block header; when we receive a block, we make a snapshot (rocksdb supports them), check that this hash is really the parent of the received block, and verify all transactions using the same snapshot. If we decide to update the utxo, we perform a batch update (see the sketch below). One of the problems I have in mind: how to ensure that the utxo db and the TVar with SSC data have the same version (i.e. correspond to the same block).
✱ Btw, as I wrote somewhere, I think we don't need to store SSC data on disk at all.

And now I am tired of writing. I will continue the research tomorrow and hopefully will start writing some code as a proof-of-concept.
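
A minimal sketch of the snapshot-check-then-batch-update idea, again assuming the leveldb-haskell-style API (withSnapshot, useSnapshot, write with Put/Del batch ops); the "tip" key and the utxoOps encoding are illustrative, not a real schema.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Monad.Trans.Resource (ResourceT)
import           Data.ByteString              (ByteString)
import           Database.RocksDB             (BatchOp (Del, Put), DB,
                                               defaultReadOptions,
                                               defaultWriteOptions, get,
                                               useSnapshot, withSnapshot, write)

applyUtxoUpdate
    :: DB
    -> ByteString   -- hash of the block we think is our tip
    -> ByteString   -- hash of the new block
    -> [BatchOp]    -- Put/Del operations on utxo entries
    -> ResourceT IO Bool
applyUtxoUpdate db expectedTip newTip utxoOps =
    withSnapshot db $ \snap -> do
        -- all reads below see one consistent version of the utxo
        tip <- get db defaultReadOptions { useSnapshot = Just snap } "tip"
        if tip /= Just expectedTip
            then pure False   -- another block was applied meanwhile
            else do
                -- atomic batch: move the tip and update utxo in one write;
                -- write exclusion itself comes from the MVar discussed later
                write db defaultWriteOptions (Put "tip" newTip : utxoOps)
                pure True
```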

jagajaga commented 7 years ago
@gromak

Ok, so as a result of this issue we have done the following:

  1. We did some research regarding the state of bindings to leveldb and rocksdb. The story can be found in the first comment; let me only write the results here. The bindings to rocksdb are outdated, but the main part works. We have already [forked](https://github.com/serokell/rocksdb-haskell) them. Make sure to build rocksdb (the C++ library) with jemalloc (it will be done automatically if you have jemalloc in your system). So now we only need to actualize these bindings. I will create a separate issue about it.
  2. We did a big research effort on how to store data and work with it. I will describe the results of this research below, because I expect them to be very long (though I will try to make them short and w/o unnecessary information). I think that now we have an understanding of how to do everything. Some technical details are missing, but they don't seem to cause big difficulties. Also there are some possible improvements and open questions about them, but they can be discussed later.
  3. We started playing with rocksdb and writing code in cardano-sl which uses rocksdb. Also we made a refactoring in rocksdb-haskell, because the old interface was ridiculous. Main development will be done in separate issues.

In the next message I will write a detailed plan regarding (2); after that I will create new issues and close this one.

jagajaga commented 7 years ago
@gromak

So now some details about how to store data and work with it. First of all, let me describe how to store data. We store the following data:

• Utxo DB. It is a RocksDB database where we store Utxo = Map TxIn TxOut (unspent outputs), Balances = Map Address Coin (which can be calculated from Utxo; see below why it's needed) and a HeaderHash (for the block after which this Utxo takes place).
• Blocks DB. It is another RocksDB database where conceptually we store blocks, undo data and a block index (i.e. some data necessary to find blocks quickly). In practice we can store (Block, BlockExtra) under the key 'b' + hash of the block's header and UndoData under the key 'u' + hash of the block's header. Here BlockExtra is whatever extra data we need to know about a block which is not part of Block itself. Currently I think we only need to store whether the block is part of the main chain (see below). UndoData is the extra data we need to revert application of a block. In our case UndoData is [[TxOut]]: Undo[i][j] is the TxOut corresponding to the j-th input of the i-th transaction in the block. Future improvement: consider storing blocks in raw files instead of the database. Bitcoin does it this way. Probably the reason is that this way migration to another database is cheap. Another possible reason is that storing too much data in RocksDB may affect performance. Also RocksDB has space amplification, and we don't want any amplification of the size of blocks; we may only want to compress them.
• SSC data. We need to store some data to verify the SSC payload of received blocks quickly and to maintain our local SSC payload for inclusion into a block if we generate one. I expect the size of this data to be small and I don't think it makes sense to store it on disk; let's store it in memory. Persistence is not needed, because it can be recalculated from blocks relatively quickly. There are two reasons why it's small: it is reset when a new epoch starts, and the size of the committee will be small. The only concern is Vss certificates (because they are not reset when a new epoch starts), but I think they won't take much space either, because we will have a limit on TTL. My current vision of what should be stored in SSC data is the following (in the GodTossing case; NistBeacon is trivial). We need to store the GtGlobalState which takes place right after the last processed block. We also need to store the local GtPayload. When we receive GT data we verify it against both GtGlobalState and GtPayload. We don't need any undo data for GT, because the content of a block should be enough to rollback GT data.
• Mempool of transactions. These are the transactions which we will put into a block if we create one. They will be stored in memory. We also need to store an update of Utxo. It can be stored as (Utxo, HashSet TxIn), where the first part is the data to add to Utxo and the second part is the data to remove from Utxo. A sketch of this layout is below.
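
A sketch of this layout in code: the key prefixes for the Blocks DB and the mempool's Utxo delta. The types here are simplified stand-ins for the real cardano-sl types, and the helper names are hypothetical.

```haskell
import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BS8
import           Data.HashSet          (HashSet)
import qualified Data.HashSet          as HS
import           Data.Map.Strict       (Map)
import qualified Data.Map.Strict       as M

type HeaderHash = ByteString
type TxIn       = ByteString   -- stand-in for the real TxIn
type TxOut      = ByteString   -- stand-in for the real TxOut
type Utxo       = Map TxIn TxOut

blockKey :: HeaderHash -> ByteString   -- (Block, BlockExtra) lives here
blockKey = BS8.cons 'b'

undoKey :: HeaderHash -> ByteString    -- UndoData = [[TxOut]] lives here
undoKey = BS8.cons 'u'

-- The mempool's update of the Utxo: (data to add, data to remove).
data UtxoDelta = UtxoDelta
    { udAdd :: Utxo          -- outputs created by mempool txs
    , udDel :: HashSet TxIn  -- inputs spent by mempool txs
    }

-- View of the Utxo with the mempool delta applied on top.
applyDelta :: UtxoDelta -> Utxo -> Utxo
applyDelta (UtxoDelta add del) utxo =
    M.filterWithKey (\k _ -> not (k `HS.member` del)) (add `M.union` utxo)
```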

We can also make some optimizations, e.g. caching. For instance, we can keep an in-memory cache of the last accepted transactions (which is done now) together with the mempool. We can also have some cache for Utxo, but let's not hurry with it: RocksDB lets one use an LRU cache for it, so maybe it won't be necessary. Let's not do optimizations w/o profiling or something similar.

P. S. It's only an overview, I am almost sure that something else will be necessary.

jagajaga commented 7 years ago
@gromak

The next part is about block processing. We will store an MVar in the context (NodeContext) which will ensure that only one thread can apply a block to the state (to the Utxo DB, SSC data, and the mempool of transactions). It makes sense to store the HeaderHash in this MVar for convenience (it will correspond to our tip, i.e. the head of the main chain); see the sketch below.
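
A minimal sketch of this, with hypothetical names: the MVar both serializes state modifications and carries the current tip.

```haskell
import           Control.Concurrent.MVar (MVar, modifyMVar, newMVar)
import           Data.ByteString         (ByteString)

type HeaderHash = ByteString

-- Holding the MVar means exclusive right to modify the state
-- (Utxo DB, SSC data, mempool); its value is always the current tip.
newtype BlockLock = BlockLock (MVar HeaderHash)

mkBlockLock :: HeaderHash -> IO BlockLock
mkBlockLock genesisTip = BlockLock <$> newMVar genesisTip

-- Run an action with exclusive state access; the action receives the
-- current tip and returns the (possibly advanced) new tip.
withBlockLock :: BlockLock -> (HeaderHash -> IO (HeaderHash, a)) -> IO a
withBlockLock (BlockLock v) action = modifyMVar v action
```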

Let's start with the simplest case of block processing. Suppose our main chain is A B C D E, and we receive a header F whose previous block is E. We check that F is valid (e.g. the BodyProof is ok) and request this block. When we receive the block we check its data against the Utxo DB and SSC data. We must ensure that the checks are performed against the correct versions (the versions corresponding to E). Note that the Utxo DB and SSC data can change at any moment, because we haven't taken the MVar yet. To make sure that all checks are performed against the correct versions we use a Snapshot in the case of RocksDB and atomically for SSC data. First of all we check the hash, and only then everything else. Note that we need to return different results when a check fails because of a hash mismatch and when it fails because of malformed data (only in the latter case do we need to ban the peer). If we see that F is a valid continuation of E, we take the MVar and modify the Utxo DB and SSC data (applying block F to them). It's important to check the hash again after taking the lock. We also update the tx mempool (because some txs could have become invalid). We also put information about F into the Blocks DB; this can be done w/o the lock, because this data is immutable. A sketch of this flow is below. I suppose that's all about processing of a single block which continues the main chain.
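
A sketch of this check-then-lock flow, building on the lock above; verifyUnderSnapshot and applyToState are hypothetical placeholders for the real verification (RocksDB snapshot + atomically for SSC data) and application code.

```haskell
import           Control.Concurrent.MVar (MVar, modifyMVar)
import           Data.ByteString         (ByteString)

type HeaderHash = ByteString

data Block = Block { blkPrevHash :: HeaderHash, blkHash :: HeaderHash }

data CheckResult = TipMismatch | Malformed | Valid
data Outcome     = Applied | Ignored | BanPeer

-- Placeholder: verify the block against a consistent snapshot of the state.
verifyUnderSnapshot :: Block -> IO CheckResult
verifyUnderSnapshot _ = pure Valid

-- Placeholder: apply the block to Utxo DB, SSC data and mempool.
applyToState :: Block -> IO HeaderHash
applyToState = pure . blkHash

processBlock :: MVar HeaderHash -> Block -> IO Outcome
processBlock tipVar blk = do
    res <- verifyUnderSnapshot blk       -- checked w/o holding the lock
    case res of
        Malformed   -> pure BanPeer      -- malformed data: ban the peer
        TipMismatch -> pure Ignored      -- hash mismatch: don't ban
        Valid       -> modifyMVar tipVar $ \tip ->
            if tip /= blkPrevHash blk
                then pure (tip, Ignored) -- tip moved while we verified
                else do
                    newTip <- applyToState blk
                    pure (newTip, Applied)
```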

Now let's proceed to more complex cases. There are two difficulties: 1) we may have an outdated state and need to fetch a lot of blocks and 2) forks can appear. Let's consider the most general case of block processing. Suppose our main chain is A B C D E, and suppose someone else has the chain A B C DD EE F. Sooner or later we receive the header of F. Then we request the headers of blocks up to F which have a common ancestor with us. We receive headers of some subset of A B C DD EE F. Then we go through these headers and ignore those which are already part of our main chain (recall from the previous message that we store this information in the Blocks DB). After ignoring some (possibly none) headers, we request all the other blocks and receive them one by one. When we receive DD we can't simply apply it, because it is not a continuation of E. To solve this we introduce stateful sockets, which let us associate some state with a socket (i.e. with a peer). One thing in this state is a [HeaderHash] which stores all the hashes for which we expect to receive blocks. It is empty if we don't wait for any block; it is fully repopulated when we send getblocks, and a hash is removed from it when we receive the corresponding block. When we receive a block whose prev is our tip, we simply apply the procedure described above (and remove the hash from the socket's state). When we receive a block whose prev is not our tip, we put it into the socket's state and wait for the next blocks until we get a block which is more difficult than our tip. After we get DD, EE, F we try to apply them. This consists of LCA search, rollback, verification and application. Anyway, it's similar to what we already have. Sometimes we may want to request more blocks than one getblocks message allows us to request (after a long offline period). For this we store the HeaderHash of the block up to which we want to receive blocks. After receiving all blocks for one getblocks we send another getblocks (probably getheaders first). A sketch of this per-peer state is below.
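
A sketch of the per-peer ("stateful socket") state; the record and its field names are hypothetical.

```haskell
import           Data.ByteString (ByteString)

type HeaderHash = ByteString
data Block = Block   -- stand-in for the real Block

data PeerState = PeerState
    { psExpected :: [HeaderHash]      -- hashes we still expect after getblocks;
                                      -- empty when we are not waiting for blocks
    , psPending  :: [Block]           -- received blocks that don't continue our
                                      -- tip yet (e.g. DD, EE while tip is E)
    , psSyncTo   :: Maybe HeaderHash  -- block up to which we want to sync when
                                      -- one getblocks message is not enough
    }
```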

P. S. I didn't want to write too much, so I probably omitted some details. You can ask me questions if you have any.

jagajaga commented 7 years ago
@gromak

Now a few words about block creation. When we create a block we simply take transactions from the mempool and the GtPayload from SSC data. Since we are using the MVar to synchronize threads, we can be sure that our mempool and GtPayload correspond to the same version. So yes, we know the tip, we know the payloads, we create the block, everything is easy; a sketch is below.
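
A sketch of this, with hypothetical names: both payloads are read while holding the tip MVar, so they are consistent with each other and with the tip.

```haskell
import           Control.Concurrent.MVar (MVar, withMVar)
import           Control.Concurrent.STM  (TVar, atomically, readTVar)
import           Data.ByteString         (ByteString)

type HeaderHash = ByteString
data Tx        = Tx          -- stand-in
data GtPayload = GtPayload   -- stand-in
data Block     = Block HeaderHash [Tx] GtPayload

createBlock :: MVar HeaderHash -> TVar [Tx] -> TVar GtPayload -> IO Block
createBlock tipVar mempoolVar gtVar =
    withMVar tipVar $ \tip -> do
        -- under the lock no thread can apply a block, so both reads
        -- correspond to the same version of the state
        (txs, gt) <- atomically $ (,) <$> readTVar mempoolVar <*> readTVar gtVar
        pure (Block tip txs gt)
```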

jagajaga commented 7 years ago
@gromak

And (hopefully) the last part is about calculations which require knowledge of balances. We need it in two cases:

  1. To check that a node has enough stake to be part of the committee (CSL-93).
  2. For FTS (CSL-90 is related). After a long discussion we decided to store balances in the Utxo DB (Balances = Map Address Coin). When an epoch finishes we iterate over the balances and do two things: FTS and finding all nodes whose stake is above the threshold (see the sketch after this list). We store both results somewhere so that we don't recalculate them again. This way CSL-93 is as easy as checking whether a node is in this set of rich nodes. Note that we use the same stake distribution to find rich nodes as the one used for FTS; this is mostly for convenience. The current implementation of FTS should be somehow generalized to work with an iterator.

Possible future improvement: start the FTS and rich-node calculation earlier than the end of the epoch; restart if a big rollback happens. Possible future improvement: don't iterate through all balances (maybe using a B-tree). Future: when we start taking delegated stake into account, let's store balances together with delegated stake.
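
A sketch of the end-of-epoch pass: one traversal of the balances feeds both the rich-node set (CSL-93) and FTS (CSL-90). followTheSatoshi here is a trivial stand-in for the real (to-be-generalized) FTS; all names are hypothetical.

```haskell
import           Data.ByteString (ByteString)
import           Data.HashSet    (HashSet)
import qualified Data.HashSet    as HS

type Address = ByteString
type Coin    = Word

-- Nodes whose stake is at or above the committee threshold.
richNodes :: Coin -> [(Address, Coin)] -> HashSet Address
richNodes threshold balances =
    HS.fromList [ a | (a, c) <- balances, c >= threshold ]

-- Stand-in: the real FTS does seed-based sampling weighted by stake.
followTheSatoshi :: [(Address, Coin)] -> [Address]
followTheSatoshi = map fst

-- Both results come from the same stake distribution, computed in one
-- pass (in reality over a DB iterator rather than a list).
epochEndPass :: Coin -> [(Address, Coin)] -> (HashSet Address, [Address])
epochEndPass threshold balances =
    (richNodes threshold balances, followTheSatoshi balances)
```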

jagajaga commented 7 years ago
@gromak

Some issues have been created as a result of CSL-265. Also some issues have been updated, in the sense that at least their essence has changed. New issues: • CSL-275 • CSL-276 • CSL-277 • CSL-278 • CSL-279 • CSL-280 • CSL-281 • TW-96 • CSL-282 • CSL-283

Updated issues: • CSL-93 • CSL-90

I am going to close this issue soon; let's work in fine-grained issues from now on.