vogel76 opened this issue 6 years ago (Open)
The RocksDB prototype has been successfully completed. This R&D led to the conclusion that RocksDB can satisfy the needs of the Steem project as a persistent storage layer implementation.
OK, so I'm confused. I must confess I have not yet read the actual code, because I don't feel ready to read it until I understand what this is doing architecturally, which is still not explained here.
So I will try to start at the beginning, so we can build a shared understanding.
The design of the `steemd` state implementation is basically this:
```
-------------------------
 Block / tx / op (chain)
-------------------------
    Undo (chainbase)
-------------------------
Index / Iter (boost::mic)
-------------------------
        Container
-------------------------
```
The core functionality is a container holding a set of objects that can be created, modified, and removed over time. The container has integrated indexing and iteration features, allowing each type of object to have its own index [1]. This much is provided by `boost::multi_index_container` for in-memory containers.
Then on top of `boost::mic` we have the undo semantics provided by `chainbase` [2]. This allows any number of sessions to be started left-to-right. When an object is CRUD'd [3], sufficient information to undo the CRUD is recorded in the rightmost (most recent) session. The rightmost session can be undone, using the recorded information to restore the objects as they were before any CRUD that occurred in the session. The leftmost session can be committed, discarding the recorded information and making the changes permanent.
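To make the session semantics concrete, here is a minimal sketch of the undo mechanism described above. It is hypothetical illustration code, not the chainbase implementation: `std::map` stands in for `boost::multi_index_container`, and the class and method names (`undo_db`, `start_session`, etc.) are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Per-session undo buffer: enough information to reverse any CRUD
// that happened while the session was the rightmost one.
struct undo_state {
    std::map<int, std::string> old_values;  // modified: previous value
    std::map<int, std::string> removed;     // removed: full copy
    std::vector<int> new_ids;               // created in this session
};

class undo_db {
    std::map<int, std::string> table_;   // the "live" container
    std::vector<undo_state> stack_;      // sessions, leftmost = oldest
public:
    void start_session() { stack_.emplace_back(); }

    void create(int id, const std::string& v) {
        table_[id] = v;
        if (!stack_.empty()) stack_.back().new_ids.push_back(id);
    }
    void modify(int id, const std::string& v) {
        if (!stack_.empty())
            stack_.back().old_values.emplace(id, table_.at(id));
        table_[id] = v;
    }
    void remove(int id) {
        if (!stack_.empty())
            stack_.back().removed.emplace(id, table_.at(id));
        table_.erase(id);
    }
    // Reads always hit the live container, so a CRUD followed by a
    // query has read-after-write consistency.
    std::optional<std::string> find(int id) const {
        auto it = table_.find(id);
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }
    // Undo the rightmost (most recent) session.
    void undo() {
        undo_state s = std::move(stack_.back());
        stack_.pop_back();
        for (int id : s.new_ids) table_.erase(id);
        for (auto& [id, v] : s.old_values) table_[id] = v;
        for (auto& [id, v] : s.removed) table_[id] = v;
    }
    // Commit the leftmost session: just drop its undo buffer.
    void commit() { stack_.erase(stack_.begin()); }
};
```

Note that the live table always holds the latest values; the sessions hold only the information needed to roll them back.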
The code that implements blockchain operations does two things: reading (querying) and writing (CRUD). Since the COW keeps the latest information in the main `boost::multi_index_container`, any CRUD followed by a query to the `boost::multi_index_container` has read-after-write consistency, which is assumed in many, many places in the design and code. So queries and iteration are done by directly accessing `boost::multi_index_container` [4].
What part of this picture is being replaced by RocksDB? Not knowing anything about RocksDB other than what I learned in approximately three minutes on the homepage, it seems like RocksDB is a generic database embeddable in C++. Which means, to me, that it does the same thing `boost::multi_index_container` does, except `boost::multi_index_container` does it all in memory, while RocksDB puts most of its records on disk.
So the obvious answer (thinking only about the architecture, and not having read the code) is that RocksDB is intended to replace `boost::multi_index_container`. But if that is what you are doing with RocksDB, you would have to replace the very template-intensive query / iteration API of `boost::multi_index_container`, or change many call sites throughout `chain`, `chainbase`, and multiple plugins. Unless the RocksDB C++ API is specifically designed to be compatible with `boost::multi_index_container`, which seems unlikely to me. So the obvious answer doesn't seem correct.
A less obvious answer is that maybe RocksDB is acting as a sink for the output of the account history plugin. Basically, the plugin registers an event handler, then the event handler sends the data to a RocksDB database. If that is what you are doing, then you have to do quite a bit of work and testing to get the undo semantics correct. You see, when the existing event handler code saves account history in `chainbase`, the code relies on the fact that `chain` and `chainbase` will erase the relevant portion of saved history when a fork occurs. If you use a different database, you have to add your own code to detect forks and handle them identically. One Big Problem [5] is that most databases' transaction semantics isn't quite the same as the left-commit, right-undo COW functionality of undo sessions detailed above. So it's not as simple as adding a hook.
So you may implement code to detect forks and erase the relevant state specifically for account history entries. You could do that, and it would work in theory. But you shouldn't do that, because it won't work in practice the way you think it will work in theory. Specifically, it creates a ripe breeding ground for potentially disastrous bugs [6].
[1] In RDBMS terms, each table has its own indexes.
[2] It's called "undo" in the code, but perhaps more conventionally referred to as COW (copy-on-write).
[3] By "CRUD" I mean "CReate, Update, Delete"; in `steemd`, CRUD is performed by the methods `db.create()` / `db.modify()` / `db.remove()`.
[4] This is not quite true. Single-result queries can be done with the `get()` / `find()` methods in `chainbase` or `database`. But iteration still needs to access the index directly.
[5] This Big Problem is basically the thing that convinced us we needed to effectively roll our own RDBMS using `multi_index_container` for what was then envisioned as the Graphene engine powering BitShares 2.0; the code eventually migrated to Steem.
[6] BitShares 0.x required this sort of code to be implemented for each object type. Doing it non-generically led to a miserable experience where much boilerplate code had to be written, which frequently ended up being buggy due to omitting cases, which led to frequent, hard-to-debug desyncs of production nodes when forks occurred on the main network. Based on this experience, this architecture is clearly a step backward and if I have any say in the matter, I will not allow such architecture to be merged into Steem.
My understanding is that RocksDB would be a replacement for the container code and would require writing a template interface that mirrors `boost::multi_index_container`. (#2041)
If such a feat can be pulled off, then chainbase can be left as is. All of the index code is templated and assumes template instances are multi index containers. I do not recommend replacing the undo code specifically for RocksDB (#2040). I have suggested for the account history plugin that we use Chainbase for reversible state and RocksDB for irreversible state to avoid having to reimplement a critical component of our codebase.
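The suggested split can be sketched as follows. This is a hypothetical illustration, not the plugin's code: `std::map` stands in for both the chainbase index (reversible side) and RocksDB (irreversible side), and all names (`history_sink`, `undo_after`, `set_irreversible`) are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Reversible (forkable) history entries stay in an undoable in-memory
// store; once a block becomes irreversible its entries are moved to the
// persistent store, so fork handling never touches the persistent side.
struct history_sink {
    std::map<int, std::string> reversible;    // chainbase stand-in
    std::map<int, std::string> irreversible;  // RocksDB stand-in

    void record(int block, const std::string& entry) {
        reversible[block] = entry;
    }
    // Called when a fork discards all blocks after `block`.
    void undo_after(int block) {
        reversible.erase(reversible.upper_bound(block), reversible.end());
    }
    // Called when `block` becomes irreversible: move entries out.
    void set_irreversible(int block) {
        auto end = reversible.upper_bound(block);
        irreversible.insert(reversible.begin(), end);
        reversible.erase(reversible.begin(), end);
    }
};
```

The design point is that the irreversible store only ever receives data that can no longer be undone, which is why the undo machinery need not be reimplemented on top of RocksDB.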
The idea is to implement a class mimicking `boost::multi_index_container` functionality, storing its data (on the internal implementation side) in RocksDB. This should allow replacing the original `boost::multi_index_container` instances step by step, without a need to do it all at once.

RocksDB offers `ColumnFamilies`. Even though they are not obligatory, they allow logical organization of the data stored in the database (making things much easier to understand). Thus it is possible to group multiple key->value mappings (which are the essence of this storage) into logical units called tables (equivalents of `boost::multi_index_container`). The number of key->value associations is defined according to the contents of the `indexed_by` clause specified at `boost::multi_index_container` instantiation (which is quite an easy process based on a few parts of the `boost::mpl` library).
The primary `by_id` index associates `ID => <serialized-object>`; each subsequent index associates `<key> => <ID>`.
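The described layout can be sketched like this. It is an illustration under stated assumptions, not the prototype's code: `std::map` stands in for a RocksDB column family, the serialized object is just a string, and the names (`account_table`, `by_name`) are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// One "table": a primary by_id mapping plus one secondary index.
// Each map models a separate RocksDB column family.
struct account_table {
    std::map<int, std::string> by_id;    // ID => <serialized-object>
    std::map<std::string, int> by_name;  // <key> => <ID>

    void insert(int id, const std::string& name, const std::string& blob) {
        by_id[id] = blob;
        by_name[name] = id;
    }
    // A secondary-index lookup is two reads: key -> ID, then ID -> object.
    const std::string& find_by_name(const std::string& name) const {
        return by_id.at(by_name.at(name));
    }
};
```

Keeping secondary indices as `key => ID` (rather than duplicating the object) means an object update only rewrites the primary entry plus any indices whose keys actually changed.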
@theoreticalbts I have a doubt about your description of the `undo_state` behavior and the code implementing it inside the `chainbase::generic_index` related methods:

> The rightmost session can be undone, using the recorded information to restore the objects as they were before any CRUD that occurred in the session. The leftmost session can be committed, discarding the recorded information and making changes permanent.
It looks like any change written to the index is made permanent immediately, as it is stored directly in the `boost::multi_index_container` storage; only the original object's value is preserved for a potential undo. So there is no explicit commit, as you described, which makes the changes permanent. Also, the mentioned `commit` method just discards all recorded undo buffers up to the specified level, instead of applying a recorded change-set onto the final storage (which is what happens in an RDBMS-like commit).
Such an approach is very easy to adapt to RocksDB, since we can have separate storage for the undo states (i.e. the IDs of new objects, and preserved copies of removed/modified ones, similarly to the current implementation). This solution allows correct adjustment of the actual storage by accepting notifications sent from the main database (similarly to what is done at the moment by the methods `undo`, `commit`, `on_create`, `on_modify`, `on_remove`).
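A minimal sketch of that notification-driven scheme, assuming a single key/value table: the backend receives the same `on_create` / `on_modify` / `on_remove` notifications listed above and keeps its undo records in a separate store (in RocksDB this could be a dedicated column family; here `std::map` and a vector stand in, and the type names are invented for the example).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One undo record per notification, kept separately from the main storage.
struct undo_record {
    enum kind { created, modified, removed } what;
    int id;
    std::string old_value;  // unused for 'created'
};

struct rocksdb_like_backend {
    std::map<int, std::string> storage;  // main key/value store
    std::vector<undo_record> undo_log;   // separate undo storage

    void on_create(int id, const std::string& v) {
        undo_log.push_back({undo_record::created, id, {}});
        storage[id] = v;
    }
    void on_modify(int id, const std::string& v) {
        undo_log.push_back({undo_record::modified, id, storage.at(id)});
        storage[id] = v;
    }
    void on_remove(int id) {
        undo_log.push_back({undo_record::removed, id, storage.at(id)});
        storage.erase(id);
    }
    // Driven by chainbase: replay the log backwards to restore state.
    void undo() {
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it) {
            switch (it->what) {
                case undo_record::created:  storage.erase(it->id); break;
                case undo_record::modified:
                case undo_record::removed:  storage[it->id] = it->old_value; break;
            }
        }
        undo_log.clear();
    }
    // As in chainbase: commit only discards the undo buffers.
    void commit() { undo_log.clear(); }
};
```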
As a side note to the previous point, I would also like to discuss the problem of using transactions mentioned here:

> One Big Problem [5] is that most databases' transaction semantics isn't quite the same as the left-commit, right-undo COW functionality of undo sessions detailed above. So it's not as simple as adding a hook.
It is impossible to use transactions because they introduce data isolation (which is also one of the key features of an RDBMS), and that would change the behavior of chainbase (which does not support this feature), so it is not acceptable. Another aspect of using transactions is their lifetime, which by definition shall be as short as possible; having in mind that steemd storage can stay on a fork for a long time, this feature is useless here. Fortunately, RocksDB is well designed and offers multiple layers of implementation: there is a separate variant offering transactional access, called TransactionDB/OptimisticTransactionDB. The one chosen in the prototype is the simplest one, without transaction support.
> A less obvious answer is that maybe RocksDB is acting as a sink for the output of the account history plugin. Basically, the plugin registers an event handler, then the event handler sends the data to a RocksDB database. If that is what you are doing, then you have to do quite a bit of work and testing to get the undo semantics correct.
Yes, the plugin registers an on-operation event handler, and there is no way to avoid it because of the `chainbase::database` architecture: virtual operations can be processed only at this level of block processing. They are created as temporary objects during the processing of blocks/transactions/real blockchain ops, and there is no other way to collect them. The key thing to understand is that undo-state management must be driven by `chainbase::database`, and only stored persistently in a way specific to the chosen solution (here RocksDB, originally the shared memory file).
The implementation must also cover the `lower_bound` and `upper_bound` functions exposed by `boost::multi_index_container` and used massively in the codebase.
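The range-iteration pattern those functions support looks like this. This is a hedged sketch: `std::map` stands in for an ordered `boost::multi_index_container` index (a RocksDB-backed version would reproduce the same semantics with iterator seeks), and the function name and index shape are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collect the IDs of all entries whose key falls in [from, to],
// using the lower_bound / upper_bound pattern common in the codebase.
std::vector<int> ids_in_key_range(const std::map<std::string, int>& by_name,
                                  const std::string& from,
                                  const std::string& to) {
    std::vector<int> out;
    auto end = by_name.upper_bound(to);  // first key strictly greater than 'to'
    for (auto it = by_name.lower_bound(from); it != end; ++it)
        out.push_back(it->second);
    return out;
}
```

Any RocksDB-backed index class would need to expose equivalent ordered-seek semantics for these call sites to keep working unchanged.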
The implementation shall allow advanced data caching, to keep memory usage significantly below the total persistent storage size.