steemit / steem

The blockchain for Smart Media Tokens (SMTs) and decentralized applications.
https://steem.com

Implement dedicated persistent storage of state object model, be able to load data on demand #1987

Open vogel76 opened 6 years ago

vogel76 commented 6 years ago

Implementation shall allow advanced data caching to significantly reduce memory usage compared to the size of the whole persistent storage.

vogel76 commented 6 years ago

The RocksDB prototype has been successfully completed. This R&D led to the conclusion that RocksDB can satisfy the needs of the Steem project as the persistent storage layer implementation.

theoreticalbts commented 6 years ago

OK, so I'm confused. I must confess I have not yet read the actual code, because I don't feel ready to read it until I understand what this is doing architecturally, which is still not explained here.

So I will try to start at the beginning, so we can build a shared understanding.

The design of the steemd state implementation is basically this:

-------------------------
Block / tx / op (chain)
-------------------------
Undo (chainbase)
-------------------------
Index / Iter (boost::mic)
-------------------------
Container
-------------------------

The core functionality is a container holding a set of objects that can be created, modified, and removed over time. The container has integrated indexing and iteration features, allowing each type of object to have its own index [1]. This much is provided by boost::multi_index_container for in-memory containers.

Then on top of boost::mic we have the undo semantics provided by chainbase [2]. This allows any number of sessions to be started left-to-right. When an object undergoes a CRUD operation [3], sufficient information to undo it is recorded in the rightmost (most recent) session. The rightmost session can be undone, using the recorded information to restore the objects as they were before any CRUD operations that occurred in the session. The leftmost session can be committed, discarding the recorded information and making changes permanent.

The code that implements blockchain operations does two things: reading (querying), and writing (CRUD). Since the COW keeps the latest information in the main boost::multi_index_container, any CRUD followed by a query to the boost::multi_index_container has read-after-write consistency, which is assumed in many, many places in the design and code. So queries and iteration are done by directly accessing boost::multi_index_container [4].

What part of this picture is being replaced by RocksDB? Not knowing anything about RocksDB other than what I learned in approximately three minutes on the homepage, it seems like RocksDB is a generic database embeddable in C++. Which means, to me, that it does the same thing boost::multi_index_container does, except boost::multi_index_container does it all in memory, but RocksDB puts most of its records on disk.

So the obvious answer (thinking only about the architecture, and not having read the code) is that RocksDB is intended to replace boost::multi_index_container. But if that is what you are doing with RocksDB, you would have to replace the very template-intense query / iteration API of boost::multi_index_container, or change many call sites throughout chain, chainbase and multiple plugins. Unless the RocksDB C++ API is specifically designed to be compatible with boost::multi_index_container, which seems unlikely to me. The obvious answer doesn't seem correct.

A less obvious answer is that maybe RocksDB is acting as a sink for the output of the account history plugin. Basically, the plugin registers an event handler, then the event handler sends the data to a RocksDB database. If that is what you are doing, then you have to do quite a bit of work and testing to get the undo semantics correct. You see, when the existing event handler code saves account history in chainbase, the code relies on the fact that chain and chainbase will erase the relevant portion of saved history when a fork occurs. If you use a different database, you have to add your own code to detect forks and handle them identically. One Big Problem [5] is that most databases' transaction semantics isn't quite the same as the left-commit, right-undo COW functionality of undo sessions detailed above. So it's not as simple as adding a hook.

So you may implement code to detect forks and erase the relevant state specifically for account history entries. You could do that, and it would work in theory. But you shouldn't do that, because it won't work in practice the way you think it will work in theory. Specifically, it creates a ripe breeding ground for potentially disastrous bugs [6].

[1] In RDBMS terms, each table has its own indexes.

[2] It's called "undo" in the code, but perhaps more conventionally referred to as COW (copy-on-write).

[3] By "CRUD" I mean "Create, Update, Delete"; in steemd these are the methods db.create() / db.modify() / db.remove()

[4] This is not quite true. Single-result queries can be done with get() / find() methods in chainbase or database. But iteration still needs to access the index directly.

[5] This Big Problem is basically the thing that convinced us we need to effectively roll our own RDBMS using multi_index_container for what was then envisioned as the Graphene engine powering BitShares 2.0, but the code eventually migrated to Steem.

[6] BitShares 0.x required this sort of code to be implemented for each object type. Doing it non-generically led to a miserable experience where much boilerplate code had to be written, which frequently ended up being buggy due to omitting cases, which led to frequent, hard-to-debug desyncs of production nodes when forks occurred on the main network. Based on this experience, this architecture is clearly a step backward and if I have any say in the matter, I will not allow such architecture to be merged into Steem.

mvandeberg commented 6 years ago

My understanding is RocksDB would be a replacement for the container code and would require writing a template interface that mirrors boost::multi_index_container. (#2041)

If such a feat can be pulled off, then chainbase can be left as is. All of the index code is templated and assumes template instances are multi index containers. I do not recommend replacing the undo code specifically for RocksDB (#2040). I have suggested for the account history plugin that we use Chainbase for reversible state and RocksDB for irreversible state to avoid having to reimplement a critical component of our codebase.

vogel76 commented 6 years ago
  1. @theoreticalbts As specified in #2041 and commented on by @mvandeberg, the final solution should be a facade template implementation whose public interface supports all of the boost::multi_index_container functionality actually used, while storing its data internally in RocksDB. This should allow the original boost::multi_index_container instances to be replaced step by step, without needing to do it all at once.
  2. One of RocksDB's features is column families. Although they are not obligatory, they allow logical organization of the data stored in the database (which makes things much easier to understand). They make it possible to group multiple key->value mappings (the essence of this storage) into logical units called tables (the equivalents of boost::multi_index_container instances). The number of key->value mappings is determined by the contents of the indexed_by clause specified at boost::multi_index_container instantiation (extracting it is quite easy using a few parts of the boost::mpl library). The primary by_id index maps ID => <serialized-object>; subsequent indices map <key> => <ID>.
  3. @theoreticalbts I have a question about your description of the undo_state behavior versus the code implementing it in the chainbase::generic_index methods:

    The rightmost session can be undone, using the recorded information to restore the objects as they were before any CRUD that occurred in the session. The leftmost session can be committed, discarding the recorded information and making changes permanent.

    It looks like any change written to the index is made permanent immediately, since it is stored directly in the boost::multi_index_container; only the object's original value is preserved for a potential undo. So there is no explicit commit, as you described, that makes the changes permanent. The commit method you mention merely discards all recorded undo buffers up to the specified level, rather than applying a recorded change set to final storage (as a commit does in an RDBMS). This approach is easy to adapt to RocksDB, since we can keep a separate storage area for undo state (the IDs of new objects and preserved copies of removed/modified ones, similar to the current scheme). Such a solution allows the actual storage to be adjusted correctly in response to notifications sent from the main database (similar to what is done at the moment by the undo, commit, on_create, on_modify, and on_remove methods).

  4. As a side note to the previous point, I would also like to discuss the problem of using transactions mentioned here:

    One Big Problem [5] is that most databases' transaction semantics isn't quite the same as the left-commit, right-undo COW functionality of undo sessions detailed above. So it's not as simple as adding a hook.

    Transactions cannot be used here because they introduce data isolation (which is also one of the key features of an RDBMS), and this would change the behavior of chainbase (which does not support this feature), so it is not acceptable. Another aspect of using transactions is their lifetime, which by definition should be as short as possible; given that steemd storage can stay on a fork for a long time, this feature is useless here. Fortunately, RocksDB is well designed and offers multiple implementation layers: there are separate variants offering transactional access, called TransactionDB and OptimisticTransactionDB. The one chosen for the prototype is the simplest, without transaction support.

  5. A less obvious answer is that maybe RocksDB is acting as a sink for the output of the account history plugin. Basically, the plugin registers an event handler, then the event handler sends the data to a RocksDB database. If that is what you are doing, then you have to do quite a bit of work and testing to get the undo semantics correct.

    Yes, the plugin registers an on-operation event handler, and there is no way to avoid it given the chainbase::database architecture: virtual operations can be processed only at this level of block processing. They are created as temporary objects while processing blocks/transactions/real blockchain operations, and there is no other way to collect them. The key thing to understand is that undo-state management must be driven by chainbase::database, and merely stored persistently in a way specific to the chosen solution (here RocksDB; originally a shared memory file).

  6. RocksDB allows various ways to access data, from a single Get (retrieving the value pointed to by a key) to iterating over values matching a given partial key, so it is easy to support the lower_bound and upper_bound functions exposed by boost::multi_index_container and used extensively throughout the codebase.