verovm / record-replay

GNU Lesser General Public License v3.0

substate database compatibility, portability, and optimization for batch processing #2

Open alkorang opened 1 year ago

alkorang commented 1 year ago

There are several reasons to switch from goleveldb to a new database backend.

We usually run batch analyses over a range of blocks, or filter transactions by type. For these workloads, a KVDB such as goleveldb, pebble, or rocksdb may not be ideal, so it is worth looking at other database types such as an RDBMS. The new database must be stored in local files and, like goleveldb, must support a read-only mode for multithreading/multiprocessing. Built-in database compression is not essential: most of the DB is substate_rlp and code, so we can compress those values ourselves before inserting them.

sqlite3 is one candidate for the new database backend. If we use sqlite, we have to compress substate_rlp and code manually. sqlite stores each database in a single file, so our recorder and replayer should partition the DBs into files of a reasonable size. zstd with a dictionary is very fast and efficient for many small rows, and gozstd supports zstd dictionaries. The dictionary is required for both compression and decompression, so it must be stored in the DB. It may be better to have the recorder store substate_rlp as is and let substate-cli db compact compress substate_rlp and code.
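A minimal sketch of why the dictionary has to be stored next to the data. It uses the standard library's compress/flate preset dictionary as a stand-in for zstd, so it is not gozstd and not the project's code. The helper names compressWithDict and decompressWithDict and the sample rows are made up for illustration.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// compressWithDict compresses one small row using a preset dictionary.
// Assumption: compress/flate stands in for zstd; the idea is the same,
// the dictionary supplies context that a short row cannot provide itself.
func compressWithDict(row, dict []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(row); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressWithDict reverses compressWithDict. It must be given the same
// dictionary, which is why the dictionary has to be kept with the rows.
func decompressWithDict(blob, dict []byte) ([]byte, error) {
	r := flate.NewReaderDict(bytes.NewReader(blob), dict)
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Hypothetical dictionary and row; real ones would be built from substates.
	dict := []byte(`{"balance":"0x","nonce":"0x","codehash":"0x"}`)
	row := []byte(`{"balance":"0x1","nonce":"0x2","codehash":"0xabc"}`)

	blob, err := compressWithDict(row, dict)
	if err != nil {
		panic(err)
	}
	out, err := decompressWithDict(blob, dict)
	if err != nil {
		panic(err)
	}
	fmt.Println("raw bytes:", len(row), "compressed bytes:", len(blob))
	fmt.Println("round trip ok:", bytes.Equal(out, row))
}
```

With sqlite, the dictionary could live in its own small table and be loaded once when the DB is opened, but that layout is only a suggestion.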

(If it is feasible, we can have a look at ORM, GraphQL, etc for a high-level abstraction over SQL. But it may be overkill for substate replayer.)

alkorang commented 1 year ago

For compatibility with other programming languages, the substate encoding must be something that other languages support and can use easily. In that sense, SubstateRLP and its legacy formats are hard to use from other languages: decoding an RLP stream properly requires both an RLP library and the definitions of SubstateRLP and its legacy formats. SubstateJSON, by contrast, is flexible enough to add new fields. JSON is also much easier to use than RLP in other languages, because it is a text format that they can read and convert to their own object/dict/map types.

SubstateJSON is currently too big because it stores bytecode instead of codehash. SubstateRLP stores codehash instead of code to stay small. Maybe we need an option like "json_code" and "json_codehash" to choose between embedding the raw, lengthy bytecode in SubstateJSON, or storing only the codehash in SubstateJSON and loading (codehash, code) from somewhere else. Even then, SubstateJSON is at least twice as large as SubstateRLP, because raw bytes have to be stored as hex strings in JSON. This is a trade-off between compatibility/portability and data size. If compression works very effectively, DB size may become a minor concern.
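A rough sketch of the size trade-off between the two options. The accountJSON struct and the encodeAccount helper are hypothetical and are not the project's SubstateJSON schema. SHA-256 from the standard library stands in for the Keccak-256 code hash, only to keep the example free of external packages.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// accountJSON is a hypothetical shape used only for this illustration.
type accountJSON struct {
	Code     string `json:"code,omitempty"`     // "json_code" style: embed bytecode
	CodeHash string `json:"codehash,omitempty"` // "json_codehash" style: embed hash only
}

// hexString encodes raw bytes as a 0x-prefixed hex string,
// which takes two characters per byte.
func hexString(b []byte) string {
	return "0x" + hex.EncodeToString(b)
}

// encodeAccount embeds either the bytecode or only its hash.
func encodeAccount(code []byte, embedCode bool) ([]byte, error) {
	var a accountJSON
	if embedCode {
		a.Code = hexString(code)
	} else {
		sum := sha256.Sum256(code) // assumption: stand-in for Keccak-256
		a.CodeHash = hexString(sum[:])
	}
	return json.Marshal(a)
}

func main() {
	code := make([]byte, 1000) // placeholder bytecode of 1000 bytes

	withCode, err := encodeAccount(code, true)
	if err != nil {
		panic(err)
	}
	withHash, err := encodeAccount(code, false)
	if err != nil {
		panic(err)
	}
	fmt.Println("raw code bytes:", len(code))
	fmt.Println("JSON with code:", len(withCode))
	fmt.Println("JSON with codehash:", len(withHash))
}
```

The codehash variant stays small regardless of bytecode length, at the cost of a second lookup to fetch the code.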

alkorang commented 1 year ago

"Performance Comparison of the Filesystem and Embedded Key-Value Databases": a performance comparison between SQLite and several embedded KVDBs, with record sizes up to 10MiB and record counts up to 1,000,000.

alkorang commented 1 year ago

sqlx has StructScan which will make it easier to scan an object from a row. https://github.com/jmoiron/sqlx

alkorang commented 1 year ago

Protocol Buffers (Protobuf) is a binary serialization library with several advantages compared to RLP.

Limitations

alkorang commented 1 year ago

TileDB supports filters (bit/byte shuffling, zstd compression) on cells and/or tiles, in row- or column-major order, for dense or sparse arrays. It supports multithreading and parallel I/O, and officially provides bindings for C, C++, Python, R, Java, Go, and C#. This may be a better choice than reinventing the wheel by adding compression support on top of SQLite3. TileDB: https://github.com/TileDB-Inc/TileDB TileDB-Go: https://github.com/TileDB-Inc/TileDB-Go

alkorang commented 1 year ago

When tested with the https://pkg.go.dev/google.golang.org/protobuf package, a round trip through proto.Marshal and proto.Unmarshal converts a zero-length slice in Go (e.g. make([]byte, 0) or []byte{}) to nil. This is worth knowing when handling optional values stored as bytes, such as addresses and hashes, or the bytes of a big.Int zero value.
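One possible way to cope with this is to normalize values after decoding, so that callers never need to tell nil apart from an empty slice. The sketch below uses only the standard library and does not call protobuf. The helper names emptyIfNil and bigFromBytes are hypothetical.

```go
package main

import (
	"fmt"
	"math/big"
)

// emptyIfNil returns a non-nil, zero-length slice when b is nil,
// and returns b unchanged otherwise.
func emptyIfNil(b []byte) []byte {
	if b == nil {
		return []byte{}
	}
	return b
}

// bigFromBytes decodes big-endian bytes into a big.Int.
// nil and empty input both decode to zero, so the nil conversion
// is harmless for big.Int values.
func bigFromBytes(b []byte) *big.Int {
	return new(big.Int).SetBytes(b)
}

func main() {
	// decoded models a bytes field that was empty before encoding
	// and came back as nil, as described in the comment above.
	var decoded []byte

	fmt.Println("decoded is nil:", decoded == nil)
	fmt.Println("normalized is nil:", emptyIfNil(decoded) == nil)
	fmt.Println("big.Int is zero:", bigFromBytes(decoded).Sign() == 0)
}
```

Normalizing in this way only helps when "absent" and "empty" mean the same thing. If the two must be distinguished, the encoding itself needs a way to express presence.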

It is also worth knowing the differences between proto2 and proto3. https://www.hackingnote.com/en/versus/proto2-vs-proto3/

alkorang commented 8 months ago

rr0.4.0 now uses Protobuf instead of RLP for encoding substates.

alkorang commented 8 months ago

Erigon (previously named "Turbo-Geth") had similar issues and concerns when choosing its DB backend. Erigon first chose LMDB over other DBs (link1), then switched from LMDB to MDBX, a well-supported derivative of LMDB (link2).