substate database compatibility, portability, and optimization for batch processing

alkorang commented 1 year ago

There are several reasons to switch from goleveldb to a new database backend.

Geth 1.11 introduced Pebble as a replacement for goleveldb, and Pebble became the default DB engine in Geth 1.12. The Geth 1.11 release notes explain why they moved away from goleveldb.
In my tests, goleveldb was incompatible with the official leveldb library from Google and with its wrappers for other programming languages. I must use Go with goleveldb even for a simple inspection, such as counting how many transactions are in the substate database.

We usually do batch analysis over a range of blocks, or filter transactions by type. For these cases, a KVDB like goleveldb, Pebble, or RocksDB may not be an ideal solution, so we can look at other database types such as an RDBMS. The new database must be stored in local files and must support a read-only mode for multithreading/multiprocessing, as goleveldb does. Database-level compression is not essential: we know that the major part of the DB is substate_rlp and code, so we can compress those before inserting them into the DB.

sqlite3 is one candidate for a new database backend. If we use sqlite, we have to compress substate_rlp and code manually. sqlite uses a single file for a single database, so our recorder and replayer should partition the DBs into files of a reasonable size. zstd with a dictionary is very fast and efficient for many small rows, and gozstd supports zstd dictionaries. The dictionary data is required for correct compression and decompression, so it must be stored in the DB. It may be better to have the recorder store substate_rlp as is, and have substate-cli db compact compress substate_rlp and code.
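To illustrate why the dictionary has to be stored next to the data, here is a minimal sketch of dictionary-based compression of one small row. It is an assumption-laden stand-in: it uses DEFLATE preset dictionaries from Go's standard library (compress/flate) instead of zstd/gozstd, and the dictionary and row contents are made up for illustration. The workflow is the same in spirit: rows that share content with the dictionary compress much better, and decompression needs exactly the same dictionary bytes.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// compressWithDict deflates data using a preset dictionary.
// NOTE: DEFLATE is only a stdlib stand-in for zstd dictionaries here.
func compressWithDict(data, dict []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressWithDict inflates data. The caller must pass the same
// dictionary that was used for compression, which is why the
// dictionary has to be persisted together with the compressed rows.
func decompressWithDict(compressed, dict []byte) ([]byte, error) {
	r := flate.NewReaderDict(bytes.NewReader(compressed), dict)
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Hypothetical dictionary and row; real ones would be trained on
	// or sampled from actual substate_rlp and code values.
	dict := []byte(`{"nonce":0,"balance":"0x0","codehash":"0x`)
	row := []byte(`{"nonce":0,"balance":"0x0","codehash":"0xabc123"}`)

	withDict, err := compressWithDict(row, dict)
	if err != nil {
		panic(err)
	}
	noDict, err := compressWithDict(row, nil)
	if err != nil {
		panic(err)
	}
	restored, err := decompressWithDict(withDict, dict)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw=%d noDict=%d withDict=%d roundTrip=%v\n",
		len(row), len(noDict), len(withDict), bytes.Equal(restored, row))
}
```

With zstd the dictionary would normally be trained from sample rows rather than written by hand, but the storage requirement is the same: one dictionary blob per partition, kept in the DB.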

(If it is feasible, we can have a look at ORM, GraphQL, etc for a high-level abstraction over SQL. But it may be overkill for substate replayer.)

alkorang commented 1 year ago

For compatibility with other programming languages, the encoding of substates must be something that other languages support and can use easily. In that sense, SubstateRLP and its legacy formats are hard to use from other languages: to decode the RLP stream properly, the other language needs an RLP library, and the definitions of SubstateRLP and its legacy formats must be provided. SubstateJSON is flexible when adding new fields. JSON is also much easier to use than RLP in other languages, because it is a text format that they can read and convert to their own object/dict/map types.

SubstateJSON is currently too big because it stores bytecode instead of the codehash. SubstateRLP stores the codehash instead of the code to keep it small. Maybe we need options like "json_code" and "json_codehash" to choose between embedding the raw, lengthy bytecode in SubstateJSON, or storing only the codehash in SubstateJSON and loading (codehash, code) from somewhere else. Even then, SubstateJSON is at least twice as large as SubstateRLP, because raw bytes have to be stored as hex strings in JSON. This is a trade-off between compatibility/portability and data size. If compression works very effectively, DB size may become a minor concern.
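A small sketch of the size trade-off described above. Everything here is hypothetical: accountJSON is not the real SubstateJSON schema, the embedCode flag only mimics the "json_code"/"json_codehash" idea, and SHA-256 stands in for Keccak-256 because the latter is not in Go's standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// accountJSON is a hypothetical JSON shape, NOT the SubstateJSON schema.
type accountJSON struct {
	Code     string `json:"code,omitempty"`
	CodeHash string `json:"codehash,omitempty"`
}

// encodeAccount embeds either the hex-encoded code or only its hash.
func encodeAccount(code []byte, embedCode bool) ([]byte, error) {
	var a accountJSON
	if embedCode {
		// Hex encoding doubles the size: 2 characters per raw byte.
		a.Code = "0x" + hex.EncodeToString(code)
	} else {
		// Assumption: SHA-256 as a stand-in for Ethereum's Keccak-256.
		sum := sha256.Sum256(code)
		a.CodeHash = "0x" + hex.EncodeToString(sum[:])
	}
	return json.Marshal(a)
}

func main() {
	code := make([]byte, 1000) // pretend this is contract bytecode

	embedded, err := encodeAccount(code, true)
	if err != nil {
		panic(err)
	}
	hashed, err := encodeAccount(code, false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw code: %d bytes\n", len(code))
	fmt.Printf("JSON with code: %d bytes\n", len(embedded))
	fmt.Printf("JSON with codehash: %d bytes\n", len(hashed))
}
```

With the codehash variant, the JSON size no longer depends on the bytecode length, at the cost of a separate (codehash, code) lookup.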

alkorang commented 1 year ago

"Performance Comparison of the Filesystem and Embedded Key-Value Databases": a performance comparison between SQLite and several embedded KVDBs, with record sizes up to 10 MiB and record counts up to 1,000,000.

alkorang commented 1 year ago

sqlx has StructScan which will make it easier to scan an object from a row. https://github.com/jmoiron/sqlx

alkorang commented 1 year ago

Protocol Buffers (Protobuf) is a binary serialization library with several advantages compared to RLP.

Speed, vs. JSON: Obviously, Protobuf is faster than JSON
Speed, vs. RLP: 2x faster than RLP, see https://github.com/Fantom-foundation/go-lachesis/issues/158
Compatibility: protoc compiles .proto definition files to C++, Java, Python, Go, and other languages (C is available through the third-party protobuf-c project). The generated Go code looks clean enough to load and store transaction substates.
Extensibility: optional for access lists and gas fees, oneof for message data/initcodehash or account code/codehash. There will be more fields to add, such as those from EIP-4844, which is implemented and being tested in the latest Geth. https://eips.ethereum.org/EIPS/eip-4844

Limitations

Protobuf has no fixed-size bytes type for 20-byte addresses and 32-byte values (hashes and integers). But RLP and JSON do not have such fixed-size types either. The length must be checked at the application level, not the protocol level.
Protobuf has a map type, but its key must be an integral or string type (not floating-point or bytes). This means a string can be used as a key type, which is similar to JSON's limitation. RLP has no map support at all. But this seems to be just a lack of support in the encoder and decoder, not a problem of time or space. In fact, a Protobuf map is serialized as a repeated field of key-value pairs called entries, so it may be cleaner to define maps manually for bytes keys. https://protobuf.dev/programming-guides/encoding/#maps
The Protobuf message size limit is 2GB, so we must check whether any transaction substate exceeds this limit. The current strategy with RLP, which saves codehashes in substates and loads code from the database, will relieve this limit with Protobuf, too. https://protobuf.dev/programming-guides/encoding/#size-limit
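The first two limitations can be handled together in application code. Below is a minimal sketch, assuming a manually defined repeated entry message with bytes key and value. StorageEntry, Hash, toHash, and entriesToMap are hypothetical names for illustration, not generated Protobuf code and not the record-replay API.

```go
package main

import (
	"errors"
	"fmt"
)

// Hash is a fixed-size 32-byte value, as used for storage keys and values.
type Hash [32]byte

// StorageEntry mimics one manually defined key-value entry message
// (hypothetical; a generated Protobuf struct would carry more fields).
type StorageEntry struct {
	Key   []byte
	Value []byte
}

var errBadLength = errors.New("expected exactly 32 bytes")

// toHash enforces the fixed size that the wire format cannot express.
func toHash(b []byte) (Hash, error) {
	var h Hash
	if len(b) != len(h) {
		return h, fmt.Errorf("%w: got %d", errBadLength, len(b))
	}
	copy(h[:], b)
	return h, nil
}

// entriesToMap converts repeated entries into a Go map keyed by
// fixed-size arrays. Later entries overwrite earlier ones, which
// matches how Protobuf treats duplicate map keys (last one wins).
func entriesToMap(entries []StorageEntry) (map[Hash]Hash, error) {
	m := make(map[Hash]Hash, len(entries))
	for i, e := range entries {
		k, err := toHash(e.Key)
		if err != nil {
			return nil, fmt.Errorf("entry %d key: %w", i, err)
		}
		v, err := toHash(e.Value)
		if err != nil {
			return nil, fmt.Errorf("entry %d value: %w", i, err)
		}
		m[k] = v
	}
	return m, nil
}

func main() {
	key := make([]byte, 32)
	key[31] = 1
	val := make([]byte, 32)
	val[31] = 2

	m, err := entriesToMap([]StorageEntry{{Key: key, Value: val}})
	fmt.Println(len(m), err)

	// A 20-byte key is rejected at the application level.
	_, err = entriesToMap([]StorageEntry{{Key: key[:20], Value: val}})
	fmt.Println(err)
}
```

The same length check applies to 20-byte addresses; only the array size changes.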

alkorang commented 1 year ago

TileDB supports filters (bit/byte shuffling, zstd compression) on cells and/or tiles, in row- or column-major order, for dense or sparse arrays. It supports multithreading and parallel I/O, and officially provides bindings for C, C++, Python, R, Java, Go, and C#. This may be a better choice than reinventing the wheel with compression support on top of SQLite3. TileDB: https://github.com/TileDB-Inc/TileDB TileDB-Go: https://github.com/TileDB-Inc/TileDB-Go

alkorang commented 1 year ago

When tested with the https://pkg.go.dev/google.golang.org/protobuf package, a round trip through proto.Marshal and proto.Unmarshal turns a zero-length slice in Go (e.g. make([]byte, 0) or []byte{}) into nil. This behavior is worth knowing when handling optional bytes fields for addresses and hashes, or bytes for the big.Int zero value.

It is also worth knowing the differences between proto2 and proto3. https://www.hackingnote.com/en/versus/proto2-vs-proto3/

alkorang commented 8 months ago

rr0.4.0 now uses Protobuf instead of RLP for encoding substates.

alkorang commented 8 months ago

Erigon (previously named "Turbo-Geth") had similar issues and concerns in choosing the DB backend. Erigon chose LMDB over other DBs (link1). Erigon switched from LMDB to MDBX, a well-supported derivative of LMDB (link2).

verovm / record-replay

substate database compatibility, portability, and optimization for batch processing #2