Open alkorang opened 1 year ago
For compatibility with other programming languages, the substate encoding must be something that other languages support and can use easily. In that sense, SubstateRLP and its legacy formats are hard to use in other languages: an RLP library must be available, and the definitions of SubstateRLP and its legacy formats must be provided, before an RLP stream can be decoded properly. SubstateJSON is flexible enough to add new fields. JSON is much easier to use than RLP from other languages because it is a text format that they can read and convert to their own object/dict/map types.
SubstateJSON is currently too big because it stores bytecode instead of the codehash. SubstateRLP stores the codehash instead of the code to keep it small. Maybe we need options like "json_code" and "json_codehash" to choose between embedding the raw, lengthy bytecode in SubstateJSON, or storing only the codehash in SubstateJSON and loading (codehash, code) from somewhere else. Even so, SubstateJSON is at least twice as large as SubstateRLP, because raw bytes must be stored as hex strings in JSON. This is a trade-off between compatibility/portability and data size. If compression works very effectively, DB size may become a minor concern.
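To make the size trade-off concrete, here is a minimal sketch of the two options using only the Go standard library. The `HexBytes` and `Account` types, the field names, and `EncodeAccount` are hypothetical and not taken from the substate code. SHA-256 stands in for Keccak-256 only because the latter is not in the standard library.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// HexBytes is a hypothetical helper that marshals raw bytes as a
// "0x"-prefixed hex string instead of Go's default base64.
type HexBytes []byte

func (b HexBytes) MarshalJSON() ([]byte, error) {
	return json.Marshal("0x" + hex.EncodeToString(b))
}

func (b *HexBytes) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	if len(s) < 2 || s[:2] != "0x" {
		return fmt.Errorf("missing 0x prefix in %q", s)
	}
	raw, err := hex.DecodeString(s[2:])
	if err != nil {
		return err
	}
	*b = raw
	return nil
}

// Account is an illustrative record: either Code or CodeHash is set.
type Account struct {
	Code     HexBytes `json:"code,omitempty"`
	CodeHash HexBytes `json:"codeHash,omitempty"`
}

// EncodeAccount embeds the bytecode ("json_code") or only its hash
// ("json_codehash"). SHA-256 is a stand-in for Keccak-256 here.
func EncodeAccount(code []byte, embedCode bool) ([]byte, error) {
	if embedCode {
		return json.Marshal(Account{Code: code})
	}
	sum := sha256.Sum256(code)
	return json.Marshal(Account{CodeHash: sum[:]})
}

// DecodeAccount parses either variant back into an Account.
func DecodeAccount(data []byte) (Account, error) {
	var a Account
	err := json.Unmarshal(data, &a)
	return a, err
}

func main() {
	code := make([]byte, 1000) // placeholder bytecode
	withCode, _ := EncodeAccount(code, true)
	withHash, _ := EncodeAccount(code, false)
	fmt.Printf("raw=%d json_code=%d json_codehash=%d\n",
		len(code), len(withCode), len(withHash))
}
```

Each raw byte becomes two hex characters, so the "json_code" variant is always more than twice the size of the bytecode, while the "json_codehash" variant stays at a fixed small size and needs a separate (codehash, code) lookup.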
Performance Comparison of the Filesystem and Embedded Key-Value Databases: a performance comparison between SQLite and several embedded KVDBs, with record sizes up to 10 MiB and record counts up to 1,000,000.
sqlx has `StructScan`, which will make it easier to scan an object from a row. https://github.com/jmoiron/sqlx
Protocol Buffers (Protobuf) is a binary serialization library with several advantages compared to RLP.
`protoc` compiles `.proto` definition files to C, C++, Java, Python, Go, etc. The generated Go code looks clean enough to load and store transaction substates.
`optional` can be used for access lists and gas fees, and `oneof` for message data/initcodehash or account code/codehash.
There will be more fields to add, such as those of EIP-4844, which is implemented and being tested in the latest Geth. https://eips.ethereum.org/EIPS/eip-4844
Limitations:
Protobuf has a `map` type, but its key must be an integral or string scalar type; `bytes` (and floating-point types) cannot be used as keys. This means `string` can be used as a key type, which is similar to JSON's limitation. RLP has no map support at all. But this seems to be just a lack of support in the encoder and decoder, not a problem of time or space. In fact, a Protobuf `map` is serialized as a `repeated` field of key-value pairs called entries, so it may be cleaner to define maps manually for `bytes` keys. https://protobuf.dev/programming-guides/encoding/#maps

TileDB supports filters (bit/byte shuffling, zstd compression) on cells and/or tiles, in row- or column-major order, for dense or sparse arrays. It supports multithreading and parallel I/O, and officially provides bindings for C, C++, Python, R, Java, Go, and C#. This may be a better choice than reinventing the wheel with compression support in SQLite3. TileDB: https://github.com/TileDB-Inc/TileDB TileDB-Go: https://github.com/TileDB-Inc/TileDB-Go
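Returning to the Protobuf `map` limitation: a manually defined map for `bytes` keys could be modeled in Go roughly as below. This is only a sketch using the standard library, with no generated Protobuf code. The `Entry` type and the 32-byte key/value sizes (as for contract storage) are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Entry mirrors a hand-written message such as
//   message Entry { bytes key = 1; bytes value = 2; }
// used inside a `repeated Entry` field. The name is hypothetical.
type Entry struct {
	Key   []byte
	Value []byte
}

// ToEntries flattens a storage map into entries sorted by key, so the
// encoded output is deterministic regardless of Go's map iteration order.
func ToEntries(m map[[32]byte][32]byte) []Entry {
	entries := make([]Entry, 0, len(m))
	for k, v := range m {
		entries = append(entries, Entry{
			Key:   append([]byte(nil), k[:]...), // copy, do not alias the loop variable
			Value: append([]byte(nil), v[:]...),
		})
	}
	sort.Slice(entries, func(i, j int) bool {
		return bytes.Compare(entries[i].Key, entries[j].Key) < 0
	})
	return entries
}

// FromEntries rebuilds the map and rejects malformed or duplicate keys,
// checks that a built-in map type would otherwise perform for us.
func FromEntries(entries []Entry) (map[[32]byte][32]byte, error) {
	m := make(map[[32]byte][32]byte, len(entries))
	for _, e := range entries {
		if len(e.Key) != 32 || len(e.Value) != 32 {
			return nil, fmt.Errorf("entry has key/value length %d/%d, want 32/32",
				len(e.Key), len(e.Value))
		}
		var k, v [32]byte
		copy(k[:], e.Key)
		copy(v[:], e.Value)
		if _, dup := m[k]; dup {
			return nil, fmt.Errorf("duplicate key %x", k)
		}
		m[k] = v
	}
	return m, nil
}

func main() {
	storage := map[[32]byte][32]byte{
		{31: 2}: {31: 0xbb},
		{31: 1}: {31: 0xaa},
	}
	for _, e := range ToEntries(storage) {
		fmt.Printf("%x => %x\n", e.Key, e.Value)
	}
}
```

The cost of the manual definition is that duplicate-key and length checks move into our own code, as `FromEntries` shows.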
When tested with the https://pkg.go.dev/google.golang.org/protobuf package, `proto.Marshal` and `proto.Unmarshal` convert a zero-length slice in Go (e.g. `make([]byte, 0)` or `[]byte{}`) to nil. This behavior is worth knowing when handling optional `bytes` fields for addresses and hashes, or `bytes` for a `big.Int` zero value.
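One way to stay robust against this is to never distinguish nil from empty on the Go side and to check `len(b) == 0` instead of `b == nil`. The sketch below uses only the standard library and does not call the protobuf package. `BigFromBytes`, `BigToBytes`, and `OptionalHash` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/big"
)

// BigFromBytes decodes big-endian bytes. Both nil and an empty slice
// decode to zero, so a nil/empty round trip is harmless for big.Int.
func BigFromBytes(b []byte) *big.Int {
	return new(big.Int).SetBytes(b)
}

// BigToBytes encodes a non-negative big.Int as big-endian bytes.
// Zero encodes to a zero-length slice.
func BigToBytes(x *big.Int) []byte {
	return x.Bytes()
}

// OptionalHash treats any zero-length input (nil or empty) as "absent"
// and returns a nil pointer for it. A 32-byte all-zero hash is still
// considered present.
func OptionalHash(b []byte) (*[32]byte, error) {
	if len(b) == 0 {
		return nil, nil
	}
	if len(b) != 32 {
		return nil, fmt.Errorf("hash has length %d, want 32", len(b))
	}
	var h [32]byte
	copy(h[:], b)
	return &h, nil
}

func main() {
	fmt.Println(BigFromBytes(nil).Sign(), BigFromBytes([]byte{}).Sign())
	h, err := OptionalHash([]byte{})
	fmt.Println(h == nil, err)
}
```

With this convention, presence is carried by the field length (or by `optional`/`oneof` in the schema) rather than by whether the decoded slice happens to be nil.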
It is also worth knowing the difference between proto2 and proto3. https://www.hackingnote.com/en/versus/proto2-vs-proto3/
There are several reasons to switch from goleveldb to a new database backend.
We usually do a batch analysis of a range of blocks, or filter transactions by their types. For these cases, a KVDB like goleveldb, pebble, or rocksdb may not be an ideal solution, so we can look at other database types such as an RDBMS. The new database must use local files and must support a read-only mode for multithreading/multiprocessing, like goleveldb. Database compression is not essential: we know the major part of the DB is substate_rlp and code, so we can compress them before inserting them into the DB.
sqlite3 is one candidate for a new database backend. If we use sqlite, we should manually compress substate_rlp and code. sqlite uses a single file for a single database, so our recorder and replayer should partition DBs into a proper size. zstd with a dictionary is very fast and efficient for many small rows. gozstd supports zstd dictionaries. The dictionary data is required for correct compression and decompression, so it must be stored in the DB. It may be better to have the recorder store substate_rlp as is and make `substate-cli db compact` compress substate_rlp and code.

(If it is feasible, we can look at ORM, GraphQL, etc. for a high-level abstraction over SQL. But that may be overkill for a substate replayer.)
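To illustrate why a shared dictionary helps with many small rows, and why it has to be kept alongside the data, here is a sketch that uses DEFLATE preset dictionaries from the Go standard library (`compress/flate`) as a stand-in. It is not zstd and does not use gozstd. The `Compress`/`Decompress` helpers and the sample row are made up for the example.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
)

// Compress deflates one row using a preset dictionary.
// A nil dict means plain DEFLATE without a dictionary.
func Compress(row, dict []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(row); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Decompress inflates a row. It must be given the same dictionary
// that was used for compression, which is why the dictionary has to
// be stored with the database.
func Decompress(blob, dict []byte) ([]byte, error) {
	r := flate.NewReaderDict(bytes.NewReader(blob), dict)
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	// Made-up sample data: the dictionary holds content that small rows
	// tend to share, such as field names and common values.
	dict := []byte(`{"nonce":"0x0","balance":"0x0","codeHash":"0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470"}`)
	row := []byte(`{"nonce":"0x7","balance":"0x0","codeHash":"0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470"}`)

	withDict, _ := Compress(row, dict)
	noDict, _ := Compress(row, nil)
	fmt.Printf("raw=%d no_dict=%d with_dict=%d\n", len(row), len(noDict), len(withDict))

	out, err := Decompress(withDict, dict)
	fmt.Println(bytes.Equal(out, row), err)
}
```

A single small row has little internal redundancy, so most of the gain comes from matches against the dictionary. The same reasoning applies to zstd dictionaries, though the actual ratios and speed would need to be measured on real substate data.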