streamingfast / firehose-core

Firehose Integrators Tool Kit (for `firehose-<chain>` maintainers)
Apache License 2.0

Chain integration protocol specification - Reader program abstraction #17

Closed abourget closed 9 months ago

abourget commented 11 months ago

These are specifications for enabling a chain to be supported by the Firehose and Substreams stack by StreamingFast.

It is a description of the reader node or program in the architecture diagram of Firehose (alternatively called extractors or firehose-enabled nodes).

This reader specification does not presuppose any particular extraction method; the available methods are detailed here: https://firehose.streamingfast.io/integrate-new-chains/integration-overview

Once this is implemented for a chain, the Firehose history should be processable and a real-time Firehose can run live. Substreams can also then be served on such a network.

This does not include any Substreams-specific extensions (like the Ethereum eth_call support), but allows for a generic use of firehose-core without modifications.

Program behavior

Input flags

This program is free to adopt any flags necessary.

Output streams

The reader program should output data through standard UNIX output streams. It is conventional to use stdout for the block data and stderr for any logs, although this is not a hard requirement (the firehose-core stack can be configured to swap the two if needed).

FIRE INIT [READER_PROTOCOL_VERSION] sf.ethereum.type.v2.Block
FIRE [block_num:342342342] [block_hash] [parent_num] [parent_hash] [lib:123123123] B64ENCODEDChainSpecific.Block

READER_PROTOCOL_VERSION is the version of the two-line protocol shown above ^^ The version specified here is 3.
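A minimal sketch of what the emitting side of this protocol could look like, assuming the bracketed fields above are plain space-separated positional values; the `fireInit`/`fireBlock` helper names and the example hashes are illustrative, not firehose-core APIs:

```go
// Hypothetical sketch of the reader side of the two-line protocol above.
// fireInit/fireBlock are illustrative names, not part of firehose-core.
package main

import (
	"encoding/base64"
	"fmt"
)

const readerProtocolVersion = 3

// fireInit builds the one-time handshake line, announcing the protocol
// version and the fully-qualified protobuf type of the chain's block.
func fireInit(blockType string) string {
	return fmt.Sprintf("FIRE INIT %d %s", readerProtocolVersion, blockType)
}

// fireBlock builds one block line: positional fields, then the
// chain-specific block payload, base64-encoded so it stays line-based.
func fireBlock(num uint64, hash string, parentNum uint64, parentHash string, libNum uint64, payload []byte) string {
	return fmt.Sprintf("FIRE %d %s %d %s %d %s",
		num, hash, parentNum, parentHash, libNum,
		base64.StdEncoding.EncodeToString(payload))
}

func main() {
	// In a real reader, these lines go to stdout while logs go to stderr.
	fmt.Println(fireInit("sf.ethereum.type.v2.Block"))
	fmt.Println(fireBlock(342342342, "deadbeef", 342342341, "feedface", 123123123, []byte("demo")))
}
```

The base64 encoding keeps the payload free of newlines, which is what makes the line-based framing work.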

maoueh commented 11 months ago

Any errors, sent to stderr

Most of the time stderr is used for logging purposes; how do you see errors being conveyed there? Or maybe that's exactly it: print/log errors to stderr?

Aptos was one chain that did the inverse (logs through stdout); this created some issues, FYI.

So we ask them to OUTPUT that stream through stdout, with length prefixed, or line-based in B64

One thing I noted when refactoring Ethereum to a single-line output was that having at least the block number somewhere on the line helped (a bit) with debugging, since you can more easily see which block number a given line refers to.
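Conversely, a sketch of the consuming side, which shows why carrying the block number on the line itself helps debugging: it is recoverable even when the payload fails to decode. The positional layout and the `parseFireLine` name are assumptions based on the FIRE line format above, and the sketch folds in the "block cannot be <= LIB" constraint mentioned in the spec notes:

```go
// Illustrative line-based consumer for the FIRE block line; not the real
// firehose-core parser. Assumes plain space-separated positional fields.
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

type fireLine struct {
	Num        uint64
	Hash       string
	ParentNum  uint64
	ParentHash string
	LibNum     uint64
	Payload    []byte // decoded chain-specific block bytes, kept opaque
}

func parseFireLine(line string) (*fireLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 7 || fields[0] != "FIRE" {
		return nil, fmt.Errorf("malformed line: %q", line)
	}
	num, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return nil, err
	}
	parentNum, err := strconv.ParseUint(fields[3], 10, 64)
	if err != nil {
		return nil, err
	}
	libNum, err := strconv.ParseUint(fields[5], 10, 64)
	if err != nil {
		return nil, err
	}
	// Spec constraint discussed in this thread: a block cannot be <= the LIB.
	if num <= libNum {
		return nil, fmt.Errorf("block %d cannot be <= LIB %d", num, libNum)
	}
	payload, err := base64.StdEncoding.DecodeString(fields[6])
	if err != nil {
		// The block number parsed above is still available for the error log.
		return nil, fmt.Errorf("block %d: bad payload: %w", num, err)
	}
	return &fireLine{num, fields[2], parentNum, fields[4], libNum, payload}, nil
}

func main() {
	blk, err := parseFireLine("FIRE 342342342 deadbeef 342342341 feedface 123123123 ZGVtbw==")
	if err != nil {
		panic(err)
	}
	fmt.Println(blk.Num, string(blk.Payload))
}
```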

The order doesn't matter, but all the specs:

Block cannot be <= to LIB

Implement the node-manager layer and deal with those chains as to how to do that layer. A few scripts? A small program?

I do not understand this one; aren't we defining the spec needed for a program to be read by an agnostic reader-node app?

What about the necessary kill cycle when our node wants to make a backup?

That can be specified in the backup spec directly, and it probably makes even more sense there: one backup spec could require a stop while another does not, so the fact that a restart is needed should, to me, be part of the backup spec.

abourget commented 11 months ago

ok, addressed your comments above and updated the post, can you review? thanks for the input!

abourget commented 10 months ago

We should add:

abourget commented 10 months ago

Add?

FIRE INIT [READER_PROTOCOL_VERSION] [DATA_VERSION] sf.ethereum.type.v2.Block

Hmm.. the [DATA_VERSION] ought to be embedded within the type.v2.Block itself, having the payload carry the data version within. That way we don't need to have it on the INIT line..

abourget commented 10 months ago

Do we have an expectation that upon restart, the reader program would continue where it left off? That's the behavior with geth right now.. it depends on the state under the reader, and it is not piloted by the node-manager stack.

maoueh commented 10 months ago

Do we have an expectation that upon restart, the reader program would continue where it left off? That's the behavior with geth right now.. it depends on the state under the reader, and it is not piloted by the node-manager stack.

That is correct, the "reader-node" program is expected to start back at the very next block.

sduchesneau commented 10 months ago

Reader-node starts at the very next block --> for readers that don't do this by design (ex: poller sucker), a recommendation would be to keep a small "cursor" file on disk that includes the last block and the LIB (so that the poller sucker will know if it needs to go back a few blocks because the last read block got reorged). That cursor file could be replaced by the user with a single block number, ex: echo 123455 > cursor, and it would restart from there. Simple; the user controls his "node".

sduchesneau commented 10 months ago

Regarding backups and node-manager expectations:

backups:

node-manager expectations:

sduchesneau commented 10 months ago

Note: things that we "lose" by going down this road:

matthewdarwin commented 10 months ago

That cursor file could be replaced by the user to contain a single block number, ex: echo 123455 > cursor and it would restart from there. simple, user controls his "node".

The arweave firehose poller-sucker works like this already with a small cursor file.

matthewdarwin commented 10 months ago
  • If the user needs some logic pre-start or pre-stop, they can provide a "script" to run instead of binary+args and handle whatever they want in there; that's their option.

We do this all the time.

abourget commented 10 months ago

For those last two notes, here are strategies that allow an extensible, blockchain-agnostic core for Substreams and Firehose:

abourget commented 10 months ago

The tool download-from-firehose needs a ToProto, because it extracts some values from within the ...ethereum.type.v2.Block to reconstruct a bstream.Block. This will pose a problem for genericizing that particular tool, as it needs to understand that block.

abourget commented 10 months ago

To handle download-from-firehose, we need to add the parent_num (or parent_block_num) to bstream.Block.

Also, in sf.firehose.v2.Response, add a few fields contained within the bstream.Block as top-level fields. This way, download-from-firehose can reconstruct the bstream.Block on the other end in a generic way.
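To illustrate, a hypothetical sketch of that generic reconstruction, assuming sf.firehose.v2.Response grew top-level number/hash/parent_num/LIB fields as proposed; the struct shapes and names here are invented for illustration and are not the real protobuf definitions:

```go
// Sketch only: stand-ins for the proposed top-level Response fields and
// for the shape of bstream.Block, to show why no chain-specific ToProto
// would be needed once those fields are lifted to the top level.
package main

import "fmt"

// responseHeader stands in for the proposed top-level fields on
// sf.firehose.v2.Response.
type responseHeader struct {
	Number     uint64
	Hash       string
	ParentNum  uint64 // the field this thread proposes adding
	ParentHash string
	LibNum     uint64
	Payload    []byte // opaque chain-specific block, never inspected
}

// genericBlock mimics the shape of a bstream.Block that a generic
// download-from-firehose could write, without decoding the payload.
type genericBlock struct {
	Number, ParentNum, LibNum uint64
	Hash, ParentHash          string
	Payload                   []byte
}

// fromResponse needs no chain-specific knowledge: every field it touches
// is top-level, which is exactly the point of lifting them there.
func fromResponse(h responseHeader) genericBlock {
	return genericBlock{
		Number: h.Number, ParentNum: h.ParentNum, LibNum: h.LibNum,
		Hash: h.Hash, ParentHash: h.ParentHash, Payload: h.Payload,
	}
}

func main() {
	b := fromResponse(responseHeader{Number: 10, Hash: "aa", ParentNum: 9, ParentHash: "bb", LibNum: 5})
	fmt.Println(b.Number, b.ParentNum, b.LibNum)
}
```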

abourget commented 10 months ago

Layout of chain-specific repo:

Ex: streamingfast/firehose-bitcoin

README.md
- Reader in `nodeos --firehose-enabled`
- Proto def in solana/source-code/confirmed_blocks.proto

- Configuration to boot:
  firecore start --reader-node fireeth

firehose-bitcoin-btc-reader/main.go  VERSION
firehose-bitcoin-tools/main.go
pkg/polling-implementation.go
wasm-extensions/[eth_calls]  VERSION
substreams-crates/src/bitcoin-decoding-stuff.rs  VERSION
tools/[non-generic-tools]
substreams-explorer/
substreams.yaml -> `extract_firehose_blocks` produces an `.spkg`  VERSION
proto/sf/protocol/type/v1/block.proto VERSION

Releases can be kept in sync with the tag of the repo. Some releases might not include all of the pieces in there (maybe you don't release the spkg if only the tools are updated with a new tag..)

abourget commented 10 months ago

Some work done to genericize:

abourget commented 10 months ago

We can remove the payload_version from the bstream.Block, and remove the checks for acceptance of that version.

We have decided to go with type.v1 and type.v2 when making large version bumps of the Ethereum Block, for instance. And we have a Ver within that Ethereum block for knowing the content revision, taken from the FIRE INIT line and interpreted by the Reader, or simply produced with a certain revision of the reader node (hard-coded in the reader version when it acts differently).

abourget commented 10 months ago

This will allow stats to be gathered at the READER layer and brought into firecore. We can extract the throughput metrics from firehose-ethereum and bring them back into a generic firecore.