streamingfast / firehose-core

Firehose Integrators Tool Kit (for `firehose-<chain>` maintainers)
Apache License 2.0

Chain integration protocol specification - Reader program abstraction #17

Closed abourget closed 9 months ago

abourget commented 11 months ago

These are specifications for enabling a chain to be supported by the Firehose and Substreams stack by StreamingFast.

It is a description of the reader node or program in the architecture diagram of Firehose (alternatively called extractors or firehose-enabled nodes).

This reader specification does not presuppose any particular extraction method; the available methods are detailed here: https://firehose.streamingfast.io/integrate-new-chains/integration-overview

Once this is implemented for a chain, the Firehose history should be processable and a real-time Firehose can run live. Substreams can also then be served on such a network.

This does not include any Substreams-specific extensions (like the Ethereum eth_call support), but allows for a generic use of firehose-core without modifications.

Program behavior

Input flags

This program is free to adopt any flags necessary.

Output streams

The reader program should output data through standard UNIX output streams. It is conventional to use stdout for the block data and stderr for any logs, although this is not a hard requirement (the firehose-core stack can be configured to swap the two if needed).

FIRE INIT [READER_PROTOCOL_VERSION] sf.ethereum.type.v2.Block
FIRE [block_num:342342342] [block_hash] [parent_num] [parent_hash] [lib:123123123] B64ENCODEDChainSpecific.Block

READER_PROTOCOL_VERSION is the version of the two-line protocol shown above ^^ The version specified here is 3.
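A minimal sketch of what the emitting side of this protocol could look like, assuming the bracketed fields above are plain space-separated positional values; the `fireInit`/`fireBlock` helper names and the example hashes are illustrative, not firehose-core APIs:

```go
// Hypothetical sketch of the reader side of the two-line protocol above.
// fireInit/fireBlock are illustrative names, not part of firehose-core.
package main

import (
	"encoding/base64"
	"fmt"
)

const readerProtocolVersion = 3

// fireInit builds the one-time handshake line, announcing the protocol
// version and the fully-qualified protobuf type of the chain's block.
func fireInit(blockType string) string {
	return fmt.Sprintf("FIRE INIT %d %s", readerProtocolVersion, blockType)
}

// fireBlock builds one block line: positional fields, then the
// chain-specific block payload, base64-encoded so it stays line-based.
func fireBlock(num uint64, hash string, parentNum uint64, parentHash string, libNum uint64, payload []byte) string {
	return fmt.Sprintf("FIRE %d %s %d %s %d %s",
		num, hash, parentNum, parentHash, libNum,
		base64.StdEncoding.EncodeToString(payload))
}

func main() {
	// In a real reader, these lines go to stdout while logs go to stderr.
	fmt.Println(fireInit("sf.ethereum.type.v2.Block"))
	fmt.Println(fireBlock(342342342, "deadbeef", 342342341, "feedface", 123123123, []byte("demo")))
}
```

The base64 encoding keeps the payload free of newlines, which is what makes the line-based framing work.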

maoueh commented 11 months ago

Any errors, sent to stderr

Most of the time stderr is used for logging purposes; how do you see errors being conveyed there? Or maybe that's exactly it: print/log errors to stderr?

Aptos was one chain that did the inverse (logs through stdout); this created some issues, FYI.

So we ask them to OUTPUT that stream through stdout, with length prefixed, or line-based in B64

One thing I noted when refactoring Ethereum to a single-line output was that having at least the block number somewhere on the line helped (a bit) with debugging, since you can more easily see which block number a given line refers to.
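Conversely, a sketch of the consuming side, which shows why carrying the block number on the line itself helps debugging: it is recoverable even when the payload fails to decode. The positional layout and the `parseFireLine` name are assumptions based on the FIRE line format above, and the sketch folds in the "block cannot be <= LIB" constraint mentioned in the spec notes:

```go
// Illustrative line-based consumer for the FIRE block line; not the real
// firehose-core parser. Assumes plain space-separated positional fields.
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

type fireLine struct {
	Num        uint64
	Hash       string
	ParentNum  uint64
	ParentHash string
	LibNum     uint64
	Payload    []byte // decoded chain-specific block bytes, kept opaque
}

func parseFireLine(line string) (*fireLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 7 || fields[0] != "FIRE" {
		return nil, fmt.Errorf("malformed line: %q", line)
	}
	num, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return nil, err
	}
	parentNum, err := strconv.ParseUint(fields[3], 10, 64)
	if err != nil {
		return nil, err
	}
	libNum, err := strconv.ParseUint(fields[5], 10, 64)
	if err != nil {
		return nil, err
	}
	// Spec constraint discussed in this thread: a block cannot be <= the LIB.
	if num <= libNum {
		return nil, fmt.Errorf("block %d cannot be <= LIB %d", num, libNum)
	}
	payload, err := base64.StdEncoding.DecodeString(fields[6])
	if err != nil {
		// The block number parsed above is still available for the error log.
		return nil, fmt.Errorf("block %d: bad payload: %w", num, err)
	}
	return &fireLine{num, fields[2], parentNum, fields[4], libNum, payload}, nil
}

func main() {
	blk, err := parseFireLine("FIRE 342342342 deadbeef 342342341 feedface 123123123 ZGVtbw==")
	if err != nil {
		panic(err)
	}
	fmt.Println(blk.Num, string(blk.Payload))
}
```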

The order doesn't matter, but all the specs:

Block cannot be <= to LIB

Implement the node-manager layer and deal with those chains as to how to do that layer. A few scripts? A small program?

I do not understand this one; aren't we defining the spec needed for a program to be read by an agnostic reader-node app?

What about the necessary kill cycle when our node wants to make a backup?

That can be specified in the backup spec directly, and it probably makes even more sense there: one backup spec could require a stop while another does not, so the fact that a restart is needed should, to me, be part of the backup spec.

abourget commented 11 months ago

ok, addressed your comments above and updated the post, can you review? thanks for the input!

abourget commented 10 months ago

We should add:

abourget commented 10 months ago

Add?

FIRE INIT [READER_PROTOCOL_VERSION] [DATA_VERSION] sf.ethereum.type.v2.Block

Hmm.. the [DATA_VERSION] ought to be embedded within the type.v2.Block itself, having the payload carry the data version within. That way we don't need to have it on the INIT line..

abourget commented 10 months ago

Do we have an expectation that upon restart, the reader program would continue where it left off? That's the behavior with geth right now.. it depends on the state under the reader, and it is not piloted by the node-manager stack.

maoueh commented 10 months ago

Do we have an expectation that upon restart, the reader program would continue where it left off? That's the behavior with geth right now.. it depends on the state under the reader, and it is not piloted by the node-manager stack.

That is correct, the "reader-node" program is expected to start back at the very next block.

sduchesneau commented 10 months ago

Reader-node starts at the very next block --> for readers that don't do this by design (ex: poller sucker), a recommendation would be to keep a small "cursor" file on disk that includes the last block and the LIB (so that the poller sucker will know if it needs to go back a few blocks because the last read block got reorged). That cursor file could be replaced by the user with a single block number, ex: echo 123455 > cursor, and it would restart from there. Simple; the user controls his "node".

sduchesneau commented 10 months ago

Regarding backups and node-manager expectations:

backups:

node-manager expectations:

sduchesneau commented 10 months ago

Note: things that we "lose" by going down this road:

matthewdarwin commented 10 months ago

That cursor file could be replaced by the user to contain a single block number, ex: echo 123455 > cursor and it would restart from there. simple, user controls his "node".

The arweave firehose poller-sucker works like this already with a small cursor file.

matthewdarwin commented 10 months ago
  • If the user needs some logic pre-start or pre-stop, they can provide a "script" to run instead of binary+args and handle whatever they want in there; that's their option.

We do this all the time.

abourget commented 10 months ago

For those last two notes, here are strategies that allow an extensible, blockchain-agnostic core for Substreams and Firehose:

abourget commented 10 months ago

The tool download-from-firehose needs a ToProto, because it extracts some values from within the ...ethereum.type.v2.Block to reconstruct a bstream.Block. This will pose a problem for genericizing that particular tool, as it needs to understand that block.

abourget commented 10 months ago

To handle download-from-firehose, we need to add the parent_num (or parent_block_num) to bstream.Block.

Also, in sf.firehose.v2.Response, add a few fields contained within the bstream.Block as top-level fields. This way, download-from-firehose can reconstruct the bstream.Block on the other end in a generic way.
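To illustrate, a hypothetical sketch of that generic reconstruction, assuming sf.firehose.v2.Response grew top-level number/hash/parent_num/LIB fields as proposed; the struct shapes and names here are invented for illustration and are not the real protobuf definitions:

```go
// Sketch only: stand-ins for the proposed top-level Response fields and
// for the shape of bstream.Block, to show why no chain-specific ToProto
// would be needed once those fields are lifted to the top level.
package main

import "fmt"

// responseHeader stands in for the proposed top-level fields on
// sf.firehose.v2.Response.
type responseHeader struct {
	Number     uint64
	Hash       string
	ParentNum  uint64 // the field this thread proposes adding
	ParentHash string
	LibNum     uint64
	Payload    []byte // opaque chain-specific block, never inspected
}

// genericBlock mimics the shape of a bstream.Block that a generic
// download-from-firehose could write, without decoding the payload.
type genericBlock struct {
	Number, ParentNum, LibNum uint64
	Hash, ParentHash          string
	Payload                   []byte
}

// fromResponse needs no chain-specific knowledge: every field it touches
// is top-level, which is exactly the point of lifting them there.
func fromResponse(h responseHeader) genericBlock {
	return genericBlock{
		Number: h.Number, ParentNum: h.ParentNum, LibNum: h.LibNum,
		Hash: h.Hash, ParentHash: h.ParentHash, Payload: h.Payload,
	}
}

func main() {
	b := fromResponse(responseHeader{Number: 10, Hash: "aa", ParentNum: 9, ParentHash: "bb", LibNum: 5})
	fmt.Println(b.Number, b.ParentNum, b.LibNum)
}
```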

abourget commented 10 months ago

Layout of chain-specific repo:

Ex: streamingfast/firehose-bitcoin

README.md
- Reader in `nodeos --firehose-enabled`
- Proto def in solana/source-code/confirmed_blocks.proto

- Configuration to boot:
  firecore start --reader-node fireeth

firehose-bitcoin-btc-reader/main.go  VERSION
firehose-bitcoin-tools/main.go
pkg/polling-implementation.go
wasm-extensions/[eth_calls]  VERSION
substreams-crates/src/bitcoin-decoding-stuff.rs  VERSION
tools/[non-generic-tools]
substreams-explorer/
substreams.yaml -> `extract_firehose_blocks` produces an `.spkg`  VERSION
proto/sf/protocol/type/v1/block.proto VERSION

Releases can be kept in sync with the tag of the repo. Some releases might not include all of the pieces in there (maybe you don't release the spkg if only the tools are updated with a new tag..)

abourget commented 10 months ago

Some work done to genericize:

abourget commented 10 months ago

We can remove the payload_version from the bstream.Block, and remove the checks for acceptance of that version.

We have decided to go with type.v1 and type.v2 when making large version bumps of the Ethereum Block, for instance. And we have a Ver within that Ethereum block for knowing the content revision, taken from the FIRE INIT line and interpreted by the Reader, or simply produced with a certain revision of the reader node (hard-coded in the reader version when it acts differently).

abourget commented 10 months ago

This will allow stats to be gathered at the READER layer and brought into firecore. We can extract the throughput metrics from firehose-ethereum and bring them back into a generic firecore.