Question: why does `firehose` need access to one-block and merged-block storage?

xJonathanLEI commented 2 years ago

If this data-flow diagram from this page is accurate, the firehose component should only need access to the relayer(s), which would in turn fetch data from upstream storage (and live readers):

[Image: general architecture diagram]

However, it looks like the firehose component is requesting storage URLs:

https://github.com/streamingfast/firehose-acme/blob/6966e1a3aaf49d2d398686333967299e97bde05b/cmd/fireacme/cli/firehose.go#L108-L109

Just in case it's requested but not used in the actual code, I tried pointing them to an empty folder, but then the progress stops functioning. So it seems like firehose indeed needs access to those. Is the diagram outdated or am I misunderstanding something here? Much thanks in advance!

xJonathanLEI commented 2 years ago

Also related:

The diagram states that the relayer connects to both the readers and the merged block stores. However, the actual code seems to indicate that the relayer doesn't really care about merged blocks:

https://github.com/streamingfast/firehose-acme/blob/6966e1a3aaf49d2d398686333967299e97bde05b/cmd/fireacme/cli/relayer.go#L27-L32

In fact, it only cares about reader gRPC URLs and one-block storage. Nothing about merged blocks has been requested.

From what I've found so far, it looks like only the front-end process firehose is dealing with merged blocks, not the relayers, meaning that the diagram is indeed outdated. Is that correct?

xJonathanLEI commented 2 years ago

Actually, another thing that further confuses me is: why would the relayer need access to the one-block storage, if it:

- already has direct access to reader nodes via gRPC; and
- only has to relay the block stream (assuming it really doesn't handle merged blocks)?

Thanks a lot!

maoueh commented 2 years ago

> Just in case it's requested but not used in the actual code, I tried pointing them to an empty folder, but then the progress stops functioning. So it seems like firehose indeed needs access to those. Is the diagram outdated or am I misunderstanding something here? Much thanks in advance!

The diagram is indeed outdated, and has been since recently (I would say about a month or so). We refactored a bunch of internals to remove forked blocks from merged blocks and to improve how we bridge the "live" segment of the chain (one-blocks) and the historical segment (merged blocks). Components that need to do this bridging now access both the one-block store and the merged-blocks store.
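To illustrate the bridging idea described above, here is a minimal sketch in Go. It is not the actual Firehose code: `PlanRead`, `Segment`, and the `lastMerged` parameter are hypothetical names, and blocks are identified by number only. It assumes that everything up to `lastMerged` can be served from merged-block storage and anything newer must come from one-blocks.

```go
package main

import "fmt"

// Source names where a block range would be read from in this simplified model.
type Source string

const (
	MergedBlocks Source = "merged-blocks" // historical segment
	OneBlocks    Source = "one-blocks"    // recent ("live") segment
)

// Segment is a contiguous, inclusive range of block numbers served by one source.
type Segment struct {
	From, To uint64
	Source   Source
}

// PlanRead splits the inclusive range [start, head] into a historical part
// read from merged blocks and a recent part read from one-blocks.
// lastMerged is the highest block number assumed to exist in merged-block
// storage (an assumption of this sketch, not a Firehose parameter).
func PlanRead(start, head, lastMerged uint64) []Segment {
	if start > head {
		return nil
	}
	var plan []Segment
	if start <= lastMerged {
		to := lastMerged
		if head < to {
			to = head
		}
		plan = append(plan, Segment{From: start, To: to, Source: MergedBlocks})
		start = to + 1
	}
	if start <= head {
		plan = append(plan, Segment{From: start, To: head, Source: OneBlocks})
	}
	return plan
}

func main() {
	// A request starting in history and running up to the chain head needs both stores.
	for _, s := range PlanRead(50, 250, 199) {
		fmt.Printf("%d-%d from %s\n", s.From, s.To, s.Source)
	}
}
```

In this model, a component that serves ranges crossing the boundary needs both stores, which is consistent with firehose requesting both storage URLs.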

> The diagram states that the relayer connects to both the readers and the merged block stores. However, the actual code seems to indicate that the relayer doesn't really care about merged blocks:

Indeed, no merged blocks are accessed in the relayer. However, the diagram is still correct (though it could be clearer), because one-blocks are stored in the object store, so the relayer does access it. The diagram should probably be split into two object stores, one for one-blocks and one for merged blocks, to make it more precise.

You are right that live blocks come from the "reader" node, but it is a "hot" source of blocks, meaning that blocks are simply broadcast on the gRPC connection as the reader node reads them. If a relayer disconnects for, say, 30 seconds, then on resume it will have missed a few blocks. In those cases, fetching from the one-block store fills the holes the relayer has not seen, which makes it ready faster than waiting to receive more live blocks from the reader node(s).
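The hole-filling behaviour can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the relayer's real implementation: `OneBlockStore`, `mapStore`, and `FillGap` are hypothetical names, blocks are keyed by number only, and forks are ignored.

```go
package main

import "fmt"

// Block is a simplified stand-in for a chain block; only the number matters here.
type Block struct {
	Num uint64
}

// OneBlockStore is a hypothetical read-only view of one-block storage.
type OneBlockStore interface {
	Get(num uint64) (Block, bool)
}

// mapStore is an in-memory OneBlockStore used only for this example.
type mapStore map[uint64]Block

func (s mapStore) Get(num uint64) (Block, bool) {
	b, ok := s[num]
	return b, ok
}

// FillGap returns the blocks strictly between lastSeen and nextLive, read
// from the store. It stops at the first missing block so the result is
// always contiguous starting at lastSeen+1.
func FillGap(store OneBlockStore, lastSeen, nextLive uint64) []Block {
	var out []Block
	for n := lastSeen + 1; n < nextLive; n++ {
		b, ok := store.Get(n)
		if !ok {
			break
		}
		out = append(out, b)
	}
	return out
}

func main() {
	store := mapStore{101: {Num: 101}, 102: {Num: 102}, 103: {Num: 103}}
	// The relayer saw block 100 before disconnecting; the first live block
	// received after reconnecting is 104.
	for _, b := range FillGap(store, 100, 104) {
		fmt.Println("backfilled block", b.Num)
	}
}
```

Without the store, the relayer in this model would have no way to recover blocks 101 to 103, since the gRPC stream only carries blocks broadcast while it is connected.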

xJonathanLEI commented 2 years ago

That's very clear. Thanks a lot!
