streamingfast / firehose-ethereum

Ethereum on StreamingFast

Merger stuck at block 31000 #38

Closed paymog closed 2 years ago

paymog commented 2 years ago

It seems that my merger is now stuck at block 31000. I see the following log line emitted over and over

firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"WARNING","timestamp":"2022-10-21T15:44:42.420585258-04:00","logger":"bstream","message":"too many consecutive unlinkable blocks","block":"#109903 (b4a118481e568b1c)","consecutive_unlinkable_blocks":77100,"last_block_sent":"#31272 (00689cb504900f40)","logging.googleapis.com/labels":{}}

and here's a more complete sample of log lines:

firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:36.322716469-04:00","logger":"index-builder","message":"reading from blocks store: file does not (yet?) exist, retrying in","filename":"/data/firehose/storage/merged-blocks/0000031000.dbin.zst","base_filename":"0000031000","retry_delay":4,"logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:37.068411556-04:00","logger":"reader.geth","message":"forkchoice requested sync to new head    number=15,798,650 hash=d179b5..c50f7d","logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:40.322901399-04:00","logger":"index-builder","message":"reading from blocks store: file does not (yet?) exist, retrying in","filename":"/data/firehose/storage/merged-blocks/0000031000.dbin.zst","base_filename":"0000031000","retry_delay":4,"logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:40.834577546-04:00","logger":"reader.geth","message":"imported new chain segment               blocks=71   txs=389  mgas=8.229  elapsed=8.235s    mgasps=0.999  number=369,258    hash=f8be77..e6c618 age=7y1mo2w  dirty=5.43MiB","logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"WARNING","timestamp":"2022-10-21T15:44:42.420585258-04:00","logger":"bstream","message":"too many consecutive unlinkable blocks","block":"#109903 (b4a118481e568b1c)","consecutive_unlinkable_blocks":77100,"last_block_sent":"#31272 (00689cb504900f40)","logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:44.323103322-04:00","logger":"index-builder","message":"reading from blocks store: file does not (yet?) exist, retrying in","filename":"/data/firehose/storage/merged-blocks/0000031000.dbin.zst","base_filename":"0000031000","retry_delay":4,"logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:48.323232598-04:00","logger":"index-builder","message":"reading from blocks store: file does not (yet?) exist, retrying in","filename":"/data/firehose/storage/merged-blocks/0000031000.dbin.zst","base_filename":"0000031000","retry_delay":4,"logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:48.996961958-04:00","logger":"reader.geth","message":"imported new chain segment               blocks=804  txs=708  mgas=15.726 elapsed=8.115s    mgasps=1.938  number=370,062    hash=266bdd..3b7689 age=7y1mo2w  dirty=4.67MiB","logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"INFO","timestamp":"2022-10-21T15:44:49.077330751-04:00","logger":"reader.geth","message":"forkchoice requested sync to new head    number=15,798,651 hash=31b649..2c1d7c","logging.googleapis.com/labels":{}}
firehose-mainnet-6f8bbfbd75-gcwpc firehose {"severity":"WARNING","timestamp":"2022-10-21T15:44:50.162217455-04:00","logger":"bstream","message":"too many consecutive unlinkable blocks","block":"#110003 (7b770ad57aa913a8)","consecutive_unlinkable_blocks":77200,"last_block_sent":"#31272 (00689cb504900f40)","logging.googleapis.com/labels":{}}

Are the unlinkable blocks an issue? If so, how can I resolve them?

maoueh commented 2 years ago

So, this warning can sometimes be a false positive when a lot of blocks are ingested in a short amount of time, which is the case when reprocessing a chain from scratch. It's a real problem only when the consecutive unlinkable blocks count keeps increasing forever.

It's possible that it's a true positive, though. An unclean shutdown of firehose-ethereum can cause a hole in the one blocks. Even a clean shutdown can cause a hole if there is a bug somewhere in firehose-ethereum; we've seen and fixed such a bug in the past, and others could still exist.

To resolve a one-block hole, you need to "rewind" the geth instance back in time to "reprocess" the missing block, which you can do with geth's setHead.
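As a rough sketch of what that rewind could look like (the IPC path and block numbers below are placeholders for your setup, not values from this thread), you attach to the running geth instance and call `debug.setHead` with the target block number in hex:

```shell
# Pick a rewind target safely below the hole (here: hole near 31000, target 30500).
TARGET=30500
TARGET_HEX=$(printf '0x%x' "$TARGET")
echo "rewinding to block $TARGET ($TARGET_HEX)"

# Attach to the running geth instance and rewind its head so the missing
# blocks get reprocessed. The IPC path is an example; point it at the
# geth.ipc inside your node's actual data dir.
if command -v geth >/dev/null; then
  geth attach /data/geth/geth.ipc --exec "debug.setHead(\"$TARGET_HEX\")"
fi
```

After the rewind, geth re-imports the blocks from the target forward, and the reader writes the previously missing one-block files as they pass by again.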

To reduce the chance of having a hole, deploying 2 reader nodes syncing the chain together, both writing to a shared one-block store, greatly reduces the chance that a hole is created.

Have you solved your problem?

paymog commented 2 years ago

ah got it! I ended up wiping the data directory and starting from scratch which seemed to fix it. Good to know there's a way to resolve this with geth. I guess in this specific case, I should use a block prior to 31000 when doing the rewind. Is that correct? Is it better to go back further in time?

If I have two reader nodes sharing the same one-block store, should I make sure they don't share the same data-dir? eg: have two reader node pods (in k8s) with independent volumes for the datadir but use the same s3 bucket for the one-block, merged, forked and index dirs?

maoueh commented 2 years ago

Is that correct? Is it better to go back further in time?

Yeah, you are correct; you should probably use something like hole - 500 as the input, it's safe to go back a bit in time. This early in the chain, however, setHead actually starts back from genesis anyway, so it's easier to simply restart.

If I have two reader nodes sharing the same one-block store, should I make sure they don't share the same data-dir?

Exactly, they each have their own data dir so they both sync with the chain, but as you said, they both write to the same one-block store; the merger is intelligent and will delete older files that are not required.

When using this setup, you can configure the flag reader-node-oneblock-suffix differently for each reader-node; this helps on some storage engines. Setting the flag makes each reader-node write independent files representing the same block, which helps deal with the read/write consistency problems on the same filename that we've seen in the past on some storage solutions, like Ceph. The merger handles those files correctly.
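As a sketch of the two-reader setup (the bucket URL, paths, and suffix values here are illustrative placeholders, not from this thread), the per-reader config could look roughly like this, with the data dir unique per reader and only the one-block store shared:

```yaml
# reader-a config (reader-b is identical except for the suffix and data dir)
start:
  args:
    - reader-node
  flags:
    common-one-block-store-url: s3://my-bucket/one-blocks  # shared by both readers
    reader-node-oneblock-suffix: reader-a                  # unique per reader
    reader-node-data-dir: /data/reader-a                   # NOT shared between readers
```

In Kubernetes terms, that maps to two reader pods with independent volumes for the data dir, both pointed at the same object-store bucket for one-blocks.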

paymog commented 2 years ago

Awesome, thank you for all the help. I'll try to make a PR in the next couple days to document all my learnings so other devs have an easier time getting things up and running.

maoueh commented 2 years ago

@seanmooretechwriter is our technical writer; pretty sure he could also make a few adjustments by looking at all the conversations we've had over the past weeks :)

paymog commented 2 years ago

Oh right! @seanmooretechwriter I'd be happy to chat (over text/audio/video) if you'd like to get an overview of what I've learned - hopefully that will make writing docs a bit easier 🙂

seanmooretechwriter commented 2 years ago

Hi @paymog

Good to hear from you!

Ok, this sounds like a great idea. Please let me know what your availability looks like and we can set up a time and day to chat and go over all of this. I'll read through more of this Git issue prior to our chat.

Thank you a ton for your time and the awesome and valuable feedback! :)

cc: @maoueh

seanmooretechwriter commented 2 years ago

The call went well and we were able to gather useful and insightful input for areas of improvement for the Firehose documentation.

Alex provided great notes encapsulating the input and we've identified a few areas to address in the very near future.