mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

sync stuck and infinite "All known, ignoring" log messages #1612

Closed: garyyu closed this issue 6 years ago

garyyu commented 6 years ago

bergr01 @bergr01 10:13 after fetching the latest code and rebuilding the server, after starting it up I am seeing repeated 'Received block headers' followed by 'All known, ignoring'. Is that normal/expected? (As a result, the current status has been sitting at Downloading headers: 97% for several minutes now.)

Gary Yu @garyyu 10:17 share the grin-server.log file to me please, I will check it.

bergr01 @bergr01 11:51 there you go @garyyu https://pastebin.com/09LcRRkw

Looks like another regression. Somehow the header head is inconsistent with the store:

Sep 30 13:46:31.923 DEBG chain: init: rewinding and validating before we start... 08523a3e at 114132
...
Sep 30 13:47:03.057 DEBG sync: initial transition to HeaderSync. sync_head: d5ae8ed3 at 114132, reset to: d5ae8ed3 at 114132
Sep 30 13:47:03.058 DEBG sync_state: sync_status: Initial -> HeaderSync { current_height: 114132, highest_height: 117233 }
Sep 30 13:47:03.058 DEBG sync: locator heights : [114132, 114130, 114126, 114118, 114102, 114070, 114006, 113878, 113622, 113110, 112086, 110038, 105942, 97750, 81366, 48598, 0]
Sep 30 13:47:04.237 DEBG sync: locator: [08523a3e, 0f19b6aa, 11a87413, 012d53fd, 1b96fc73, 05f336ad, 07c60dbe, 090c0fa1, 01aa9c11, 08ed2acb, 06a9397c, 0789d465, 09fb91e5, 00b7d789, 0658ea2d, 050cb927, 006642e0]
...
Sep 30 13:47:04.482 INFO Received block headers [001a5c5b, 0aca3e50, 0620a819, 113d1b8f, 14054cba, 0cf9e257, 1bcc7ad3, 0947664b] from 204.48.26.36:13414
Sep 30 13:47:04.483 INFO All known, ignoring

Checked: hash 001a5c5b at https://grinscan.net/block/114133 is correct. So the problem is the header head.

I suspect this regression was caused by the recent refactoring since 26th Sep.
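The locator heights in the log above follow an exponential back-off: each step back doubles, from 2 blocks up to 32768, ending at genesis (height 0). A minimal sketch (not the actual grin function, just a reconstruction of the pattern visible in the log) that reproduces that exact list:

```rust
// Sketch of exponentially spaced locator heights, matching the log line
// "sync: locator heights : [114132, 114130, 114126, 114118, ...]".
fn locator_heights(tip: u64) -> Vec<u64> {
    let mut heights = Vec::new();
    let mut current = tip;
    let mut step = 2u64;
    loop {
        heights.push(current);
        if current == 0 {
            break; // genesis reached; locator always ends at height 0
        }
        current = current.saturating_sub(step);
        step *= 2; // double the stride each time
    }
    heights
}

fn main() {
    let h = locator_heights(114132);
    // Matches the 17 heights printed in the log above.
    assert_eq!(h.len(), 17);
    assert_eq!(&h[..5], &[114132, 114130, 114126, 114118, 114102]);
    assert_eq!(*h.last().unwrap(), 0);
    println!("{:?}", h);
}
```

The point of this spacing is that a peer can locate the fork point cheaply: recent history is covered densely, deep history sparsely.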

antiochp commented 6 years ago

So we start and see this -

...before we start... 08523a3e at 114132

Then we see this -

sync_head: d5ae8ed3 at 114132, reset to: d5ae8ed3 at 114132

So there's a fork at height 114132 and that's the exact height we're starting up at.

I bet we're just looking at height here somewhere and not taking total work into consideration during chain init.
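The distinction antiochp is pointing at can be sketched in a few lines: at a fork, two heads share the same height, so height alone cannot order them; cumulative work can. (Grin's real `Tip` carries a proper `Difficulty` type; `u64` here is a stand-in, and the fork's difficulty value below is an invented example.)

```rust
// Minimal sketch: picking a chain head by total work, not by height.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Tip {
    height: u64,
    total_difficulty: u64,
}

fn heavier_head(a: Tip, b: Tip) -> Tip {
    // Never decide by height; total accumulated work is what orders forks.
    if a.total_difficulty >= b.total_difficulty { a } else { b }
}

fn main() {
    // Two heads at the same height 114132, as in the log above.
    // 48412341 is the total difficulty from "Chain init: 48412341 @ 114132";
    // the fork's value is a made-up, slightly smaller number.
    let main_chain = Tip { height: 114132, total_difficulty: 48_412_341 };
    let fork = Tip { height: 114132, total_difficulty: 48_412_000 };
    assert_eq!(heavier_head(main_chain, fork), main_chain);
}
```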

antiochp commented 6 years ago

Node is at 08523a3e at 114132. sync_head is currently d5ae8ed3 at 114132

We receive the headers from 001a5c5b onward and then decide All known, ignoring.

What do we do in this situation if we ignore all headers?

(Rhetorical question I guess - we just keep asking for the same headers over and over).

We have code in place to handle a badly timed shutdown during block processing. But I don't think we have anything similar in place for a header sync. We may be in the middle of processing a batch of headers, storing them in the db, but we have not yet updated the header_head (or the sync_head).

On restart we request headers based on a locator built from the (now stale) sync_head. And we fail quickly with an All known, ignoring as these headers are already in the db.

And if this is more than 511 headers ago we cannot fix this in a single batch of headers and the sync_head will never get updated, so we end up in this infinite loop, asking for the same headers again and again.

@garyyu I think you suggested updating the sync_head if we encountered an All known, ignoring during header sync? This may be a fix for this situation. If we receive 511 headers during header sync and they are "All known" then we should update header_head and sync_head to the last header (the 511th one) and continue.
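The proposed workaround can be sketched as follows (a hypothetical illustration, with header hashes modelled as `u64` and the 511 batch size taken from the discussion above, not from the actual code): when a full batch comes back entirely known, advance sync_head past it instead of ignoring it, which breaks the request loop.

```rust
use std::collections::HashSet;

// Batch size quoted in the thread; an assumption, not read from the code.
const BATCH_SIZE: usize = 511;

// Sketch of the workaround: a full batch of already-known headers
// advances sync_head to its last header so the next locator starts
// past the stale range, instead of re-requesting the same headers.
fn handle_headers(known: &HashSet<u64>, headers: &[u64], sync_head: &mut u64) {
    let all_known = headers.iter().all(|h| known.contains(h));
    if all_known && headers.len() == BATCH_SIZE {
        *sync_head = *headers.last().unwrap(); // jump forward
    }
    // A short "all known" batch is genuinely nothing new: ignore it.
}

fn main() {
    let known: HashSet<u64> = (0..2000).collect();
    let mut sync_head = 100u64;

    // A full batch of headers we already have in the db.
    let batch: Vec<u64> = (101..101 + BATCH_SIZE as u64).collect();
    handle_headers(&known, &batch, &mut sync_head);
    assert_eq!(sync_head, 611); // advanced to the batch's last header

    // A short all-known batch leaves sync_head untouched.
    handle_headers(&known, &[101, 102], &mut sync_head);
    assert_eq!(sync_head, 611);
}
```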

What I don't understand is how or where we updated header_head to an earlier (with less total work) header if we know "later" headers (with presumably more work). I think the only place we could do this is during chain::init() if we discover our current head (block chain itself) is somehow invalid and we rewind back through full blocks to find a point where the chain fully validates.

garyyu commented 6 years ago

@antiochp agree with you. Do you want me to try this fix, or will you do it yourself?

garyyu commented 6 years ago

@antiochp regarding:

Node is at 08523a3e at 114132.

Could you please explain more about this? I'm confused about why this happens. Indeed 08523a3e at 114132 is on the main chain, but at this time both sync_head and header_head are d5ae8ed3 at 114132.

chain: init: rewinding and validating before we start... 08523a3e at 114132

How does this init know about 08523a3e at 114132?

garyyu commented 6 years ago

The only possibility I can imagine is: the node indeed received all those blocks (from 114133 to 114643) but failed to update the header_head. So it will be interesting to see the logs before Sep 30 13:46:31.923.

@bergr01 please 😄

garyyu commented 6 years ago

The fix in #1619 is just a workaround, but it surely makes sense to have it. We still need to find the root cause if possible.

According to @bergr01 's new attached log for time before Sep 30 13:46:31.923: https://dumpz.org/bRgex8chsm9F

Sep 30 12:10:06.480 DEBG chain: txhashset_write: finished validating and rebuilding
Sep 30 12:10:06.480 DEBG sync_state: sync_status: TxHashsetValidation { kernels: 119554, kernel_total: 119555, rproofs: 86307, rproof_total: 128057 } -> TxHashsetSave
...
Sep 30 12:11:17.232 DEBG monitor_peers: no preferred peers
    <<<< Note: there's no log at all between 12:11:17 and 12:11:29; I guess the process was killed with signal -9.

Sep 30 12:11:29.820 INFO Using configuration file at /Users/gregberesnev/.grin/grin-server.toml
Sep 30 12:11:29.820 INFO This is Grin version 0.3.0 (git 0.0.0-pre.0.3.0-6-ga13c20c), built for x86_64-apple-darwin by rustc 1.28.0 (9634041f0 2018-07-30).
...

Sep 30 12:11:29.821 INFO Starting server, genesis block: 006642e0
Sep 30 12:11:29.942 DEBG prune_list: bitmap 18562 pos (23182 bytes), pruned_cache 64924 pos (27548 bytes), shift_cache 18562, leaf_shift_cache 18562
Sep 30 12:11:30.062 DEBG prune_list: bitmap 18562 pos (23182 bytes), pruned_cache 64924 pos (27548 bytes), shift_cache 18562, leaf_shift_cache 18562
Sep 30 12:11:30.062 DEBG Error returned, discarding txhashset extension: Invalid Root
 Cause: Unknown
 Backtrace:
Sep 30 12:11:30.064 DEBG chain: init: rewinding and validating before we start... 08523a3e at 114132
Sep 30 12:11:30.067 DEBG Chain init: 48412341 @ 114132 [08523a3e]
...
Sep 30 12:12:01.197 DEBG sync: initial transition to HeaderSync. sync_head: af3ba04e at 117124, reset to: d5ae8ed3 at 114132
Sep 30 12:12:01.198 DEBG sync_state: sync_status: Initial -> HeaderSync { current_height: 114132, highest_height: 117129 }
Sep 30 12:12:01.198 DEBG sync: locator heights : [114132, 114130, 114126, 114118, 114102, 114070, 114006, 113878, 113622, 113110, 112086, 110038, 105942, 97750, 81366, 48598, 0]
Sep 30 12:12:02.395 DEBG sync: locator heights': [114132, 114130, 114126, 114118, 114102, 114070, 114006, 113878, 113622, 113110, 112086, 110038, 105942, 97750, 81366, 48598, 0]
Sep 30 12:12:02.396 DEBG sync: history locators: len=17, shrunk=0
Sep 30 12:12:02.396 DEBG sync: locator: [08523a3e, 0f19b6aa, 11a87413, 012d53fd, 1b96fc73, 05f336ad, 07c60dbe, 090c0fa1, 01aa9c11, 08ed2acb, 06a9397c, 0789d465, 09fb91e5, 00b7d789, 0658ea2d, 050cb927, 006642e0]
Sep 30 12:12:02.396 DEBG sync: request_headers: asking 204.48.26.36:13414 for headers, [08523a3e, 0f19b6aa, 11a87413, 012d53fd, 1b96fc73, 05f336ad, 07c60dbe, 090c0fa1, 01aa9c11, 08ed2acb, 06a9397c, 0789d465, 09fb91e5, 00b7d789, 0658ea2d, 050cb927, 006642e0]
Sep 30 12:12:02.644 INFO Received block headers [001a5c5b, 0aca3e50, 0620a819, 113d1b8f, 14054cba, 0cf9e257, 1bcc7ad3, 0947664b] from 204.48.26.36:13414
Sep 30 12:12:02.644 INFO All known, ignoring

I see the procedure was:

  1. Fast sync got stuck and the TUI froze because of bug https://github.com/mimblewimble/grin/issues/1611.
  2. Somebody killed the grin process with signal -9, which is definitely not recommended.
  3. Somebody restarted grin.
  4. sync_head: af3ba04e at 117124, reset to: d5ae8ed3 at 114132 means header_head is on a fork (not a deep fork, just one block). Still OK.
  5. sync_state: sync_status: Initial -> HeaderSync { current_height: 114132, highest_height: 117129 }. Still OK.
  6. sync: locator heights : [114132, ... Still OK.
  7. sync: locator: [08523a3e, ... Problem here! Since we already reset sync_head to d5ae8ed3 at 114132, why do we get 08523a3e here? I think this is the root cause.

garyyu commented 6 years ago

@antiochp could you help take a look? I'm lost; this seems impossible:

fn do_header_sync(
            ...
            debug!(
                LOGGER,
                "sync: initial transition to HeaderSync. sync_head: {} at {}, reset to: {} at {}",
                sync_head.hash(),
                sync_head.height,
                header_head.hash(),
                header_head.height,
            );
            chain.init_sync_head(&header_head).unwrap();

The chain.init_sync_head above just reset sync_head to d5ae8ed3 at 114132, but the following chain.get_sync_head() still gets 08523a3e at 114132! How come?

fn get_locator(
    ...
    let tip = chain.get_sync_head()?;
antiochp commented 6 years ago

You "cache" the locator in history_locators - is this involved somehow in getting the incorrect block hashes? We're not reading them from the db.

We read the tip at height x, but we have a cached block hash for that height and use that instead?

garyyu commented 6 years ago

You "cache" the locator in history_locators - is this involved somehow in getting the incorrect block hashes?

No. I call history_locators.clear() right after chain.init_sync_head():

            chain.init_sync_head(&header_head).unwrap();
            history_locators.clear();

The only possible explanation for this incorrect block hash: database corruption?

antiochp commented 6 years ago

:woman_facepalming: Looks like there are some surprises in store.init_sync_head()...

I don't think it is doing what we necessarily thought it was doing.

antiochp commented 6 years ago

See #1624, which cleans up the init_sync_head() weirdness.

garyyu commented 6 years ago

Gary Yu @garyyu 21:44 I don’t understand, what’s the difference between self.store.get_header_head() and self.get_header_head()? @antiochp

Antioch Peverell @antiochp 21:45 .store hits the db directly, ignoring the current db tx (which may have uncommitted updates in it). We don't want to do that from within a batch or we're going to read inconsistent data. But the main issue was that reset_sync_head() was ignoring the head passed in if there was already a header_head in the db, and just defaulted to using the header_head from the db. This is what it used to do: https://github.com/mimblewimble/grin/blob/6ad54b98407d0e80d335498a6848135ed1ea8b30/chain/src/store.rs#L209-L219
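A hedged reconstruction of that old behaviour (heavily simplified: the db is a `HashMap`, hashes are `u64`, and the function name mirrors the thread, not the exact code): the tip the caller passes in is silently discarded whenever a header_head already exists in the db, which is exactly the d5ae8ed3-in, 08523a3e-out symptom seen above.

```rust
use std::collections::HashMap;

// Simplified model of the surprising store-level reset: if the db
// already holds a header_head, the caller's `tip` is ignored and the
// stored header_head is used as the new sync_head instead.
fn init_sync_head(db: &mut HashMap<&'static str, u64>, tip: u64) {
    let head = *db.get("header_head").unwrap_or(&tip);
    db.insert("sync_head", head); // `tip` is dropped if header_head exists
}

fn main() {
    let mut db = HashMap::new();
    db.insert("header_head", 0x08523a3e_u64); // stale value in the db

    init_sync_head(&mut db, 0xd5ae8ed3); // caller asks for d5ae8ed3...
    assert_eq!(db["sync_head"], 0x08523a3e); // ...but 08523a3e wins

    // With no pre-existing header_head, the passed tip is honoured.
    let mut fresh = HashMap::new();
    init_sync_head(&mut fresh, 0xd5ae8ed3);
    assert_eq!(fresh["sync_head"], 0xd5ae8ed3);
}
```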

Gary Yu @garyyu 21:50 Oops! Now I understand~

Antioch Peverell @antiochp 21:50 It used to be used exclusively in chain::init(), where that made more sense (just on first-time initialization), but we (recently) started using it in sync.