Closed: garyyu closed this issue 6 years ago.
So we start and see this:

`...before we start... 08523a3e at 114132`

Then we see this:

`sync_head: d5ae8ed3 at 114132, reset to: d5ae8ed3 at 114132`

So there's a fork at height 114132, and that's the exact height we're starting up at.
I bet we're just looking at height here somewhere and not taking total work into consideration during chain init.
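The point about using total work rather than height can be illustrated with a minimal sketch. The `Tip` struct and field names here are simplified stand-ins, not Grin's actual types: two forks at the same height can only be ordered by cumulative difficulty.

```rust
// Simplified stand-in for a chain tip. Height alone cannot order two
// competing forks at the same height; cumulative difficulty (total work) can.
#[derive(Debug, Clone, PartialEq)]
struct Tip {
    height: u64,
    total_difficulty: u64,
}

// Pick the better of two tips by total work, not by height.
fn better_tip(a: Tip, b: Tip) -> Tip {
    if b.total_difficulty > a.total_difficulty {
        b
    } else {
        a
    }
}
```

With two tips both at height 114132 but different work, comparing heights is a tie while comparing total difficulty picks a unique winner.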
Node is at `08523a3e at 114132`. `sync_head` is currently `d5ae8ed3 at 114132`. We receive the headers from `001a5c5b` onward and then decide "All known, ignoring".

What do we do in this situation if we ignore all headers? (Rhetorical question, I guess: we just keep asking for the same headers over and over.)
We have code in place to handle a badly timed shutdown during block processing.
But I don't think we have anything similar in place for a header sync.
We may be in the middle of processing a batch of headers, storing them in the db, but have not yet updated the `header_head` (or the `sync_head`). On restart we request headers based on a locator built from the (now stale) `sync_head`, and we fail quickly with an "All known, ignoring" because these headers are already in the db. And if this was more than 511 headers ago, we cannot fix it in a single batch of headers: the `sync_head` will never get updated, so we end up in an infinite loop, asking for the same headers again and again.
@garyyu I think you suggested updating the `sync_head` if we encountered an "All known, ignoring" during header sync? This may be a fix for this situation. If we receive 511 headers during header sync and they are all known, then we should update `header_head` and `sync_head` to the last header (the 511th one) and continue.
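The suggested workaround could look roughly like this. This is a hedged sketch with simplified, hypothetical types (`Header`, `on_headers_received`, `is_known` are stand-ins, not Grin's actual sync code): when an entire received batch is already known, advance `sync_head` instead of ignoring the batch.

```rust
// Hypothetical sketch: if an entire batch of received headers is already
// known, advance sync_head to the last header in the batch instead of
// ignoring the batch, so the next locator we build makes progress.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    height: u64,
    hash: u64, // stand-in for a real block hash
}

fn on_headers_received<F>(sync_head: &mut Header, batch: &[Header], is_known: F)
where
    F: Fn(&Header) -> bool,
{
    if !batch.is_empty() && batch.iter().all(|h| is_known(h)) {
        // "All known": advance sync_head rather than looping forever.
        *sync_head = batch.last().unwrap().clone();
    }
}
```

Starting from a stale `sync_head` at 114132, a fully known batch up to 114643 would move `sync_head` forward, breaking the loop described above.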
What I don't understand is how or where we updated `header_head` to an earlier header (with less total work) if we know "later" headers (with presumably more work). I think the only place we could do this is during `chain::init()`, if we discover our current head (the block chain itself) is somehow invalid and we rewind back through full blocks to find a point where the chain fully validates.
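The rewind-and-validate step described above can be sketched as follows. This is a simplified model, not the actual `chain::init()` code: blocks are represented by height only, and `is_valid` is a hypothetical stand-in for full block validation.

```rust
// Simplified model of "rewind back through full blocks to find a point
// where the chain fully validates".
fn rewind_to_valid<F>(chain: &mut Vec<u64>, is_valid: F) -> Option<u64>
where
    F: Fn(u64) -> bool,
{
    while let Some(&tip) = chain.last() {
        if is_valid(tip) {
            return Some(tip);
        }
        chain.pop(); // discard the invalid tip and retry with its parent
    }
    None
}
```

Note how this naturally lands the head on an earlier block such as 114132 even when later headers are known, which matches the symptom in the logs.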
@antiochp I agree with you. Do you want me to try this fix, or will you do it yourself?
@antiochp regarding:

> Node is at `08523a3e at 114132`.

Could you please explain more about this? I'm confused about why this happened. Indeed `08523a3e at 114132` is on the main chain, but at this time both `sync_head` and `header_head` are `d5ae8ed3 at 114132`.
`chain: init: rewinding and validating before we start... 08523a3e at 114132`

How does this `init` know `08523a3e at 114132`?

The only possibility I can imagine is that the node indeed received all those blocks (from 114133 to 114643) but failed to update the `header_head`. So it will be interesting to see the logs before `Sep 30 13:46:31.923`.
@bergr01 please 😄
The fix in #1619 is just a workaround, but it surely makes sense to have it. We still need to find the root cause if possible.
According to @bergr01's newly attached log for the time before `Sep 30 13:46:31.923` (https://dumpz.org/bRgex8chsm9F):
```
Sep 30 12:10:06.480 DEBG chain: txhashset_write: finished validating and rebuilding
Sep 30 12:10:06.480 DEBG sync_state: sync_status: TxHashsetValidation { kernels: 119554, kernel_total: 119555, rproofs: 86307, rproof_total: 128057 } -> TxHashsetSave
...
Sep 30 12:11:17.232 DEBG monitor_peers: no preferred peers
<<<< Note: there is no log between :17 and :29; I guess the process was killed by signal -9.
Sep 30 12:11:29.820 INFO Using configuration file at /Users/gregberesnev/.grin/grin-server.toml
Sep 30 12:11:29.820 INFO This is Grin version 0.3.0 (git 0.0.0-pre.0.3.0-6-ga13c20c), built for x86_64-apple-darwin by rustc 1.28.0 (9634041f0 2018-07-30).
...
Sep 30 12:11:29.821 INFO Starting server, genesis block: 006642e0
Sep 30 12:11:29.942 DEBG prune_list: bitmap 18562 pos (23182 bytes), pruned_cache 64924 pos (27548 bytes), shift_cache 18562, leaf_shift_cache 18562
Sep 30 12:11:30.062 DEBG prune_list: bitmap 18562 pos (23182 bytes), pruned_cache 64924 pos (27548 bytes), shift_cache 18562, leaf_shift_cache 18562
Sep 30 12:11:30.062 DEBG Error returned, discarding txhashset extension: Invalid Root
Cause: Unknown
Backtrace:
Sep 30 12:11:30.064 DEBG chain: init: rewinding and validating before we start... 08523a3e at 114132
Sep 30 12:11:30.067 DEBG Chain init: 48412341 @ 114132 [08523a3e]
...
Sep 30 12:12:01.197 DEBG sync: initial transition to HeaderSync. sync_head: af3ba04e at 117124, reset to: d5ae8ed3 at 114132
Sep 30 12:12:01.198 DEBG sync_state: sync_status: Initial -> HeaderSync { current_height: 114132, highest_height: 117129 }
Sep 30 12:12:01.198 DEBG sync: locator heights : [114132, 114130, 114126, 114118, 114102, 114070, 114006, 113878, 113622, 113110, 112086, 110038, 105942, 97750, 81366, 48598, 0]
Sep 30 12:12:02.395 DEBG sync: locator heights': [114132, 114130, 114126, 114118, 114102, 114070, 114006, 113878, 113622, 113110, 112086, 110038, 105942, 97750, 81366, 48598, 0]
Sep 30 12:12:02.396 DEBG sync: history locators: len=17, shrunk=0
Sep 30 12:12:02.396 DEBG sync: locator: [08523a3e, 0f19b6aa, 11a87413, 012d53fd, 1b96fc73, 05f336ad, 07c60dbe, 090c0fa1, 01aa9c11, 08ed2acb, 06a9397c, 0789d465, 09fb91e5, 00b7d789, 0658ea2d, 050cb927, 006642e0]
Sep 30 12:12:02.396 DEBG sync: request_headers: asking 204.48.26.36:13414 for headers, [08523a3e, 0f19b6aa, 11a87413, 012d53fd, 1b96fc73, 05f336ad, 07c60dbe, 090c0fa1, 01aa9c11, 08ed2acb, 06a9397c, 0789d465, 09fb91e5, 00b7d789, 0658ea2d, 050cb927, 006642e0]
Sep 30 12:12:02.644 INFO Received block headers [001a5c5b, 0aca3e50, 0620a819, 113d1b8f, 14054cba, 0cf9e257, 1bcc7ad3, 0947664b] from 204.48.26.36:13414
Sep 30 12:12:02.644 INFO All known, ignoring
```
I see the procedure was:

1. The process was killed by signal `-9`, which is definitely not proposed.
2. `sync_head: af3ba04e at 117124, reset to: d5ae8ed3 at 114132` means `header_head` is at the fork (not a deep fork, just one block), still OK.
3. `sync_state: sync_status: Initial -> HeaderSync { current_height: 114132, highest_height: 117129 }`, still OK.
4. `sync: locator heights : [114132, ...`, still OK.
5. `sync: locator: [08523a3e, ...`, problem here! Since we already reset `sync_head` to `d5ae8ed3 at 114132`, why do we get `08523a3e` here? I think this is the root cause.

@antiochp could you help to look? I'm lost in this impossible happening:
```rust
fn do_header_sync(
    ...
    debug!(
        LOGGER,
        "sync: initial transition to HeaderSync. sync_head: {} at {}, reset to: {} at {}",
        sync_head.hash(),
        sync_head.height,
        header_head.hash(),
        header_head.height,
    );
    chain.init_sync_head(&header_head).unwrap();
```
The `chain.init_sync_head` above just reset `sync_head` to `d5ae8ed3 at 114132`, but the following `chain.get_sync_head()` still gets `08523a3e at 114132`! How come?!
```rust
fn get_locator(
    ...
    let tip = chain.get_sync_head()?;
```
You "cache" the locator in `history_locators`; is this somehow involved in getting the incorrect block hashes? We're not reading them from the db. Do we read the tip at height x, but have a cached block hash for that height and use it instead?
> You "cache" the locator in history_locators - is this involved somehow in getting the incorrect block hashes?

No. I call `history_locators.clear()` exactly after `chain.init_sync_head()`:

```rust
chain.init_sync_head(&header_head).unwrap();
history_locators.clear();
```

The only possible explanation I can see for this incorrect block hash: database corruption?
:woman_facepalming: Looks like there are some surprises in store in `init_sync_head()`... I don't think it is doing what we thought it was doing. See #1624 to clean up the `init_sync_head()` weirdness.
Gary Yu @garyyu 21:44
I don't understand, what's the difference between `self.store.get_header_head()` and `self.get_header_head()`? @antiochp

Antioch Peverell @antiochp 21:45
`.store` hits the db directly, ignoring the current db tx (which may have uncommitted updates in it). We don't want to do that from within a batch, or we're going to read inconsistent data. But the main issue was that `reset_sync_head()` was ignoring the head passed in if there was already a `header_head` in the db, and just defaulted to using the `header_head` from the db. This is what it used to do: https://github.com/mimblewimble/grin/blob/6ad54b98407d0e80d335498a6848135ed1ea8b30/chain/src/store.rs#L209-L219
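The old behaviour described here can be modelled with a tiny sketch. This is hypothetical and deliberately simplified (heads are just numbers, `Store` is a stand-in, not Grin's actual store code; see the linked store.rs for the real version): the old reset silently ignored the head passed in whenever a `header_head` was already recorded.

```rust
// Minimal model of the bug: the old reset ignored the head passed in
// whenever a header_head was already recorded, defaulting to the stored one.
struct Store {
    header_head: Option<u64>, // stand-in: heads identified by a number
    sync_head: Option<u64>,
}

impl Store {
    // Old, buggy behaviour: prefer the stored header_head over `head`.
    fn reset_sync_head_old(&mut self, head: u64) {
        self.sync_head = Some(self.header_head.unwrap_or(head));
    }

    // Fixed behaviour: use the head we were actually given.
    fn reset_sync_head_fixed(&mut self, head: u64) {
        self.sync_head = Some(head);
    }
}
```

This matches the symptom in the logs: `init_sync_head` appears to reset `sync_head`, yet a subsequent read still returns the old stored head.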
Gary Yu @garyyu 21:50
Oops! Now I understand~

Antioch Peverell @antiochp 21:50
It used to be used exclusively in `chain::init()`, where that made more sense (just on first-time initialization), but now we (recently) started using it in sync.
Looks like another regression. Somehow the header head is inconsistent with the store.

Check: hash `001a5c5b` in https://grinscan.net/block/114133 is correct, so the problem is the header head. I suspect this regression was caused by the recent refactoring since Sep 26.