quentinlesceller closed this issue 6 years ago
Looks like an issue in the generation of the header locator, possibly a regression after @garyyu's optimization? Taking a few lines out of the logs (didn't know you could do 22MB attachments on GitHub, btw):
Sep 13 14:32:40.936 DEBG sync_state: sync_status: BodySync { current_height: 93760, highest_height: 93766 } -> HeaderSync { current_height: 93760, highest_height: 93766 }
So the node is at 93760 and needs to get 6 blocks to sync up to 93766. However the generated locator looks like this:
Sep 13 14:32:40.934 DEBG sync: locator heights': [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.934 DEBG sync: history locators: len=17, shrunk=0
Sep 13 14:32:40.935 DEBG sync: locator: [014a6cc3, 082b95aa, 07588dd0, 04002d2b, 006f15f9, 010464cb, 0337b8f3, 0216ed75, 029d2984, 01ab762a, 0af59373, 0476c5b7, 08806be1, 1527d2ff, 12a566dd, 0e418688, 006642e0]
Instead of starting at 93760, the locator starts at 93130, which is too far back. So other nodes send us headers we already have. There could be another issue on top of this, as the logs already show a lot of:
Sep 13 14:32:51.000 INFO All known, ignoring
Normally, even if we ask starting at 93130, we should still receive the next 636 headers; 630 of them would get discarded but the last 6 would help. Unfortunately grinscan.net looks to be down at the moment, which complicates looking up the short hashes from the logs.
It looks like something is wrong with the sync head:
Sep 13 14:32:39.885 DEBG sync: locator heights : [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.934 DEBG sync: locator heights': [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
https://github.com/mimblewimble/grin/blob/master/servers/src/grin/sync.rs#L519-L520
let tip = chain.get_sync_head()?;
let heights = get_locator_heights(tip.height); <<<< 1st locator is the Tip
The highest_height is 93766, but the tip from chain.get_sync_head() is at 93130.
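For context, the locator heights in the logs follow an exponential back-off from the tip (gaps of 2, 4, 8, ... back to genesis). A minimal sketch of that pattern (an assumed reconstruction, not the exact grin code):

```rust
// Assumed reconstruction of the exponential back-off behind the locator
// heights seen in the logs (NOT the exact grin implementation): starting
// from the tip, each entry steps back twice as far as the previous one
// (2, 4, 8, ...), ending at the genesis height 0.
fn get_locator_heights(height: u64) -> Vec<u64> {
    let mut heights = vec![height];
    let mut current = height;
    let mut step = 2u64;
    while current > step {
        current -= step;
        heights.push(current);
        step *= 2;
    }
    heights.push(0);
    heights
}
```

Feeding the stale sync head 93130 into this sketch reproduces the 17-entry list from the logs exactly, so the locator construction itself looks fine; only its starting point (the sync head) is wrong.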
Will look into the log details today.
The only way this can happen is if the sync head in the database is indeed at height 93130!
@quentinlesceller If the state is still around, or anyone sees this problem again, please execute the following command and attach the output together with the grin-server.log, thanks~
mdb_dump -a ~/.grin/chain_data/lmdb | awk '/^ 73$/{nr[NR]; nr[NR+1]}; NR in nr'
If you don't have the mdb_dump command, please install it with brew install mdb_dump or something like that.
Uploaded a full grin-server.log from earlier today, where it happened: http://lesceller.com/misc/grin-server.log.
Thanks for the full log. Root cause is clear:
Sep 13 03:45:18.005 DEBG sync_state: sync_status: BodySync { current_height: 93125, highest_height: 93130 } -> NoSync
...
Sep 13 03:47:14.978 DEBG Received block 0843d35d at 93131 from 109.74.202.16:13414, inputs: 0, outputs: 1, kernels: 1, going to process.
Sep 13 03:47:14.979 DEBG pipe: process_block 0843d35d at 93131 with 0 inputs, 1 outputs, 1 kernels
Sep 13 03:47:15.007 DEBG pipe: chain head 0843d35d @ 93131
...
<same for 93133 to 93766, from the peers block broadcasting>
...
Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130
Sep 13 14:32:39.885 DEBG sync: locator heights : [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.977 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130
In the NoSync state, we got the blocks from the peers' broadcasting and updated the header head (but not the sync head, I guess; need to confirm later). Later, NoSync switched to HeaderSync because highest_height - current_height >= 6, and then the problem happens: the sync head is still at 93130, but we actually already have all the headers from 93131 to 93760!
Let me think about how to fix this.
Mmmh yes, this isn't an easy fix.
I want to fix it with this solution:
In sync.rs, allow header_sync() to always execute regardless of the SyncStatus state, even in the NoSync state.
In pipe.rs, in the update_head() function, remove the else {} part:
if ctx.opts.contains(Options::SYNC) {
    batch
        .save_body_head(&tip)
        .map_err(|e| ErrorKind::StoreErr(e, "pipe save body".to_owned()))?;
} else { // <<<< Here! Removing the `else {}`
    batch
        .save_head(&tip)
        .map_err(|e| ErrorKind::StoreErr(e, "pipe save head".to_owned()))?;
}
Does this make sense? @ignopeverell
Update:
Removing the else {} part of the update_head() function is a bad idea, because it would mean redundantly downloading headers that have already been downloaded (inside the blocks).
I changed to another solution: allow headers_received() to update the sync head even when all received headers are "All known, ignoring". Then we can still keep the sync head with its original meaning, and always that meaning.
The side effect (perhaps not a real side effect) is: in the NoSync state, when we're getting blocks from peers' block broadcasting instead of from syncing, the header head will grow but the sync head will lag behind, untouched, until a new sync is started, new headers are received, and the sync head is updated again.
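A minimal sketch of that idea (assumed types and names, not the literal diff in PR https://github.com/mimblewimble/grin/pull/1531): advance the sync head past an all-known batch instead of leaving it stale.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tip {
    height: u64,
    hash: u64,
}

// Sketch: even when every header in the received batch is already known
// ("All known, ignoring"), move sync_head forward to the last header in
// the batch so the next locator request starts from there instead of the
// stale height.
fn update_sync_head_on_batch(sync_head: &mut Tip, batch: &[Tip]) {
    if let Some(last) = batch.last() {
        if last.height > sync_head.height {
            *sync_head = *last;
        }
    }
}
```

With the heights from this issue, a batch ending at 93760 would pull the sync head up from 93130 even though every header in the batch was already known.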
Please see the code detail in PR https://github.com/mimblewimble/grin/pull/1531
Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130
So yeah, this isn't a regression, this is actually a known issue that we've struggled with in the past.
During header sync we get headers in chunks of ~500 headers at a time.
And if sync_head gets more than 500 headers out of date (and does not get reset), then we get stuck in a state where we have already seen all 500 headers and nothing progresses in the sync.
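The stuck condition can be stated concretely (assuming a chunk size of 512 headers; the exact grin constant may differ):

```rust
// Hypothetical illustration of the stuck condition: if sync_head lags
// header_head by more than one header chunk, every header a peer returns
// for our locator is already known, so sync_head never advances.
fn header_sync_stuck(sync_head_height: u64, header_head_height: u64, chunk_size: u64) -> bool {
    header_head_height.saturating_sub(sync_head_height) > chunk_size
}
```

In this issue, the gap is 93760 - 93130 = 630 headers, more than one chunk, which matches the repeated "All known, ignoring" lines in the logs.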
Is one possible solution not to simply reset sync_head to match header_head when we transition from NoSync -> HeaderSync?
Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }
So that we see something along the lines of -
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 004adb1d, 93760
Instead of (out-of-date sync_head) -
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130
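A minimal sketch of that reset (assumed names; the real change would go wherever sync.rs makes the NoSync -> HeaderSync transition):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tip {
    height: u64,
    hash: u64,
}

// On entering HeaderSync, discard a stale sync_head and restart the
// locator from the header chain we already have.
fn on_enter_header_sync(sync_head: &mut Tip, header_head: Tip) {
    if sync_head.height < header_head.height {
        *sync_head = header_head;
    }
}
```

Applied to this issue, the sync head would jump from 93130 to 93760 at the transition, so the first locator entry is the real header tip.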
Is one possible solution not to simply reset sync_head to match header_head when we transition from NoSync -> HeaderSync?
Yes, indeed it's a simpler and more efficient solution to this problem, and it avoids the duplicated downloading of all these all-known headers. But are you really sure we can do it like this?
I thought of it before, but I switched to the current solution in PR #1531 because I'm not sure it's good to mix the meaning of sync_head with header_head. That's why I let it go ahead and request the headers, then check the received headers to decide whether we can update sync_head. Maybe that is a safer solution?
Oh, it looks like the difference in safety between the two solutions is trivial. For example, in this #1524 case, sync_head is 93130 and header_head is 93760.
With sync_head at 93130, we will send a headers request with locators: [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]. With sync_head at 93760, we will send the request with locators: [93760, (-2), (-4), (-8), ..., 93250, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0] (not exact, but something like this).
Let's go with @antiochp's solution. Will revise the code in PR #1531.
Thanks for the code review and discussion 👍
Closed by #1531
Attaching the
grin-server.log