mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 991 forks source link

Node stuck in sync loop #1524

Closed quentinlesceller closed 6 years ago

quentinlesceller commented 6 years ago

Attaching the grin-server.log grin-server.log

ignopeverell commented 6 years ago

Looks like an issue in the generation of the header locator, possible a regression after @garyyu optimization? Just taking a few lines out of the logs (didn't know you could do 22MB attachments on github btw):

Sep 13 14:32:40.936 DEBG sync_state: sync_status: BodySync { current_height: 93760, highest_height: 93766 } -> HeaderSync { current_height: 93760, highest_height: 93766 }

So the node is at 93760 and needs to get 6 blocks to sync up to 93766. However the generated locator looks like this:

Sep 13 14:32:40.934 DEBG sync: locator heights': [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.934 DEBG sync: history locators: len=17, shrunk=0
Sep 13 14:32:40.935 DEBG sync: locator: [014a6cc3, 082b95aa, 07588dd0, 04002d2b, 006f15f9, 010464cb, 0337b8f3, 0216ed75, 029d2984, 01ab762a, 0af59373, 0476c5b7, 08806be1, 1527d2ff, 12a566dd, 0e418688, 006642e0]

Instead of starting at 93760, the locator starts at 93130, which is too far back. So other nodes send us headers we already have. There could be another issue in addition as we logs already show a lot of:

Sep 13 14:32:51.000 INFO All known, ignoring

Normally, even if we ask at 93130, we should still receive the next 36 headers, 30 of them would get discarded but the last 6 would help. Unfortunately grinscan.net looks down at the moment which complicates lookup by short hashes from the logs.

garyyu commented 6 years ago

This looks like something is wrong in the sync head:

Sep 13 14:32:39.885 DEBG sync: locator heights : [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.934 DEBG sync: locator heights': [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]

https://github.com/mimblewimble/grin/blob/master/servers/src/grin/sync.rs#L519-L520

    let tip = chain.get_sync_head()?;
    let heights = get_locator_heights(tip.height);    <<<< 1st locator is the Tip

The highest_height is 93766 but the let tip = chain.get_sync_head() is 93130.

will look into the logs detail today.

garyyu commented 6 years ago

The only possibility of this happening is the sync head in the database indeed is at height 93130!

@quentinlesceller If the scene is still kept, or anyone see this problem again, please execute the following command and attach the output and the grin-server.log together, thanks~

mdb_dump -a ~/.grin/chain_data/lmdb | awk '/^ 73$/{nr[NR]; nr[NR+1]}; NR in nr'

If don't have mdb_dump command, please install it by brew install mdb_dump or something like that.

quentinlesceller commented 6 years ago

Uploaded a full grin-server.log where it happened just earlier today http://lesceller.com/misc/grin-server.log.

garyyu commented 6 years ago

Thanks for the full log. Root cause is clear:

Sep 13 03:45:18.005 DEBG sync_state: sync_status: BodySync { current_height: 93125, highest_height: 93130 } -> NoSync
...
Sep 13 03:47:14.978 DEBG Received block 0843d35d at 93131 from 109.74.202.16:13414, inputs: 0, outputs: 1, kernels: 1, going to process.
Sep 13 03:47:14.979 DEBG pipe: process_block 0843d35d at 93131 with 0 inputs, 1 outputs, 1 kernels
Sep 13 03:47:15.007 DEBG pipe: chain head 0843d35d @ 93131
...
<same for 93133 to 93766, from the peers block broadcasting>
...
Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130

Sep 13 14:32:39.885 DEBG sync: locator heights : [93130, 93128, 93124, 93116, 93100, 93068, 93004, 92876, 92620, 92108, 91084, 89036, 84940, 76748, 60364, 27596, 0]
Sep 13 14:32:40.977 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130

On the NoSync state, we got the blocks from the peers' broadcasting, and update header head (but no update the sync head I guess, need confirm later). Later, somehow the NoSync state switch to HeaderSync because highest_height - current_height >= 6, then problem happen! the sync head is still at 93130 but actually we already have all the headers from 93131 to 93760!

Let me think about how to fix this.

ignopeverell commented 6 years ago

Mmmh yes, this isn't an easy fix.

garyyu commented 6 years ago

I want to fix it by this solution:

  1. In sync.rs, allow header_sync() always execute regardless the state of SyncStatus, even on NoSync state
  2. In pipe.rs, update_head() function, removing the else{} part.
        if ctx.opts.contains(Options::SYNC) {
            batch
                .save_body_head(&tip)
                .map_err(|e| ErrorKind::StoreErr(e, "pipe save body".to_owned()))?;
        } else {          <<<< Here!  Removing the `else{}`
            batch
                .save_head(&tip)
                .map_err(|e| ErrorKind::StoreErr(e, "pipe save head".to_owned()))?;
        }

Does this make sense? @ignopeverell

garyyu commented 6 years ago

Update: Removing the else{} part of update_head() function is a bad idea! Because that means a redundant downloading of the Headers even they are already downloaded (in the block).

I change to another solution: allow the headers_received() to update the sync head even all received headers are "All known, ignoring". Then we still can keep the sync head in the original meaning and always that meaning.

The side effect (could not real side effect) is: on NoSync state, when we're getting blocks by peers block broadcasting instead of from syncing, the header head will grow, but the sync head will leave behind and no touched. Until a new syncing is started, new Headers received and sync head will be updated again.

Please see the code detail in PR https://github.com/mimblewimble/grin/pull/1531

antiochp commented 6 years ago
Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }
Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130

So yeah, this isn't a regression, this is actually a known issue that we've struggled with in the past. During header sync we get headers in chunks of ~500 headers at a time. And if sync_head gets more than 500 headers out of date (and does not get reset) then we get stuck in a state where we have seen all 500 headers and nothing progresses in the sync.

Is one possible solution not to simply reset sync_head to match header_head when we transition from NoSync -> HeaderSync?

Sep 13 14:32:30.968 DEBG sync_state: sync_status: NoSync -> HeaderSync { current_height: 93760, highest_height: 93766 }

So that we see something along the lines of -

Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 004adb1d, 93760

Instead of (out of date sync_head) -

Sep 13 14:32:35.969 DEBG body_sync: body_head - 004adb1d, 93760, header_head - 004adb1d, 93760, sync_head - 014a6cc3, 93130
garyyu commented 6 years ago

Is one possible solution not to simply reset sync_head to match header_head when we transition from NoSync -> HeaderSync?

Yes, indeed it's more simple and efficient solution to fix this problem, and it can avoid the duplicated downloading of all these all known headers. But are you really sure we can do like this?

I thought it before but I switched to current solution in PR #1531. Because I'm not sure it's good to mix the meaning of sync_head with header_head. That's why I let it go to request the Headers, and check the received Headers to decide if we can update sync_head. Maybe this is more safe solution?

garyyu commented 6 years ago

Oh, look like the difference is trivial in the safety for both solutions:

For example in this #1524 case, the sync_head is 93130 and header_head is 93760.

Let's go with @antiochp 's solution. Will revise the code in PR #1531.

Thanks for the code review and discussion 👍

garyyu commented 6 years ago

Closed by #1531