mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 991 forks source link

node falls behind and needs restart to catch up #1581

Closed bladedoyle closed 6 years ago

bladedoyle commented 6 years ago

My archive node often gets "stuck". Restarting the grin server process fixes it. No ERRO messages in the log. But there is this which shows the problem (see current_height vs highest height):

`Sep 23 02:40:01.178 DEBG body_sync: body_head - 0e286867, 106953, header_head - 0ae10456, 106950, sync_head - 0e286867, 106953

Sep 23 02:40:01.179 DEBG sync_state: sync_status: BodySync { current_height: 106953, highest_height: 107266 } -> BodySync { current_height: 106953, highest_height: 107267 }

Sep 23 02:40:06.186 DEBG body_sync: body_head - 0e286867, 106953, header_head - 0ae10456, 106950, sync_head - 0e286867, 106953`

and

`Sep 23 02:39:58.565 DEBG pipe: process_block_header at 107267 [0787e127]

Sep 23 02:39:58.565 DEBG Block header 0787e127 refused by chain: Orphan`

garyyu commented 6 years ago

Interesting! looks still has bug here. Could you give the full log file?

garyyu commented 6 years ago

Look like I just reproduce it:

Sep 23 11:31:45.537 DEBG sync_state: sync_status: BodySync { current_height: 105911, highest_height: 107339 } -> HeaderSync { current_height: 107338, highest_height: 107339 }
...
Sep 23 11:32:14.867 DEBG sync_state: sync_status: BodySync { current_height: 107293, highest_height: 107339 } -> NoSync
...
(almost 19 minutes later)
...
Sep 23 13:51:03.285 INFO sync: total_difficulty 46767414, peer_difficulty 46769073, threshold 1634 (last 5 blocks), enabling sync
Sep 23 13:51:03.285 DEBG body_sync: body_head - 0b9c81e5, 107494, header_head - 0b9c81e5, 107494, sync_head - 042bc8c6, 107339
Sep 23 13:51:03.285 DEBG sync_state: sync_status: NoSync -> BodySync { current_height: 107494, highest_height: 107499 }
Sep 23 13:51:05.287 DEBG sync_state: sync_status: BodySync { current_height: 107494, highest_height: 107499 } -> HeaderSync { current_height: 107494, highest_height: 107499 }
Sep 23 13:51:05.290 DEBG sync: locator heights : [107339, 107337, 107333, 107325, 107309, 107277, 107213, 107085, 106829, 106317, 105293, 103245, 99149, 90957, 74573, 41805, 0]
    <<<< problem here!  still using sync_head at 107339 instead of header_head 107494

Then, the remaining thing is almost same as https://github.com/mimblewimble/grin/issues/1524, all received headers are "All known, ignoring".

The root cause is: We will reset sync_head to header_head on initial transition to HeaderSync, but somehow when we transit from NoSync, the next first state is BodySync instead of HeaderSync!

Sep 23 13:51:03.285 DEBG sync_state: sync_status: NoSync -> BodySync { current_height: 107494, highest_height: 107499 }

Looking into function header_sync_due(), there was a condition:

        let all_headers_received =
            header_head.height >= prev_height + (p2p::MAX_BLOCK_HEADERS as u64) - 4;

        let stalling = header_head.height == latest_height && now > timeout;
        if all_headers_received || stalling {
            ...
            true
        }

Which make this header_sync_due() definitely return false IF all_headers_received is false, when enabling sync. That's why when we transit from NoSync the next first state is BodySync instead of HeaderSync!

A fix solution is: always enable header sync on initial state transition from NoSync / Initial. will give a PR on this fixing.

garyyu commented 6 years ago

Closed by #1582