mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 989 forks source link

Allow restarting grin during a sync, and avoid loosing a lot of time (work) #755

Closed sesam closed 6 years ago

sesam commented 6 years ago

Grin forgets sync/header head progress between restarts. If we instead remembered such progress between restarts, long syncs would be easier to complete.

Currently grin always starts with header and sync set to the chain's body head. Instead heads could be saved in rocksdb periodically during sync.

Once sync is completed, saved header/sync should be removed, to make sure that on next restart we fall back to default behavoir of starting from the chain's body head.

antiochp commented 6 years ago

We definitely already store both heads in the rocksdb index. These are the prefixes we use to do this - https://github.com/mimblewimble/grin/blob/a9d1b76414f981ec2543ba321c0201afccc12150/chain/src/store.rs#L32-L34

I'm also pretty sure we maintain these across restarts. At least the header_head. Maybe we reset the sync head on restart (I don't recall)?

Are we seeing a slow header sync?

sesam commented 6 years ago

Nice! I'm not sure how I should debug this:

Mar 08 15:00:19.145 INFO header head 517652591 @ 76500 [04f86ea6]
Mar 08 15:00:19.145 INFO sync head 517652591 @ 76500 [04f86ea6]
Mar 08 15:00:19.305 INFO header head 517738443 @ 76600 [2a65c8e3]
Mar 08 15:00:19.305 INFO sync head 517738443 @ 76600 [2a65c8e3]
Mar 08 15:00:19.377 INFO header head 517772458 @ 76700 [aa0539cc]
Mar 08 15:00:19.377 INFO sync head 517772458 @ 76700 [aa0539cc]
Mar 08 15:00:19.398 INFO Added 511 headers to the header chain.
Mar 08 15:00:19.427 INFO Client 139.59.105.169:13414 connection lost: Connection(Error { repr: Custom(Custom { kind: UnexpectedEof, error: StringError("early eof") }) })
^C

LAPTOP:t1-profile-less s$ ./grin
Mar 08 15:00:25.987 INFO Using configuration file at: ~/mw/t1-profile-less/grin.toml
Mar 08 15:00:25.989 INFO Starting the Grin server from configuration file at ~/mw/t1-profile-less/grin.toml
Mar 08 15:00:26.029 INFO Chain init: Tip { height: 75724, last_block_h: b32ed040, prev_block_h: 3bd4f35a, total_difficulty: Difficulty { num: 517431947 } }
Mar 08 15:00:26.096 WARN P2P server started on 127.0.0.1:15414
Mar 08 15:00:26.096 INFO Starting rest apis at: 127.0.0.1:15413
Mar 08 15:00:26.096 WARN Grin server started.
Mar 08 15:00:26.097 INFO Starting HTTP API server at 127.0.0.1:15413.
Mar 08 15:00:32.776 INFO Client ... connection lost: Timeout
Mar 08 15:00:43.973 INFO Client .... connection lost: Connection(Error { repr: Custom(Custom { kind: UnexpectedEof, error: StringError("early eof") }) })
Mar 08 15:00:58.207 INFO Received block headers [bc8938c0, 14d0c9c7, 95133ddb, 514133d8, ....
Mar 08 15:00:58.258 INFO header head 517442624 @ 75800 [71b458f9]
Mar 08 15:00:58.258 INFO sync head 517442624 @ 75800 [71b458f9]
Mar 08 15:00:58.325 INFO header head 517468077 @ 75900 [54714a88]
Mar 08 15:00:58.326 INFO sync head 517468077 @ 75900 [54714a88]
Mar 08 15:00:58.402 INFO header head 517545016 @ 76000 [0db2c381]

After stopping the server, the head jumped back to where my body chain's head was at.

sesam commented 6 years ago

I added debug output on the functions, and the sync and header head get overwritten during start:

Mar 08 16:50:38.457 INFO Starting the Grin server from configuration file at ~/mw/t1-profile-less/grin.toml Mar 08 16:50:38.563 INFO save header head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } } Mar 08 16:50:38.564 INFO save sync head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } } Mar 08 16:50:38.564 INFO Chain init: Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } } Mar 08 16:50:38.637 WARN P2P server started on 127.0.0.1:15414

sesam commented 6 years ago

With a bit more outputs:

Mar 08 19:56:15.079 WARN Grin server started.
Mar 08 19:56:45.079 INFO get header head Ok(Tip { height: 76241, last_block_h: 3ebc25c5, prev_block_h: 503c538d, total_difficulty: Difficulty { num: 517578886 } })
Mar 08 19:56:45.079 INFO get sync head Ok(Tip { height: 76241, last_block_h: 3ebc25c5, prev_block_h: 503c538d, total_difficulty: Difficulty { num: 517578886 } })
Mar 08 19:56:46.944 INFO get header head Ok(Tip { height: 76241, last_block_h: 3ebc25c5, prev_block_h: 503c538d, total_difficulty: Difficulty { num: 517578886 } })
Mar 08 19:56:46.944 INFO get sync head Ok(Tip { height: 76241, last_block_h: 3ebc25c5, prev_block_h: 503c538d, total_difficulty: Difficulty { num: 517578886 } })

**Mar 08 19:56:47.076 INFO Rejected block 03f0cca5 at 75731: InvalidRoot
Mar 08 19:56:47.076 INFO save header head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } }
Mar 08 19:56:47.076 INFO save sync head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } }**

Mar 08 19:56:50.970 INFO get header head Ok(Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } })
Mar 08 19:56:50.970 INFO get sync head Ok(Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } })

More detailed around the InvalidRoot:

Mar 08 20:00:06.123 DEBG validate_block: utxo roots - ba89bfc0, ba89bfc0
Mar 08 20:00:06.123 DEBG validate_block: rproof roots - 918f1852, 918f1852
Mar 08 20:00:06.123 DEBG validate_block: kernel roots - 0f33d2c7, 1ac1202a
Mar 08 20:00:06.123 DEBG Error returned, discarding sumtree extension.
Mar 08 20:00:06.123 INFO Rejected block 03f0cca5 at 75731: InvalidRoot
Mar 08 20:00:06.124 DEBG Block 03f0cca5 refused by chain: InvalidRoot
**Mar 08 20:00:06.124 DEBG block_received: 03f0cca5 is a bad block, resetting head
Mar 08 20:00:06.124 INFO save header head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } }
Mar 08 20:00:06.124 INFO save sync head Tip { height: 75730, last_block_h: 64789d2c, prev_block_h: 00e8a95f, total_difficulty: Difficulty { num: 517433483 } }**
Mar 08 20:00:06.125 DEBG save_peer: V4(188.68.***.***:13414) marked Banned

So as long as we have at least one node that isn't yet banned and that sends out bad blocks, the header head and sync head will keep resetting.

ignopeverell commented 6 years ago

On restart, I think it's a fair assumption to think a non-negligible amount of time has passed and so the header chain could have changed (or the peer you downloaded from could be gone). So I think resetting the sync head is fair. For the header head I'm less certain, maybe @antiochp would remember why it was reset as well.

sesam commented 6 years ago

Thanks. Meanwhile, I'm trying to sync up my node, and maybe the root cause is elsewhere, but on each restart waiting for sync to go from 75k and up (or not; and just fail) was wearing on me.

I get "Rejected block 03f0cca5 at 75731: InvalidRoot" and that bumps me back down to the body chain's tip. I'm testing some code that would bump down just 10 blocks at a time, to avoid hostile fake block spamming from preventing all nodes from ever syncing up, but no luck (or not enough skill:)) so far.

antiochp commented 6 years ago

There was a reason why we reset the header head on restart. But I don't remember what it is... And its possible the reason is no longer valid.

I think in the presence of a couple of forks on testnet1, not resetting the header head was causing problems with the head jumping around between conflicting forks. By resetting the header head we forced the node to go find the "most work peer" and trust that that was where we wanted to sync to.

Writing this down - the explanation above does not seem 100% convincing - so maybe the best approach is to modify the functionality here and see what happens on testnet1?

Testing this is definitely complicated by the fact that this logic is only really exercised when we have more than one long-lived fork conflicting with the "real" head.

sesam commented 6 years ago

Thank you. It's possible that just having more non-NATed fully synced nodes around would completely solve this, and I understand why it's dangerous to not reset sync & header to chain tip, but maybe the reset could be further conditioned so we don't hit is as often. After we get abbrev. sync and if it's possible to manually add a peer, this issue might become moot.