mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0

sync process simplification #384

Closed · antiochp closed this 6 years ago

antiochp commented 6 years ago

I wonder if we are overcomplicating the sync process, particularly if we are planning to keep the sync running in the background to check for peers appearing with higher total difficulty.

The requirements seem to be pretty simple -

1) ask for more headers from an appropriate peer (most work, or at least more work than us)
2) if we have any headers that we don't have blocks for then ask for those blocks
3) goto 1
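As a sketch, the whole loop could be as small as the following (all names here are hypothetical stand-ins, not the actual grin API):

```rust
use std::{thread, time::Duration};

// Hypothetical stand-ins for node state and peer messaging; not the grin API.
fn best_peer_total_difficulty() -> u64 { 0 }
fn our_total_difficulty() -> u64 { 0 }
fn headers_without_blocks() -> Vec<[u8; 32]> { vec![] }
fn request_headers_from_best_peer() {}
fn request_block(_hash: [u8; 32]) {}

fn sync_loop() {
    loop {
        // 1) ask for more headers from a peer with more work than us
        if best_peer_total_difficulty() > our_total_difficulty() {
            request_headers_from_best_peer();
        }
        // 2) ask for blocks we have headers for but no bodies
        for hash in headers_without_blocks() {
            request_block(hash);
        }
        // 3) goto 1
        thread::sleep(Duration::from_secs(1));
    }
}
```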

I'm also leaning toward "make it work" then "make it fast". The download list + 8 download slots + parallel downloads + retries + initial sync flag add a lot of complexity to this process.

Possible approach (please poke holes in this idea) -

Both these threads (one syncing headers, one syncing block bodies) just keep looping and looking for things to ask for. Neither thread needs to know anything about what the other is doing. Neither needs to stop once initial sync is done.

To prevent us from flooding the network after retrieving old blocks, maybe we suppress the block broadcast if we are processing blocks significantly behind our header chain height (i.e. we are in "sync/catchup" mode).
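A minimal sketch of that test, assuming a height-difference threshold (the value 8 is an arbitrary placeholder, not something decided here):

```rust
// We are in "sync/catchup" mode, and should suppress block broadcasts, while
// the body chain lags well behind the header chain. The threshold of 8 is an
// arbitrary placeholder.
const CATCHUP_THRESHOLD: u64 = 8;

fn suppress_broadcast(header_head_height: u64, body_head_height: u64) -> bool {
    header_head_height.saturating_sub(body_head_height) > CATCHUP_THRESHOLD
}
```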

sesam commented 6 years ago

Parallelisation is probably needed since peers are not guaranteed to be low-latency/high-bandwidth, so it's best to randomly pick a few and hope they average out.

I'd be interested in more speed in exchange for hogging CPU more.

My context: I'm on the latest Testnet1 and have spent 2+ days restarting "grin server run" just trying to catch up with the head of the chain. I started looking at cargo-profiler to see if it could help find surprising wait times that might come from lock starvation, though I'd need to read up on the profiler crate more.

antiochp commented 6 years ago

Agreed - we kind of get the parallelization "for free" because of the way we set the peer connections up. The actual sync process just kicks off requests via our connected peers.

So I think we can drastically simplify the higher level sync process to just loop looking for things we need to go get (via the peers). For headers - do we see a connected peer with higher total difficulty? Yes? Go get headers. For blocks - do we have more headers than blocks? Yes? Go get them.
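Restated as predicates (hypothetical signatures, not the grin API), each loop reduces to one comparison:

```rust
// The header loop's check: does some peer have more total work than us?
fn need_headers(peer_total_difficulty: u64, our_total_difficulty: u64) -> bool {
    peer_total_difficulty > our_total_difficulty
}

// The block loop's check: does our header chain run past our block chain?
fn need_blocks(header_head_height: u64, body_head_height: u64) -> bool {
    header_head_height > body_head_height
}
```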

yeastplume commented 6 years ago

In all honesty, it doesn't feel a million miles off; there just seem to be some annoying errors outstanding that aren't fundamental to anything, they just need some ironing out.

I was digging into issues earlier trying to get to the bottom of a few of them, and, for my own understanding, can I ask why this is happening:

From the logs, trying to sync:

Nov 23 11:48:54.508 DEBG Sync: locator heights: [9105, 9103, 9099, 9091, 9075, 9043, 8979, 8851, 8595, 8083, 7059, 5011, 915, 0]
Nov 23 11:48:54.508 DEBG Sync: locator: [5c4f1eff, 276eb465, 39748b11, 4d19bd21, 2b03e7b3, e17ea64f]
Nov 23 11:48:54.508 DEBG Sync: Asking peer 88.99.251.87:13414 for more block headers, locator: [5c4f1eff, 276eb465, 39748b11, 4d19bd21, 2b03e7b3, e17ea64f]
Nov 23 11:48:54.508 DEBG Requesting block 07a12d22 from peer 88.99.251.87:13414.
Nov 23 11:48:54.508 DEBG Requesting block e655bebc from peer 88.99.251.87:13414.
Nov 23 11:48:54.509 DEBG Requesting block f7062f69 from peer 88.99.251.87:13414.

So we've asked a peer for a list of locators, from the genesis block to 8083... in this case because we didn't have anything higher than 8083 in our local chain (the change I made earlier allows this to work; it was just failing and stopping the syncer earlier).

So it goes and gets them from 5012 (which I already have)...

Nov 23 11:48:54.690 DEBG Received 108 peer addrs, saving.
Nov 23 11:48:54.752 INFO Received 511 block headers
Nov 23 11:48:54.752 DEBG Processing header 1b9fd3ca at 5012.
Nov 23 11:48:54.754 INFO Received unfit block header 1b9fd3ca at 5012: already in store.
Nov 23 11:48:54.754 DEBG Processing header 8819e11d at 5013.
Nov 23 11:48:54.756 INFO Received unfit block header 8819e11d at 5013: already in store.
Nov 23 11:48:54.756 DEBG Processing header 3dd128ac at 5014.
Nov 23 11:48:54.757 INFO Received unfit block header 3dd128ac at 5014: already in store.
Nov 23 11:48:54.758 DEBG Processing header 6f8beaf8 at 5015.
Nov 23 11:48:54.759 INFO Received unfit block header 6f8beaf8 at 5015: already in store.
Nov 23 11:48:54.760 DEBG Processing header 6ed93f15 at 5016.

Until 5289, which I don't...

Nov 23 11:48:55.629 DEBG Processing header 08dc35b5 at 5288.
Nov 23 11:48:55.630 INFO Received unfit block header 08dc35b5 at 5288: already in store.
Nov 23 11:48:55.631 DEBG Processing header b58d620f at 5289.
Nov 23 11:48:55.631 DEBG Validating block with cuckoo size 16
Nov 23 11:48:55.639 DEBG Processing header 79e143fb at 5290.
Nov 23 11:48:55.639 DEBG Validating block with cuckoo size 16
Nov 23 11:48:55.644 DEBG Processing header 2c348180 at 5291.

Then it stops at 5506 (which doesn't correspond to anything in particular in the locator list):

Nov 23 11:48:56.507 DEBG Processing header 0750e1b0 at 5506.
Nov 23 11:48:56.507 DEBG Validating block with cuckoo size 16
Nov 23 11:48:56.511 DEBG Sync: peer 33816467 vs us 29961528
Nov 23 11:48:56.511 DEBG Sync: blocks to download 521, block downloading 6
Nov 23 11:48:56.511 DEBG Sync: locator heights: [9105, 9103, 9099, 9091, 9075, 9043, 8979, 8851, 8595, 8083, 7059, 5011, 915, 0]
Nov 23 11:48:56.511 DEBG Sync: locator: [5c4f1eff, 276eb465, 39748b11, 4d19bd21, 2b03e7b3, e17ea64f]
Nov 23 11:48:56.511 DEBG Sync: Asking peer 88.99.251.87:13414 for more block headers, locator: [5c4f1eff, 276eb465, 39748b11, 4d19bd21, 2b03e7b3, e17ea64f]
Nov 23 11:48:56.511 DEBG Requesting block 35b5bbbe from peer 88.99.251.87:13414.

Now it's asking for the same locators again, and gets the same list again:

Nov 23 11:48:56.741 INFO Received 511 block headers
Nov 23 11:48:56.742 DEBG Processing header 1b9fd3ca at 5012.
Nov 23 11:48:56.744 INFO Received unfit block header 1b9fd3ca at 5012: already in store.
Nov 23 11:48:56.744 DEBG Processing header 8819e11d at 5013.
Nov 23 11:48:56.746 INFO Received unfit block header 8819e11d at 5013: already in store.
Nov 23 11:48:56.746 DEBG Processing header 3dd128ac at 5014.
Nov 23 11:48:56.748 INFO Received unfit block header 3dd128ac at 5014: already in store.

And it does the exact same thing again... apparently the chain has just forgotten all about the headers we just processed past 5289:

Nov 23 11:48:57.584 DEBG Processing header 08dc35b5 at 5288.
Nov 23 11:48:57.585 INFO Received unfit block header 08dc35b5 at 5288: already in store.
Nov 23 11:48:57.585 DEBG Processing header b58d620f at 5289.
Nov 23 11:48:57.585 DEBG Validating block with cuckoo size 16
Nov 23 11:48:57.592 DEBG Processing header 79e143fb at 5290.
Nov 23 11:48:57.592 DEBG Validating block with cuckoo size 16

Is there possibly a synchronisation issue between threads? The chain instance is Arc locked, so in theory there shouldn't be an issue here.

antiochp commented 6 years ago

I think there's an issue in GitHub for the "same locators, stuck in a loop" problem. A peer will only return a max of 512 headers, so there's an edge case where the locator is generated with a gap between hashes of > 512 and we never get enough headers back from the peer to close the gap.
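To make that concrete with the numbers from the logs above: consecutive locator entries sit at heights 5011 and 7059, a gap of 2048. The peer's best match is at 5011, so it returns its maximum of ~512 headers, covering roughly heights 5012-5522 and never reaching 7059. If nothing in that batch advances the header head used to build the locator, the next locator comes out identical, the same batch comes back, and the sync loops forever — exactly the pattern in the logs.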

yeastplume commented 6 years ago

Okay, thanks. But this doesn't sound like an edge case; it would happen on every sync, particularly on the earlier blocks. On another node that was at a higher tip, it seemed to return all of the headers up until the next locator minus one, which made it seem as if there was just an off-by-one error in closing the gap.

yeastplume commented 6 years ago

Indeed, you can see it very plainly when starting a fresh sync. I just think that before we go slashing things and simplifying, what's there really should be working; the reason it isn't working isn't fundamental design issues, just some (hopefully minor, but tricky to track down) issues to iron out :D

yeastplume commented 6 years ago

Ahh, never mind, the spacing of the locators grows as the number of headers you have gets larger, so a gap of more than 512 should indeed be an edge case. Overall sentiment still stands, though :D
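For reference, that doubling spacing is visible in the locator heights logged above; a sketch like the following (not the actual grin code) reproduces the exact list for height 9105:

```rust
// Walk back from the tip in doubling steps (2, 4, 8, ...), always ending at
// genesis. For height 9105 this yields [9105, 9103, 9099, 9091, 9075, 9043,
// 8979, 8851, 8595, 8083, 7059, 5011, 915, 0], matching the log above.
fn locator_heights(height: u64) -> Vec<u64> {
    let mut heights = vec![];
    let mut current = height;
    let mut step = 2u64;
    while current > 0 {
        heights.push(current);
        if current < step {
            break;
        }
        current -= step;
        step *= 2;
    }
    heights.push(0);
    heights
}
```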

yeastplume commented 6 years ago

@antiochp see my further comments on this issue here... unless I'm mistaken, it looks as if in the current state of things we can in many cases only sync headers up to the last block we've actually validated: https://github.com/mimblewimble/grin/issues/373

ignopeverell commented 6 years ago

I'm tempted to agree with @yeastplume here. I think if you look at the details of your "get missing blocks" implementation, there's more complication than it seems. When the header chain is 100k blocks long, you can't easily keep track of all "holes". And you have to have retries anyway.

The only conceptual difference I see from what you propose is that the body download has a window, which makes it simpler to track where you're at in catching up with the header chain.
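A sketch of what that window could look like (hypothetical names; the window size just echoes the "8 download slots" mentioned earlier):

```rust
// Windowed body download: rather than tracking every "hole" in a 100k-block
// header chain, only ever request the next WINDOW blocks past the body head.
// Re-running this on each loop iteration gives retries for free.
const WINDOW: u64 = 8;

fn next_block_heights(body_head: u64, header_head: u64) -> impl Iterator<Item = u64> {
    (body_head + 1)..=header_head.min(body_head + WINDOW)
}
```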

sesam commented 6 years ago

@antiochp Maybe accepted_orphans is not used (needed?) by simple_sync? Naive question perhaps - asking since I just tried to cherry-pick + build on top of Testnet, and got a warning about accepted_orphans being declared mutable but never updated.

antiochp commented 6 years ago

Proposed impl - #389

yeastplume commented 6 years ago

Just posting here to avoid creating further issues on sync. This is on the testnet branch (not the testnet_ban branch, which seems to just eliminate all peers until it settles on some peer's aborted fork in this instance).

The peer banning logic appears to be correct; however, it doesn't rectify this issue, in which instances of a certain block were created with double-spends (the kind we fixed earlier with the sumtree fixes).

Note receiving the bad block e3d below, from 165.227:

Nov 28 14:18:02.470 DEBG Requesting block e3d414f3 from peer 165.227.63.166:13414.
.
Nov 28 14:18:03.340 DEBG Received block e3d414f3 at 10088 from network, going to process.
Nov 28 14:18:03.341 DEBG pipe: process_block e3d414f3 at 10088 with 448 inputs and 3 outputs.
Nov 28 14:18:03.341 DEBG pipe: validate_header cuckoo size 16
Nov 28 14:18:03.344 DEBG Starting new sumtree extension.
Nov 28 14:18:03.470 DEBG block_sync: loop
Nov 28 14:18:03.543 DEBG Error returned, discarding sumtree extension.
Nov 28 14:18:03.543 INFO Rejected block e3d414f3 at 10088: AlreadySpent
Nov 28 14:18:03.543 DEBG Block e3d414f3 refused by chain: AlreadySpent

Then after a restart, we request the same block from the same peer, and all is well:

Nov 28 14:18:11.338 INFO Starting the Grin server...
.
Nov 28 14:18:17.904 DEBG Requesting block e3d414f3 from peer 165.227.63.166:13414.
.
Nov 28 14:18:18.234 DEBG pipe: process_block e3d414f3 at 10088 with 448 inputs and 3 outputs.
.
Nov 28 14:18:19.188 DEBG pipe: proces_block e3d414f3 at 10088 is valid, save and append.
Nov 28 14:18:19.224 INFO Updated head to e3d414f3 at 10088.
Nov 28 14:18:19.225 DEBG Committing sumtree extension.
Nov 28 14:18:19.242 DEBG Sumtree extension done.

And on we go? I'm pretty sure this isn't a peer banning issue, but something else (that's also causing the ban version to choke very hard on this block).

yeastplume commented 6 years ago

Oooh... if I'm correct this is nasty... is this perhaps related to this:

https://github.com/mimblewimble/grin/issues/371?

If so, that block should always be rejected as a double-spend... the only reason it's being accepted the second time is that the sumtree length is being initialised incorrectly, which seems to correlate with the fix to sumtree lengths we made earlier.

If this is true... then everyone's blockchain is actually invalid, and the _ban version is doing the right thing by rejecting everyone's chain and considering itself the one true chain...

the drama!

yeastplume commented 6 years ago

Yes, and the _ban update rolls back the head and updates the position correctly, so it keeps rejecting the double-spend block... the bad news is that most of the network is invalid; the good news is that grin appears to be working correctly.

From the ban version (without restart):

Nov 28 14:57:50.002 INFO Rejected block f697a877 at 4689: AlreadySpent
Nov 28 14:57:50.002 DEBG Block f697a877 refused by chain: AlreadySpent
Nov 28 14:57:50.003 DEBG saving peer to store PeerData { addr: V4(165.227.63.166:13414), capabilities: FULL_HIST | UTXO_HIST | PEER_LIST | FULL_NODE, user_agent: "MW/Grin 0.1", flags: Banned }
Nov 28 14:57:50.003 DEBG Banning peer 165.227.63.166:13414
.
.
Nov 28 15:02:50.188 DEBG Requesting block 5fb983b9 from peer 109.242.124.22:13414.
.
Nov 28 15:02:50.664 DEBG Received block 5fb983b9 at 4689 from network, going to process.
Nov 28 15:02:50.664 DEBG pipe: process_block 5fb983b9 at 4689 with 0 inputs and 1 outputs.
Nov 28 15:02:50.665 DEBG pipe: validate_header cuckoo size 16
Nov 28 15:02:50.666 DEBG Starting new sumtree extension.
Nov 28 15:02:50.697 TRCE Hashing with additional data
Nov 28 15:02:50.700 DEBG pipe: proces_block 5fb983b9 at 4689 is valid, save and append.
Nov 28 15:02:50.700 INFO Updated head to 5fb983b9 at 4689.

109.242 is correct, everyone else is wrong.

ignopeverell commented 6 years ago

Correct, as long as nodes keep sending f697a877, they'll get banned. And that's the correct behavior. Now there seem to be a few valid branches with an alternate 4689, and it looks like I had one node on the _ban branch that still ended up accepting f697a877. Looks like there are still a couple of issues (maybe #371 as you supposed), but getting there!

yeastplume commented 6 years ago

I think it's still possible to get past 4689 on the ban branch as well if you restart the node at the right (or wrong) time... the fix (I think) is really #371: ensure a persisted sumtree always has its true length persisted alongside it somehow.
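For what it's worth, the shape of that fix might look something like this (a hypothetical sketch of the #371 idea, not the actual change):

```rust
// Persist the sumtree length at every commit; on startup, trust only data up
// to the committed length, so bytes written by a discarded extension can
// never resurface as valid tree data after a restart.
fn usable_sumtree_len(file_len: u64, committed_len: u64) -> u64 {
    file_len.min(committed_len)
}
```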