Closed mvandeberg closed 2 years ago
For posterity I'll post some of the investigation that led to this patch.
The investigation started when some nodes stopped getting new blocks after about 400,000 - 600,000 blocks. The logs still show `Pushing transaction` and `Transaction application successful` messages, but after the last successful block there are no more messages about blocks.

I eventually realized this behavior is consistent with a problem in the block applicator. As of this writing the block applicator is (1) fairly new and (2) recently had its plumbing modified. This didn't yet explain the problem, but it did suggest that further investigation of the block applicator might be fruitful.
Then I saw the error channel was initialized without a capacity, which is very suspicious, because usually concurrent code needs the semantics of a queue, not a rendezvous. (I would go so far as to suggest amending our Go coding standards to say that any code that initializes a channel without a capacity should fail review, unless it has a comment carefully explaining why it's correct to use rendezvous in this specific situation.)
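To make the queue-vs-rendezvous distinction concrete, here is a minimal sketch (not code from this repository) contrasting the two channel semantics:

```go
package main

import "fmt"

func main() {
	// Unbuffered channel: a send blocks until a receiver is ready
	// (rendezvous semantics). Sending with no receiver in place
	// would deadlock, so a receiver must be arranged first.
	unbuffered := make(chan int)
	go func() { unbuffered <- 1 }() // blocks until main receives
	fmt.Println(<-unbuffered)       // 1

	// Buffered channel with capacity 1: the send completes
	// immediately even though no receiver exists yet (queue semantics).
	buffered := make(chan int, 1)
	buffered <- 2
	fmt.Println(<-buffered) // 2
}
```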
Once the potential problem line was found, the hard part was over. It was comparatively simple to reason through the consequences and confirm the diagnosis is consistent with the symptoms:
1. In `requestApplication()`, the `errChan` is initialized without a size, which according to Go semantics means it blocks anyone trying to send / receive until a matching operation comes from the other side (rendezvous semantics).
2. The requester then does a `select` on `errChan` and `ctx.Done()`. So in the (rare, but not impossible) event a peer loses connection or gets terminated for naughty points while it has a block application in-flight, the `select` gives up on waiting for the error and instead exits on the `ctx.Done()` branch.
3. This causes the `applyBlockChan` drain loop to hang at `request.errChan <- err`, as nobody is ever going to attempt to receive anything on that `errChan` again.

At this point, we can pretty confidently declare we've found the bug: a rare but not impossible event putting p2p into a state where it sees timeouts on all subsequent blocks, but those blocks never hit the chain, is consistent with the above explanation; with the observed symptoms in the logs; and with the fact that affected nodes processed hundreds of thousands of blocks successfully before the problem appeared, and got stuck at different places.
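The failure mode above can be reproduced in miniature. This is a hypothetical sketch, not the actual repository code; the type and function names are invented for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical request shape: one result is reported per request.
type blockApplicationRequest struct {
	errChan chan error
}

// requestApplication hands a request to the drain loop and waits for
// the result, giving up when the context is cancelled.
func requestApplication(ctx context.Context, applyBlockChan chan *blockApplicationRequest) error {
	req := &blockApplicationRequest{errChan: make(chan error)} // BUG: unbuffered
	applyBlockChan <- req
	select {
	case err := <-req.errChan:
		return err
	case <-ctx.Done():
		// The requester exits here; nobody will ever receive from
		// req.errChan again.
		return ctx.Err()
	}
}

func main() {
	applyBlockChan := make(chan *blockApplicationRequest)

	// Drain loop: applies each block and reports the result.
	go func() {
		for req := range applyBlockChan {
			time.Sleep(50 * time.Millisecond) // simulate slow block application
			// With an unbuffered errChan this send blocks forever once
			// the requester has exited on ctx.Done(), wedging the loop
			// for all subsequent requests.
			req.errChan <- errors.New("block application failed")
		}
	}()

	// The context expires before block application finishes.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	fmt.Println(requestApplication(ctx, applyBlockChan)) // context deadline exceeded

	// The drain loop is now stuck at `req.errChan <- err`, so it can
	// never accept another request from applyBlockChan.
	req2 := &blockApplicationRequest{errChan: make(chan error)}
	select {
	case applyBlockChan <- req2:
		fmt.Println("drain loop still alive")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("drain loop is wedged")
	}
}
```

Once wedged, every later block application times out the same way, which matches the observed "timeouts forever, no new blocks" symptom.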
The fix, of course, is to initialize `errChan` with a capacity (of 1; no more than one message will ever be sent on a given `errChan`, so any larger capacity is simply wasteful). This means the `errChan` sender (i.e., the `applyBlockChan` drain loop) never blocks, and all is well again.
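With the same hypothetical names as the sketch above, the one-line change of giving `errChan` capacity 1 keeps the drain loop alive even when a requester has already given up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical request shape, names invented for illustration.
type blockApplicationRequest struct {
	errChan chan error
}

func main() {
	applyBlockChan := make(chan *blockApplicationRequest)

	// Drain loop: with a buffered errChan the send completes
	// immediately, whether or not the requester is still listening.
	go func() {
		for req := range applyBlockChan {
			time.Sleep(50 * time.Millisecond) // simulate slow block application
			req.errChan <- errors.New("block application failed")
		}
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	// FIX: capacity 1 is sufficient, since exactly one result is
	// ever sent per request.
	req := &blockApplicationRequest{errChan: make(chan error, 1)}
	applyBlockChan <- req
	select {
	case err := <-req.errChan:
		fmt.Println(err)
	case <-ctx.Done():
		fmt.Println(ctx.Err()) // context deadline exceeded
	}

	// Prove the drain loop did not wedge: it accepts another request.
	req2 := &blockApplicationRequest{errChan: make(chan error, 1)}
	select {
	case applyBlockChan <- req2:
		fmt.Println("drain loop still alive")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("drain loop is wedged")
	}
	fmt.Println(<-req2.errChan)
}
```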
Brief description
Make the error channel a buffered channel so that if block application times out, the calling goroutine does not get blocked.