koinos / koinos-p2p

The p2p microservice orchestrates the distribution of blocks and transactions between peers.
MIT License
6 stars 4 forks source link

Make the applicator error channel bufferred #232

Closed mvandeberg closed 2 years ago

mvandeberg commented 2 years ago

Brief description

Make the error channel a buffered channel so that if the block application times out, the calling go routine does not get blocked.

Checklist

Demonstration

theoreticalbts commented 2 years ago

For posterity I'll post some of the investigation that led to this patch.

The investigation started when some nodes stopped getting new blocks after about 400,000 - 600,000 blocks.

I eventually realized this behavior is consistent with a problem in the block applicator. As of this writing the block applicator is (1) fairly new and (2) recently had its plumbing modified. This didn't yet explain the problem, but it did suggest further investigation of the block applicator may be fruitful.

Then I saw the error channel was initialized without a capacity, which is very suspicious, because usually concurrent code needs the semantics of a queue, not a rendezvous. (I would go so far as to suggest amending our Go coding standards to say that any code that initializes a channel without a capacity should fail review, unless it has a comment carefully explaining why it's correct to use rendezvous in this specific situation.)

Once the potential problem line was found, the hard part was over. It was comparatively simple to reason through the consequences and confirm the diagnosis is consistent with the symptoms:

At this point, we can pretty confidently declare we've found the bug: A rare but not impossible event putting p2p into a state where it sees timeouts on all subsequent blocks, but those blocks never hit chain, is consistent with the above explanation; the observed symptoms in the logs; and the fact that affected nodes processed hundreds of thousands of blocks successfully before the problem appeared, and got stuck at different places.

The fix, of course, is to initialize errChan with a capacity (of 1; no more than one message will ever be sent on a given errChan, so any larger capacity is simply wasteful). This means the errChan sender (i.e., the applyBlockChan drain loop) never blocks, and all is well again.