koinos / koinos-p2p

The p2p microservice orchestrates the distribution of blocks and transactions between peers.
MIT License
6 stars 4 forks source link

[BUG]: On testnet 4, unknown previous block errors during sync #217

Closed theoreticalbts closed 2 years ago

theoreticalbts commented 2 years ago

Is there an existing issue for this?

Current behavior

When attempting to sync Testnet 4, I got quite a few of these:

p2p_1                  | 2022-08-23 20:26:19.075986 (p2p.Koinos) [p2p/error_handler.go:81] <info>: Encountered peer error: QmcGiTpSm6YrmYo3rWoqrCPez2aJY4VdraBQsGsZKwFRuG, block application failed: local RPC error ApplyBlock, chain rpc error, unknown previous block. Current error score: 104841

The error score increased until the peer was disconnected, and I ended up with no peers.

Expected behavior

This error shouldn't occur (unless there's a peer doing something weird, like running possibly malicious modified p2p code, or sending blocks out of order).

Steps to reproduce

Environment

- OS: Ubuntu 22.04 (x86-64)

Anything else?

I have a guess of what's causing the error. I haven't checked the code, it's just a hypothesis that seems to fit my observations.

Normally parallelizing requests A, B, C is good for performance but it causes a race for block submissions because the requests are dependent. If A=submit block 101, B=submit block 102, C=submit block 103, you can't parallelize A, B, C. And that kind of rapid fire submission of a chain of dependent blocks is exactly what happens during a sync.

I think that somewhere, somehow, requests are getting improperly parallelized or reordered. Where could this be happening? Here are my guesses:

The words there might be in italics means "This is something that, if it's actually this way, would explain the bug. But I haven't looked to see if it's actually this way, so it might not actually be this way."

So if this is our problem, two solutions immediately come to mind:

Simple solution: Add a lock to p2p so only one block submission can be in flight at a time. This may have severe performance penalties for sync.

Complicated solution: Create an API that allows submitting multiple blocks in a single batch message. This guarantees the messages aren't reordered by putting them into a single application-level message. The p2p code for properly using such an API would be moderately complicated:

The idea of the complicated solution is this: The simple solution has a performance penalty for sync because the interprocess round-trip time dominates the time required to process a block. You can't pipeline and are forced to wait for that round-trip, because some thread or async thing (or possibly RabbitMQ semantics allowing reordering of messages) causes a race if you try.

The complicated solution recognizes that you still have a restriction of one block submission at a time, but if you make that submission a batch, you can amortize the round-trip time over the batch size and improve performance-per-block.