Open arkpar opened 5 years ago
We might want to use bounded channels where we currently use unbounded. There are currently two sets of channels used:
A channel from network-libp2p to sync. The protocol_sender
used here https://github.com/paritytech/substrate/blob/a6ca090c3e2e44b24f34319648f66b64575369e0/core/network/src/service.rs#L540
A channel from sync/protocol to the import-queue, used when import-blocks
or import-justification
is called on the ImportQueue
trait.
Then the import-queue actually uses the same protocol
channel as libp2p to communicate import results back to sync via the NetworkLink
. https://github.com/paritytech/substrate/blob/a6ca090c3e2e44b24f34319648f66b64575369e0/core/network/src/service.rs#L76
@tomaka would it be acceptable to block the networking runtime when protocol is slow to handle a message sent to it? This would happen if we were to use a bounded channel, which would block on send
until the message is received by sync/protocol.
If acceptable, we could bound the channel to sync, and the channel from sync to the import-queue, making network-libp2p block until message are handled by sync, and making sync block when sending an import message to the import-queue until that message has been handled(not until the import has finished, just until the message is received). Currently import-requests would basically queue up in the buffer of the channel.
Then I would suggest we introduce a dedicated import-queue -> sync channel to communicate import results back, and make that unbounded. That way the "input" in the system is bounded, but not the "output" from the importing. Otherwise the import-queue and sync could end-up deadlocking when both are waiting for the other to receive a message...
I would also like to point to potential timing based attacks Loophole: Timing Attacks on Shared Event Loops in Chrome.
For example currently, when a block import fails, we immediatly go into "error" mode at https://github.com/paritytech/substrate/blob/a6ca090c3e2e44b24f34319648f66b64575369e0/core/consensus/common/src/import_queue.rs#L501
This is an example of a potential timing loophole, since a batch of blocks containing an error will "finish" more quickly than if we were to actually import all of them. A potential solution is to make import_a_batch_of_blocks
always take a constant amount of time, regardless of whether there was an "early exit".
@gterzian The block is public anyway, why would you want to attack to reveal the content of a block? These constant timing stuff is mostly used to prevent leakage of information. If we always take constant time to check a block, we are probably open an attack vector. Regarding networking stuff, networking should support back-pressure and thus using a bounded channel should be no problem.
Since we ask only one single node at a time, if that node takes a long time to respond on purpose, we might not get the block data in time.
We should modify the code to start asking other nodes over time if the response doesn't come.
Syncing code should be reviewed for any potential