Halt the node if blocks don't arrive in a timely manner

hackaugusto commented 4 years ago

Abstract

The layer 1 node used by Raiden may fail to synchronize with the canonical chain. This can happen for many reasons: DoS attacks, human error, bugs, network problems, etc.

Motivation

If the layer 1 node fails to synchronize with the chain, important events won't be available (namely the channel close). Because of this uncertainty, if the node detects a lack of incoming blocks it has to stop accepting payments, since it may be impossible to claim the funds once the channel has been settled.

The assumption here is that even though the layer 1 node used by the Raiden node may not be operating properly, one of the available monitoring services will be able to operate. Proper settlement will therefore still be available, and the user's node only has to stop accepting payments.

Specification

The race between a channel being closed and payments being accepted will always exist. It is handled by allowing the non-closing participant to call updateTransfer later, until the settlement window is over, using the latest received balance proof. So accepting transfers is not a problem in itself, as long as updateTransfer can be safely called.

The problem we have to fix here is to detect when the layer 1 node is lagging, and to define for how many blocks this is tolerable.
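One possible way to approach the detection part is to compare the timestamp of the latest block known to the layer 1 node with the local clock. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it is not part of Raiden's codebase, and the caller has to supply an average block time for the chain in use.

```python
def estimated_blocks_behind(
    latest_block_timestamp: int,
    now: int,
    average_block_time: int,
) -> int:
    """Estimate how many blocks the layer 1 node is missing.

    Hypothetical helper: all arguments are in seconds, and
    `average_block_time` is an assumption supplied by the caller.
    """
    if average_block_time <= 0:
        raise ValueError("average_block_time must be positive")

    # A latest block stamped in the future (clock skew) counts as no lag.
    block_age = max(0, now - latest_block_timestamp)

    # Whole blocks that should have arrived since the latest known block.
    return block_age // average_block_time
```

This only infers lag from block age, so an unusually long gap between blocks looks the same as a stalled node. The estimate would then be compared against the number of blocks the node is willing to tolerate.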

By definition, reveal_timeout is how long it takes for a transaction to be mined (considering block and transaction propagation delays), and settle_timeout is the number of blocks a channel allows for updateTransfer to be called. So min(channel.settle_timeout - channel.reveal_timeout for channel in all_channels) is the limit on how many blocks a Raiden node can tolerate being behind the canonical chain.
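The limit above can be written out as a small function. This is only a sketch: the Channel class here is a hypothetical stand-in that carries the two timeouts (in blocks), not Raiden's actual channel state object, and returning None when there are no channels is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Channel:
    """Hypothetical stand-in holding only the two timeouts, in blocks."""
    settle_timeout: int
    reveal_timeout: int


def max_tolerable_lag(all_channels: Iterable[Channel]) -> Optional[int]:
    """Return the tightest settle_timeout - reveal_timeout margin.

    Returns None when there are no channels, since no limit applies.
    """
    margins = [
        channel.settle_timeout - channel.reveal_timeout
        for channel in all_channels
    ]
    return min(margins) if margins else None
```

For example, with one channel using settle_timeout=500, reveal_timeout=50 and another using settle_timeout=100, reveal_timeout=20, the margins are 450 and 80 blocks, so the node could tolerate being at most 80 blocks behind.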

The previous claim assumes that transactions can be successfully sent to the network through the lagging node. If that is not the case, then the delay for the messages to propagate to the monitoring services, and for them to react, has to be factored in.

On top of the previous upper limit we should consider good behavior, where a node should behave so as to minimize the number of transactions sent. This means we have to take into account the behavior of the monitoring services. During normal operation, once a node realizes one of its channels is closed, it will readily stop using it, which reduces the number of transactions the node has to send on chain [1]. The node therefore has to consider when monitoring services would start sending transactions on the user's behalf. This is defined by the firstBlockAllowedToMonitor function, which allows an MS to monitor on the user's behalf after 30% of the settlement window has elapsed, giving min(channel.settle_timeout * 0.3 for channel in all_channels). The minimum of the two formulas above is the upper bound for a well-behaved node.

1- Continuing to accept transfers on a channel that is known to be closed doesn't make sense; in effect it throws away the security margin from the settlement window. If one doesn't want this margin, one may as well open the channel with a smaller settlement timeout.
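The upper bound for a well-behaved node, as described in this specification, combines the settle/reveal margin with the 30% monitoring threshold. A minimal sketch follows, assuming channels are given as hypothetical (settle_timeout, reveal_timeout) tuples in blocks and that rounding the 30% figure down to whole blocks is acceptable. The function name and the rounding choice are illustrative and not taken from Raiden or the contracts.

```python
from typing import Iterable, Optional, Tuple

# 30% of the settlement window, taken from the text above.
MONITORING_THRESHOLD_PERCENT = 30


def well_behaved_lag_limit(
    all_channels: Iterable[Tuple[int, int]],
) -> Optional[int]:
    """Return the number of blocks a well-behaved node may lag.

    Each channel is a hypothetical (settle_timeout, reveal_timeout) pair.
    Returns None when there are no channels.
    """
    channels = list(all_channels)
    if not channels:
        return None

    # Hard limit: updateTransfer must still be callable in time.
    hard_limit = min(settle - reveal for settle, reveal in channels)

    # Soft limit: react before a monitoring service is allowed to act.
    # Integer arithmetic rounds down (an assumption of this sketch).
    monitoring_limit = min(
        settle * MONITORING_THRESHOLD_PERCENT // 100
        for settle, _reveal in channels
    )

    return min(hard_limit, monitoring_limit)
```

With channels (500, 50) and (100, 20), the hard limit is 80 blocks and the monitoring limit is 30 blocks, so the well-behaved bound is 30. With a single channel (100, 90), the hard limit of 10 blocks is the tighter one.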

Backwards Compatibility

This is backwards compatible.

christianbrb commented 3 years ago

@karlb

@Dominik1999 recommended that we implement this in the Light Client.

Is there a reason why this hasn't been implemented in Raiden Py yet? Are there open questions or concerns?

CC @andrevmatos

karlb commented 3 years ago

We check that the eth client is up to date when starting the Raiden node. Checks at run time are not implemented yet, but there is nothing specific blocking us from doing so.
