lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.7k stars 2.09k forks source link

Handle race condition in peer connection #6788

Open yyforyongyu opened 2 years ago

yyforyongyu commented 2 years ago

The Issue

When Alice and Bob each makes a connection to the other side at the same time, it's likely neither connection could succeed. For instance, if Alice restarts while having a channel with Bob, the following will happen,

  1. Bob notices the connection is broken, and will attempt to connect, hence making an outbound connection. Once the attempt is made, any inbound connection will be terminated.
  2. Once Alice finishes restarting, she will make a connection to Bob, hence making an outbound connection, and any inbound connection will be terminated.
  3. Alice will cancel Bob's connection attempt since she has an outbound connection, and,
  4. Bob will cancel Alice's connection attempt since he has an outbound connection.
  5. Neither side will succeed, and will both retry after a few seconds. If unlucky, the connection could remain broken.

Here's the pruned logs taken from itest build, where Carol restarts and Bob fails the connection.

Proposed Solutions

It'd be nice to have the specs specifying who should be the initiator of the connection, or when to drop a connection. For instance, we could have,

  1. it's always the channel initiator makes the connection. This makes sense since if one side is offline, the other side won't be able to make a connection. And if both are online, channel initiator can make the connection to avoid the above race. However, this doesn't solve persistant connections, where nodes are connected manually with a channel open.
  2. when in conflict, always prefer the connection from the node with the larger pub key(lexi order). We could apply this preference to connection initiator too - whoever has the larger pub key initiates the connection request. This should solve the above connection race.

To properly fix it, we may need to flatten some of the logic used in peer connection so we can understand what's going on more easily. Ideally we'd refactor the server.go to make the peer connection its own service.

morehouse commented 1 year ago

Wouldn't randomized backoff be a simpler fix?

Edit: It looks like we do randomized backoff already in server.go. So I guess my question is, why doesn't that avoid this issue?

yyforyongyu commented 1 year ago

@morehouse because both would back off at the same time. During that period, no connection would be alive.

morehouse commented 1 year ago

But if the backoff is sufficiently randomized, then e.g. Alice would wait 5 seconds and Bob would wait 10. So Alice will be able to connect while Bob is still waiting.

yyforyongyu commented 1 year ago

Alice would wait 5 seconds and Bob would wait 10.

During this 5-second period, no connection will be alive, and this is what we want to resolve here. Meanwhile, as stated in the description,

Neither side will succeed, and will both retry after a few seconds. If unlucky, the connection could remain broken.

backoff is sufficiently randomized

What does "sufficiently" mean here? If the range is 1 to 10 seconds, the `if unlucky" part is likely to be true. If the range is really large, say 1 to 3600 seconds, one side may risk not having a live connection for a very long time.

morehouse commented 1 year ago

What does "sufficiently" mean here? If the range is 1 to 10 seconds, the `if unlucky" part is likely to be true. If the range is really large, say 1 to 3600 seconds, one side may risk not having a live connection for a very long time.

Sufficiently randomized would mean, "randomized enough to avoid this problem".

From the attached logs, the collision is happening when connections happen within 0.2 seconds of each other. If this is really common enough to be a problem, I think we should easily be able to fine-tune the backoff parameters to make such collisions unlikely.