Open yyforyongyu opened 2 years ago
Wouldn't randomized backoff be a simpler fix?
Edit: It looks like we do randomized backoff already in server.go. So I guess my question is, why doesn't that avoid this issue?
@morehouse because both would back off at the same time. During that period, no connection would be alive.
But if the backoff is sufficiently randomized, then e.g. Alice would wait 5 seconds and Bob would wait 10. So Alice will be able to connect while Bob is still waiting.
> Alice would wait 5 seconds and Bob would wait 10.
During this 5-second period, no connection will be alive, and this is what we want to resolve here. Meanwhile, as stated in the description,
> Neither side will succeed, and will both retry after a few seconds. If unlucky, the connection could remain broken.

> backoff is sufficiently randomized
What does "sufficiently" mean here? If the range is 1 to 10 seconds, the "if unlucky" part is likely to be true. If the range is really large, say 1 to 3600 seconds, one side may risk not having a live connection for a very long time.
> What does "sufficiently" mean here? If the range is 1 to 10 seconds, the "if unlucky" part is likely to be true. If the range is really large, say 1 to 3600 seconds, one side may risk not having a live connection for a very long time.
"Sufficiently randomized" would mean "randomized enough to avoid this problem".
From the attached logs, the collision happens when the two connection attempts occur within 0.2 seconds of each other. If this is really common enough to be a problem, I think we should easily be able to fine-tune the backoff parameters to make such collisions unlikely.
The Issue
When Alice and Bob each make a connection to the other side at the same time, it's likely that neither connection will succeed. For instance, if Alice restarts while having a channel with Bob, the following will happen,
Here are the pruned logs taken from an itest build, where Carol restarts and Bob fails the connection.
Proposed Solutions
It'd be nice for the specs to define who should be the initiator of the connection, or when to drop a connection. For instance, we could have,
To properly fix it, we may need to flatten some of the logic used in peer connection so we can understand what's going on more easily. Ideally we'd refactor `server.go` to make the peer connection its own service.
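As an illustration of what a deterministic "who should be the initiator" rule could look like, here is a minimal sketch. The rule itself (the node with the lexicographically smaller public key keeps its outbound connection) and the function name `keepOutbound` are assumptions made for this example; they are not taken from the issue, from lnd, or from the specs.

```go
package main

import (
	"bytes"
	"fmt"
)

// keepOutbound reports whether the local node should keep its own outbound
// connection when both peers dial each other at the same time.
// Hypothetical rule: the node with the lexicographically smaller public key
// is treated as the initiator; the other side drops its outbound attempt
// and accepts the inbound one instead.
func keepOutbound(localPub, remotePub []byte) bool {
	return bytes.Compare(localPub, remotePub) < 0
}

func main() {
	// Shortened placeholder keys, for illustration only.
	alice := []byte{0x02, 0xaa}
	bob := []byte{0x03, 0x01}

	fmt.Println("alice keeps outbound:", keepOutbound(alice, bob))
	fmt.Println("bob keeps outbound:  ", keepOutbound(bob, alice))
}
```

Because both sides evaluate the same comparison on the same pair of keys, they reach opposite answers without any extra round trip, so exactly one of the two simultaneous connections would survive.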