swarm: better backoff logic

Stebalien commented 7 years ago

We should try to distinguish between local failures and remote failures. At the very least, we should be resetting our backoffs when new links/routes come online.
We should probably be backing off on a per multiaddr basis, not a per peer basis (unless we establish a connection to the peer and it tells us to go away; that needs a new protocol, related to https://github.com/libp2p/go-libp2p/issues/238).
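
As a rough illustration of per-address backoff with a reset hook, here is a minimal sketch. The `addrBackoff` type and its methods are hypothetical names invented for this example, not part of go-libp2p, and addresses are plain strings rather than real multiaddr values.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// addrBackoff is a hypothetical table that tracks dial backoff per address
// string (for example "/ip4/192.0.2.1/tcp/4001") instead of per peer.
type addrBackoff struct {
	mu    sync.Mutex
	until map[string]time.Time // address -> time before which we should not redial
}

func newAddrBackoff() *addrBackoff {
	return &addrBackoff{until: make(map[string]time.Time)}
}

// Fail records a failed dial to addr and backs it off for d.
func (b *addrBackoff) Fail(addr string, now time.Time, d time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.until[addr] = now.Add(d)
}

// Blocked reports whether addr is still backing off at time now.
func (b *addrBackoff) Blocked(addr string, now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return now.Before(b.until[addr])
}

// Reset clears every entry. A caller would invoke it when a new local
// link or route comes online, since earlier failures may have been local.
func (b *addrBackoff) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.until = make(map[string]time.Time)
}

func main() {
	b := newAddrBackoff()
	now := time.Now()
	b.Fail("/ip4/192.0.2.1/tcp/4001", now, 10*time.Second)
	fmt.Println(b.Blocked("/ip4/192.0.2.1/tcp/4001", now)) // true
	fmt.Println(b.Blocked("/ip4/192.0.2.2/tcp/4001", now)) // false
	b.Reset()
	fmt.Println(b.Blocked("/ip4/192.0.2.1/tcp/4001", now)) // false
}
```

Keying on the address means one unreachable address does not block other addresses of the same peer.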

Came up in: https://github.com/libp2p/go-libp2p-kad-dht/issues/96

mishto commented 6 years ago

Can we expose baseBackoffTime and maxBackoffTime? The default values are arbitrary and different applications may want different settings.

Stebalien commented 6 years ago

Fair enough. Also, it looks like our backoffs aren't actually exponential...
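
For comparison, a truly exponential backoff with caller-supplied bounds might look like the sketch below. The `backoffConfig` type is hypothetical; its fields only echo the baseBackoffTime and maxBackoffTime names from the request above, and the 5 second and 5 minute values are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// backoffConfig is a hypothetical settings struct, not a real go-libp2p API.
type backoffConfig struct {
	Base time.Duration // delay after the first failure
	Max  time.Duration // upper bound on any delay
}

// Delay returns the wait after the given number of consecutive failures:
// Base * 2^(tries-1), capped at Max. tries <= 0 yields no delay.
func (c backoffConfig) Delay(tries int) time.Duration {
	if tries <= 0 {
		return 0
	}
	d := c.Base
	for i := 1; i < tries; i++ {
		if d > c.Max/2 {
			// Doubling would exceed Max (and could eventually overflow).
			return c.Max
		}
		d *= 2
	}
	if d > c.Max {
		return c.Max
	}
	return d
}

func main() {
	// Assumed example values.
	cfg := backoffConfig{Base: 5 * time.Second, Max: 5 * time.Minute}
	for tries := 1; tries <= 8; tries++ {
		fmt.Println(tries, cfg.Delay(tries))
	}
}
```

With these values the delays run 5s, 10s, 20s, 40s, 1m20s, 2m40s, and then stay at 5m0s.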

Stebalien commented 6 years ago

This will be fixed in the large refactor/simplification that's coming down the pipe.

Stebalien commented 6 years ago

Note to self: Refund backoff "tries" after a period of time. Currently, if we go to max-backoff, wait an hour, and then fail a single dial, we'll wait the max backoff again. We should, instead, notice that an hour has passed and forget all the previous failures.

Code:

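    now := time.Now()
    // Only refund once more than BackoffBase has elapsed since the backoff
    // expired; otherwise the Sqrt argument is negative and yields NaN.
    if sinceLast := now.Sub(bp.until); sinceLast > BackoffBase {
        // Refund backoff time at the same rate.
        refund := int(math.Sqrt(float64((sinceLast - BackoffBase) / BackoffCoef)))
        if refund < bp.tries {
            bp.tries -= refund
        } else {
            bp.tries = 0
        }
    }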

Not going to do this now because we have so many other changes in the pipeline and we may want to discuss this.
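
To make the refund arithmetic easier to check, here is a standalone sketch of the same idea. It assumes the backoff window grows as base + coef * tries², which is what the square root in the snippet above implies. The constant values and the `refundTries` helper are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Assumed values for illustration; the real constants may differ.
const (
	backoffBase = 5 * time.Second
	backoffCoef = time.Second
)

// refundTries returns the failure count that remains after sinceLast has
// elapsed beyond the end of the last backoff window. It assumes the window
// grows as backoffBase + backoffCoef*tries^2, so the refund is the inverse.
func refundTries(tries int, sinceLast time.Duration) int {
	if sinceLast <= backoffBase {
		// Too little time has passed to refund anything.
		return tries
	}
	refund := int(math.Sqrt(float64((sinceLast - backoffBase) / backoffCoef)))
	if refund >= tries {
		return 0
	}
	return tries - refund
}

func main() {
	fmt.Println(refundTries(10, 2*time.Second))  // 10
	fmt.Println(refundTries(10, 30*time.Second)) // 5
	fmt.Println(refundTries(10, time.Hour))      // 0
}
```

After an hour of quiet the count drops to zero, so a single new failure starts again from the shortest backoff instead of the maximum.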

mishto commented 6 years ago

Sounds good, thanks.

Stebalien commented 4 years ago

Working through all the different backoff cases:

Backoff trying to find a peer.
- This definitely belongs down in the DHT, or as a wrapper around the DHT.
Backoff a port/ip because a TCP dial failed.
- This could happen inside the transport or inside the swarm itself.
- If it happens inside the transport, we'd need a shared backoff module for backing off dialing multiaddrs with certain prefixes.
- If it happens inside the swarm, we'd need some way to report the backoff to the swarm. We'd probably do this by returning a special error.
Backoff an IP when we get a "no route to IP" error.
- Same as above.
Backoff a port/ip/peer triple when we end up dialing the wrong peer.
- Same as above.
Backoff a peer/transport when we fail to negotiate a muxer/security transport.
- This is an interesting case. Really, we want to back off the entire peer for all transports using the upgrader. This is a case where applying the backoff from within the transport is really the only solution that makes sense (as the transport knows what sub-transports it uses).
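
The "special error" option from the cases above could be sketched roughly as follows. `BackoffError`, `dial`, and `backoffFrom` are hypothetical names made up for this example, not existing libp2p interfaces, and the prefix is a plain string rather than a real multiaddr.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BackoffError is a hypothetical error a transport could return to ask the
// swarm to back off every address that starts with Prefix.
type BackoffError struct {
	Prefix string        // multiaddr prefix, e.g. "/ip4/192.0.2.1"
	For    time.Duration // how long to avoid redialing
	Cause  error         // underlying dial failure
}

func (e *BackoffError) Error() string {
	return fmt.Sprintf("back off %s for %s: %v", e.Prefix, e.For, e.Cause)
}

func (e *BackoffError) Unwrap() error { return e.Cause }

// dial stands in for a transport dial that fails with "no route".
func dial(addr string) error {
	cause := errors.New("no route to host")
	return fmt.Errorf("dial %s: %w", addr, &BackoffError{
		Prefix: "/ip4/192.0.2.1", For: 30 * time.Second, Cause: cause,
	})
}

// backoffFrom extracts the backoff request from err, if there is one.
func backoffFrom(err error) (*BackoffError, bool) {
	var be *BackoffError
	ok := errors.As(err, &be)
	return be, ok
}

func main() {
	err := dial("/ip4/192.0.2.1/tcp/4001")
	if be, ok := backoffFrom(err); ok {
		fmt.Println(be.Prefix, be.For) // /ip4/192.0.2.1 30s
	}
}
```

Because the request travels inside an ordinary wrapped error, the swarm can recover it with `errors.As` without any change to the dial signature.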

Stebalien commented 4 years ago

Status: While @petar's patches are likely the right way to go in the future, they introduce quite a few new interfaces that'll need to be discussed. In the interest of getting a fast fix in, @willscott is implementing (#191) a dumb version that just backs off full addresses inside the swarm itself without changing core libp2p interfaces.

That gives us some breathing room.

libp2p / go-libp2p

swarm: better backoff logic #1554