Reconsider local network timeout

libp2p / go-libp2p

libp2p implementation in Go

Reconsider local network timeout #666

Open Stebalien opened 5 years ago

Stebalien commented 5 years ago

We currently time out "local" dials after 5 seconds. Unfortunately, "local" latencies can actually be high (e.g. over VPNs).

See https://github.com/ipfs/go-ipfs/issues/6468.

Given that we already have a 10s timeout in the TCP transport (for the handshake), we should consider dropping this (or at least limiting it to localhost).

raulk commented 5 years ago

I’m leaning towards not generalising such a high timeout in order to cater for special cases (VPNs). I think we need a dynamic timeout service that runs in the background and:

1. senses whether we’re running behind a VPN by introspecting local interfaces and applying some other heuristics.
2. for private addresses, tracks RTT across different subnet masks and sets timeouts accordingly.

For 2, I can see us starting with speculative, high timeouts as soon as we boot up, adjusting them downwards as we observe connection establishment times for subnets, and upwards when we witness timeouts (retrying those automatically to avoid false backoffs). They would eventually converge to appropriate values, which we could then maybe share with local peers via a dedicated protocol.

Stebalien commented 5 years ago

So, what do we tell users running private IPFS clusters?

raulk commented 5 years ago

That we want to design things carefully, and avoid generalising for special cases? Can IPFS add a config option to set this timeout? IIRC it’s a var in libp2p. And all timeouts should be configurable anyway.

Stebalien commented 5 years ago

We could, but most users will just say "ipfs doesn't work" and leave. Really, I'm not sure how useful this additional timeout is (5 seconds for undialable local peers versus 10 for all others).

raulk commented 5 years ago

The flip side of this is routers that swallow packets for unroutable local addrs. I recently had this in a public setting, and it consumed my dial queues for longer than necessary. So in this situation, I’d rather fail fast. That’s why we can’t generalise.

I’d say the average novice is more likely to be running in a setting like the one I experienced. If you’re running a private cluster over VPN, you’re slightly more advanced.

An easy tradeoff right now is to detect VPN interfaces and print a tip on stdout if the setting is not set.

EDITED @Stebalien