solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

[RfC] Node-to-node network latency measurement #10084

Open leoluk opened 4 years ago

leoluk commented 4 years ago

Problem

End-to-end latency and geographical clustering have a huge impact on cluster performance, but they are impossible to reason about without measuring real end-to-end network latency (scenic routing, asymmetric routes, congested peerings, and other unexpected network topologies).

Proposed Solution

Implement an additional UDP server on its own port, such that it has its own queue (perhaps a good use for the storage port?). This service implements a simple, stateless echo request/reply mechanism.

Each node sends echo requests to every other node in the network at a low-but-reasonable interval (50ms? 100ms?), compresses the measurements and makes them available in a yet-to-be-determined fashion.
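A rough sketch of what the echo mechanism could look like (the port, message format, and function names below are placeholders, not existing Solana code):

```rust
use std::net::UdpSocket;
use std::time::{Duration, Instant};

/// Hypothetical stateless echo server: reflects every datagram back to its sender.
fn run_echo_server(bind_addr: &str) -> std::io::Result<()> {
    let socket = UdpSocket::bind(bind_addr)?;
    let mut buf = [0u8; 64];
    loop {
        let (len, peer) = socket.recv_from(&mut buf)?;
        socket.send_to(&buf[..len], peer)?;
    }
}

/// Hypothetical probe: sends one echo request to `peer` and returns the measured RTT.
fn measure_rtt(socket: &UdpSocket, peer: &str, nonce: u64) -> std::io::Result<Duration> {
    socket.set_read_timeout(Some(Duration::from_millis(500)))?;
    let sent_at = Instant::now();
    socket.send_to(&nonce.to_le_bytes(), peer)?;
    let mut buf = [0u8; 64];
    loop {
        let (len, _) = socket.recv_from(&mut buf)?;
        // Match the reply to our request so stray packets don't skew the sample.
        if len == 8 && buf[..8] == nonce.to_le_bytes() {
            return Ok(sent_at.elapsed());
        }
    }
}
```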

aeyakovenko commented 4 years ago

you don't think we can measure this from the votes?

#[derive(Serialize, Default, Deserialize, Debug, PartialEq, Eq, Clone)]
pub struct Vote {
    /// A stack of votes starting with the oldest vote
    pub slots: Vec<Slot>,
    /// signature of the bank's state at the last slot
    pub hash: Hash,
    /// processing timestamp of last slot
    pub timestamp: Option<UnixTimestamp>,
}

They are already in CRDS; every node periodically sets the timestamp.
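For illustration, the vote timestamp could be compared against the local clock when the vote arrives; a minimal sketch, optimistically assuming synchronized clocks (`estimate_delay_secs` is hypothetical):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

type UnixTimestamp = i64; // seconds, matching the Vote struct above

/// Rough one-way delay estimate from a vote's timestamp. Gossip propagation
/// delay, vote timing, and clock skew are all folded into the result.
fn estimate_delay_secs(vote_timestamp: UnixTimestamp) -> Option<i64> {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?.as_secs() as i64;
    Some(now - vote_timestamp)
}
```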

leoluk commented 4 years ago

The gossip network is not necessarily a full mesh (or is it?), and the timestamps are subject to gossip propagation delay, gossip queue congestion, and vote timing. A dedicated echo service wouldn't have any of these confounding factors and would measure only the network latency from each node to every other node.

Updated the title to clarify.

aeyakovenko commented 4 years ago

Ah, I see. You want to sample the RTT between any two nodes. We probably need a separate message for that. With eager push in gossip, only a subset of the nodes will be 1 hop away. I think the gossip fanout is 6.

leoluk commented 4 years ago

Yes, hence the proposal to have a separate UDP service for the echo server. That way there'll be a separate queue, and we can better differentiate between network latency and application-layer congestion (like a gossip flood or queue drops).

brianlong commented 4 years ago

FWIW, I am currently recording ping times from my TdS node to all the others. Not everyone is responding to ping, so the data is incomplete, but it does start to give an indication of which nodes are fast or slow (from my node currently in NYC). It will be awesome to aggregate similar data from other nodes.

At the moment, I am running the Ruby script in a single thread every 10 minutes. I intend to use the data for general curiosity & reporting purposes, so I didn't see the need for a shorter sample period. I expect to see some correlation between a node's average network performance and skipped blocks. I haven't done that analysis yet...

I can see a faster sample rate being helpful when Rampy returns.

I am not a Rust developer, but I will do what I can to help!

-- BL

aeyakovenko commented 4 years ago

What about using the health check RPC? https://github.com/solana-labs/solana/pull/9505 Users already choose whether to expose the RPC publicly or not. One other option is adding a timestamp to pull requests.

leoluk commented 4 years ago

Gossip has lots of confounding factors that might distort measurement results (was the network slow or the gossip thread busy dealing with a flood?).

RPC is TCP and therefore not representative of UDP latency - many ISPs treat UDP differently during congestion. We would also have to gather tcp_rtt and tcp_rttvar data from the kernel rather than measuring at the application layer.
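For reference, the kernel's smoothed TCP RTT and RTT variance can be read per socket on Linux via TCP_INFO; a sketch using the `libc` crate, illustrative only:

```rust
// Linux-only; requires the `libc` crate. Field names follow the kernel's `struct tcp_info`.
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

/// Returns the kernel's smoothed RTT and RTT variance for a connected TCP
/// socket, both in microseconds.
fn kernel_tcp_rtt(stream: &TcpStream) -> std::io::Result<(u32, u32)> {
    let mut info: libc::tcp_info = unsafe { std::mem::zeroed() };
    let mut len = std::mem::size_of::<libc::tcp_info>() as libc::socklen_t;
    let rc = unsafe {
        libc::getsockopt(
            stream.as_raw_fd(),
            libc::IPPROTO_TCP,
            libc::TCP_INFO,
            &mut info as *mut _ as *mut libc::c_void,
            &mut len,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((info.tcpi_rtt, info.tcpi_rttvar))
}
```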

aeyakovenko commented 4 years ago

Ok, makes sense. Is this something you want to add? It should be fairly easy since it's largely independent of the core. We would need to propagate this as a start argument.

leoluk commented 4 years ago

We can build the analytics backend that aggregates and makes sense of the data (much of it already exists as part of another project). As for collecting and exposing the data in Solana, we probably won't have the short- to medium-term engineering capacity to build it.

aeyakovenko commented 4 years ago

@leoluk What about some rules for enabling ICMP for select validators? Folks that want to do this could run the iptables commands to whitelist everyone that has some minimal stake in the network.

leoluk commented 4 years ago

@aeyakovenko Hmm, we can accurately estimate 1:n latency by measuring last-hop latency - this works even if ICMP is blocked. We could ask validators to deploy an active measurement probe alongside their validators, like @brianlong is doing, which collects traceroutes towards every other node in the network. It should be easy to convince validators to allow ICMP Echo Requests, too.

The question is whether this is enough to get an accurate picture of network conditions and detect subclusters - in this example network with one active probe at (a), it would be impossible to measure edges (3) and (4) - the other vertices might be in the same datacenter or continents apart.

Having every node in the network measure their respective latencies to every other node, however, would allow for a highly accurate picture of cluster topology.

brianlong commented 4 years ago

@leoluk By "last-hop latency", are you referring to line 15 in the traceroute below?

brianlong@solana-tds:~$ traceroute testnet.solana.com
traceroute to testnet.solana.com (216.24.140.155), 30 hops max, 60 byte packets
 1  165.227.96.253 (165.227.96.253)  8.777 ms  8.775 ms  8.768 ms
 2  138.197.248.8 (138.197.248.8)  0.933 ms 138.197.248.28 (138.197.248.28)  0.200 ms 138.197.248.8 (138.197.248.8)  0.294 ms
 3  nyk-b3-link.telia.net (62.115.45.5)  0.821 ms nyk-b3-link.telia.net (62.115.45.9)  2.461 ms nyk-b3-link.telia.net (62.115.45.5)  0.834 ms
 4  * * *
 5  nyk-b2-link.telia.net (62.115.137.99)  1.327 ms nyk-b2-link.telia.net (213.155.130.28)  1.974 ms nyk-b2-link.telia.net (62.115.137.99)  1.405 ms
 6  viawest-ic-350578-nyk-b2.c.telia.net (62.115.181.147)  2.186 ms  2.150 ms  2.000 ms
 7  be21.bbrt01.ewr01.flexential.net (148.66.237.190)  39.777 ms  39.746 ms  39.746 ms
 8  be110.bbrt02.chi01.flexential.net (66.51.5.149)  40.185 ms  39.959 ms  39.881 ms
 9  be10.bbrt01.chi01.flexential.net (66.51.5.117)  39.778 ms  39.736 ms  39.779 ms
10  be105.bbrt01.den05.flexential.net (66.51.5.106)  40.339 ms  40.383 ms  40.402 ms
11  be155.bbrt01.den02.flexential.net (148.66.236.209)  40.377 ms  40.045 ms  40.046 ms
12  be10.bbrt02.den02.flexential.net (148.66.237.41)  39.963 ms  40.347 ms  40.168 ms
13  po32.crsw02.den02.viawest.net (148.66.237.45)  39.870 ms  39.607 ms  39.714 ms
14  te7-1.aggm02.den02.flexential.net (148.66.236.227)  40.234 ms  40.020 ms  40.077 ms
15  usr3-ppp20.lvdi.net (216.24.140.148)  39.712 ms  39.483 ms  39.447 ms
16  * * *

leoluk commented 4 years ago

Yes, the downside is that we're measuring the router's CPU usage as well. This means that extra statistical analysis would be necessary.

(plus it can be hard to tell whether the latency is at the first or the last hop unless you can measure both directions)
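One way to tame that noise would be to keep only the minimum and median of each batch of samples, since the minimum is least affected by the router's slow path. A minimal sketch (the function below is illustrative, not existing code):

```rust
/// Reduce a batch of last-hop RTT samples (in ms) to a minimum and median,
/// so that occasional slow-path replies from a router's CPU don't dominate.
fn summarize_rtts(samples: &mut Vec<f64>) -> Option<(f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let min = samples[0];
    let median = samples[samples.len() / 2];
    Some((min, median))
}
```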

behzadnouri commented 4 years ago

For reference, the ping/pong packets added in https://github.com/solana-labs/solana/pull/12794 may be utilized for this purpose. We are already maintaining timestamps of pings for rate-limiting purposes: https://github.com/solana-labs/solana/blob/83799356d/core/src/ping_pong.rs#L33-L35 and can compare against the instant the pong packet arrives.
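A sketch of how that comparison could look (the tracker below is illustrative; the actual ping/pong code keeps its own bookkeeping):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative only: derive RTT from existing ping/pong traffic by
/// remembering when each ping token was sent.
struct RttTracker {
    outstanding: HashMap<u64, Instant>, // ping token -> time the ping was sent
}

impl RttTracker {
    fn new() -> Self {
        Self { outstanding: HashMap::new() }
    }

    fn record_ping(&mut self, token: u64) {
        self.outstanding.insert(token, Instant::now());
    }

    fn record_pong(&mut self, token: u64) -> Option<Duration> {
        self.outstanding
            .remove(&token)
            .map(|sent_at| sent_at.elapsed())
    }
}
```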

ryoqun commented 3 years ago

As for latency, it should be fairly low (~100-150 ms) across the mainnet-beta/testnet cluster. I got the number from turbine propagation.

A more prominent networking condition would be packet drops. I'm planning to look into it more deeply.

uri-bloXroute commented 2 years ago

@leoluk we (bloXroute) are just starting to expand to Solana, but we have years of experience measuring network performance at very granular levels (it matters a lot for DeFi traders).

Happy to jam and maybe collaborate if you’re interested