leoluk opened this issue 4 years ago
you don't think we can measure this from the votes?
#[derive(Serialize, Default, Deserialize, Debug, PartialEq, Eq, Clone)]
pub struct Vote {
/// A stack of votes starting with the oldest vote
pub slots: Vec<Slot>,
/// signature of the bank's state at the last slot
pub hash: Hash,
/// processing timestamp of last slot
pub timestamp: Option<UnixTimestamp>,
}
They are already in CRDS; every node periodically sets the timestamp.
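For illustration, the naive estimate this suggests, a minimal sketch assuming UnixTimestamp is the usual i64 seconds-since-epoch, would look like:

use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical sketch: the naive one-way "latency" estimate derivable
// from a gossiped vote's timestamp. Note that this measures gossip
// propagation delay plus vote timing, not just network latency.
fn naive_vote_delay_secs(vote_timestamp: i64) -> i64 {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_secs() as i64;
    now - vote_timestamp
}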
The gossip network is not necessarily a full mesh (or is it?), and the timestamps are subject to gossip propagation delay, gossip queue congestion, and vote timing. A dedicated echo service wouldn't have any of these confounding factors and would measure only the network latency from each node to every other node.
Updated the title to clarify.
Ah, I see. You want to sample the RTT between any two nodes. We probably need a separate message for that. With eager push in gossip, only a subset of the nodes will be 1 hop away. I think the gossip fanout is 6.
Yes, hence the proposal to have a separate UDP service for the echo server. That way there'll be a separate queue, and we can better differentiate between network latency and application-layer congestion (like a gossip flood or queue drops).
FWIW, I am currently recording ping times from my TdS node to all the others. Not everyone is responding to ping, so the data is incomplete, but it does start to give an indication of which nodes are fast or slow (from my node currently in NYC). It will be awesome to aggregate similar data from other nodes.
At the moment, I am running the Ruby script in a single thread every 10 minutes. I intend to use the data for general curiosity & reporting purposes, so I didn't see the need for a shorter sample period. I expect to see some correlation between a node's average network performance and skipped blocks. I haven't done that analysis yet...
I can see a faster sample rate being helpful when Rampy returns.
I am not a Rust developer, but I will do what I can to help!
-- BL
What about using the health check RPC? https://github.com/solana-labs/solana/pull/9505 Users already choose whether to expose the RPC publicly or not. One other option is adding a timestamp to pull requests.
Gossip has lots of confounding factors that might distort measurement results (was the network slow or the gossip thread busy dealing with a flood?).
RPC runs over TCP and is therefore not representative of UDP latency; many ISPs treat UDP differently during congestion. We would also have to gather tcp_rtt and tcp_rttvar data from the kernel rather than measuring at the application layer.
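To illustrate the kernel-side alternative: on Linux, the smoothed RTT and its variance for a connected TCP socket can be read via TCP_INFO. A minimal sketch, assuming the libc crate as a dependency:

use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

// Linux-only sketch: read the kernel's smoothed RTT and RTT variance
// (both reported in microseconds) for a connected TCP socket.
fn tcp_rtt_micros(stream: &TcpStream) -> std::io::Result<(u32, u32)> {
    let mut info: libc::tcp_info = unsafe { std::mem::zeroed() };
    let mut len = std::mem::size_of::<libc::tcp_info>() as libc::socklen_t;
    let rc = unsafe {
        libc::getsockopt(
            stream.as_raw_fd(),
            libc::IPPROTO_TCP,
            libc::TCP_INFO,
            &mut info as *mut _ as *mut libc::c_void,
            &mut len,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((info.tcpi_rtt, info.tcpi_rttvar))
}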
Ok, makes sense. Is this something you want to add? It should be fairly easy since it's largely independent of the core. We would need to propagate this as a start argument.
We can build the analytics backend that aggregates and makes sense of the data (much of it already exists as part of another project). As for collecting and exposing the data in Solana, we probably won't have the short- to medium-term engineering capacity to build it.
@leoluk What about some rules for enabling ICMP for select validators? Folks that want to do this could run iptables commands to whitelist everyone that has some minimal stake in the network.
@aeyakovenko Hmm, we can accurately estimate 1:n latency by measuring last-hop latency - this works even if ICMP is blocked. We could ask validators to deploy an active measurement probe alongside their validators, like @brianlong is doing, that collects traceroutes towards every other node in the network. It should be easy to convince validators to allow ICMP Echo Requests, too.
The question is whether this is enough to get an accurate picture of network conditions and detect subclusters - in this example network with one active probe at (a), it would be impossible to measure edges (3) and (4) - the other vertices might be in the same datacenter or continents apart.
Having every node in the network measure their respective latencies to every other node, however, would allow for a highly accurate picture of cluster topology.
@leoluk By "last-hop latency", are you referring to line 15 in the traceroute below?
brianlong@solana-tds:~$ traceroute testnet.solana.com
traceroute to testnet.solana.com (216.24.140.155), 30 hops max, 60 byte packets
1 165.227.96.253 (165.227.96.253) 8.777 ms 8.775 ms 8.768 ms
2 138.197.248.8 (138.197.248.8) 0.933 ms 138.197.248.28 (138.197.248.28) 0.200 ms 138.197.248.8 (138.197.248.8) 0.294 ms
3 nyk-b3-link.telia.net (62.115.45.5) 0.821 ms nyk-b3-link.telia.net (62.115.45.9) 2.461 ms nyk-b3-link.telia.net (62.115.45.5) 0.834 ms
4 * * *
5 nyk-b2-link.telia.net (62.115.137.99) 1.327 ms nyk-b2-link.telia.net (213.155.130.28) 1.974 ms nyk-b2-link.telia.net (62.115.137.99) 1.405 ms
6 viawest-ic-350578-nyk-b2.c.telia.net (62.115.181.147) 2.186 ms 2.150 ms 2.000 ms
7 be21.bbrt01.ewr01.flexential.net (148.66.237.190) 39.777 ms 39.746 ms 39.746 ms
8 be110.bbrt02.chi01.flexential.net (66.51.5.149) 40.185 ms 39.959 ms 39.881 ms
9 be10.bbrt01.chi01.flexential.net (66.51.5.117) 39.778 ms 39.736 ms 39.779 ms
10 be105.bbrt01.den05.flexential.net (66.51.5.106) 40.339 ms 40.383 ms 40.402 ms
11 be155.bbrt01.den02.flexential.net (148.66.236.209) 40.377 ms 40.045 ms 40.046 ms
12 be10.bbrt02.den02.flexential.net (148.66.237.41) 39.963 ms 40.347 ms 40.168 ms
13 po32.crsw02.den02.viawest.net (148.66.237.45) 39.870 ms 39.607 ms 39.714 ms
14 te7-1.aggm02.den02.flexential.net (148.66.236.227) 40.234 ms 40.020 ms 40.077 ms
15 usr3-ppp20.lvdi.net (216.24.140.148) 39.712 ms 39.483 ms 39.447 ms
16 * * *
Yes, the downside is that we're also measuring the router's CPU usage (routers generate ICMP replies in software, outside the fast path). This means that extra statistical analysis would be necessary.
(plus it can be hard to tell whether the latency is at the first or the last hop unless you can measure both directions)
For reference, the ping/pong packets added in https://github.com/solana-labs/solana/pull/12794 could be used for this purpose. We already maintain timestamps of pings for rate-limiting purposes: https://github.com/solana-labs/solana/blob/83799356d/core/src/ping_pong.rs#L33-L35 and could compare them against the instant the pong packet arrives (see the sketch below).
As for latency, it should be fairly low (~100-150 ms) across the mainnet-beta/testnet clusters; I got that number from turbine propagation.
The more prominent networking condition would be packet drops. I'm planning to look at this more deeply.
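A minimal sketch of that bookkeeping, with a u64 standing in for the actual ping token type (the real ping_pong.rs structures may differ):

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical RTT bookkeeping on top of the existing ping/pong messages:
// remember when each ping token was sent, then take the elapsed time
// when the matching pong arrives.
#[derive(Default)]
struct RttTracker {
    outstanding: HashMap<u64, Instant>, // ping token -> send time
}

impl RttTracker {
    fn on_ping_sent(&mut self, token: u64) {
        self.outstanding.insert(token, Instant::now());
    }

    fn on_pong_received(&mut self, token: u64) -> Option<Duration> {
        self.outstanding.remove(&token).map(|sent_at| sent_at.elapsed())
    }
}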
@leoluk we (bloXroute) are just starting to expand to Solana, but we have years of experience measuring network performance at very granular levels (it matters a lot for DeFi traders)
Happy to jam and maybe collaborate if you’re interested
Problem
End-to-end latency and geographical clustering have a huge impact on cluster performance, but they are impossible to reason about without measuring real end-to-end network latency (scenic routing, asymmetric routes, congested peerings, and other unexpected network topologies).
Proposed Solution
Implement an additional UDP server on its own port, such that it has its own queue (perhaps a good use for the storage port?). This service implements a simple, stateless echo request/reply mechanism.
The node sends echo requests to every other node in the network at a low-but-reasonable interval (50ms? 100ms?), compresses the measurements and makes them available in a yet-to-be-determined fashion.
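A minimal sketch of both halves, assuming a placeholder packet format (an 8-byte nonce) and timeout; none of this is a finalized protocol:

use std::net::UdpSocket;
use std::time::{Duration, Instant};

// Server side: reflect every datagram back to its sender unchanged.
// Stateless, so a flood can at worst fill this one dedicated queue.
fn run_echo_server(bind_addr: &str) -> std::io::Result<()> {
    let socket = UdpSocket::bind(bind_addr)?;
    let mut buf = [0u8; 64];
    loop {
        let (len, src) = socket.recv_from(&mut buf)?;
        socket.send_to(&buf[..len], src)?;
    }
}

// Client side: send a nonce, wait for the echo, and time the round trip.
fn probe_rtt(socket: &UdpSocket, peer: &str, nonce: u64) -> std::io::Result<Duration> {
    socket.set_read_timeout(Some(Duration::from_millis(500)))?;
    let sent_at = Instant::now();
    socket.send_to(&nonce.to_le_bytes(), peer)?;
    let mut buf = [0u8; 8];
    loop {
        let (len, _src) = socket.recv_from(&mut buf)?;
        // Discard stray or stale replies; only the matching nonce counts.
        if len == 8 && u64::from_le_bytes(buf) == nonce {
            return Ok(sent_at.elapsed());
        }
    }
}

A per-peer sliding window of these samples could then be compressed and exported as described above.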