Open scottyeager opened 1 month ago
Here's another look at some of the data, this time for a bit more of a "real world" test. I'm also executing both ping and SSH probes against two VMs that are running on Zos nodes in the GreenEdge St Gallen farm. Anecdotally there seem to be intermittent issues with SSH connections to Zos VMs over Mycelium, so the intention here was to try to capture that in action.
Here are the SSH probe results via Mycelium to one VM over a ten-day period:
Here's the ping performance via Mycelium to the same VM for the same period:
Next I'll show ping performance over Mycelium to the four public nodes closest to St Gallen:
Also pings to those public nodes over IPv4:
I'm not sure which nodes exactly are involved in routing to this VM. All I can see in the Mycelium logs is that the next hop for the route is the public node US West. Here's the ping graph for both Mycelium and IPv4 for that node:
I suppose the traffic could also be traversing the US East node before crossing the pond, so for good measure:
In summary, there are some periods with significant loss of SSH messages over Mycelium, and these seem to correlate to some degree with ping packet loss over Mycelium as well.
There are quite a few things going on here, so I'll start by trying to explain a few of the behaviors seen. The current mycelium transports are TCP and QUIC reliable channels, which behave like TCP (ordered, reliable delivery with acknowledgements and congestion control). This is actually not great for an overlay network, since traffic in the overlay assumes it is running on IP, which is lossy by nature. It would be better if we could use UDP, though right now plain UDP is not really feasible since we rely on the reliable semantics for the protocol messages. I have some work on a branch that uses QUIC datagrams for the actual data, though it seems that needs more tuning to be useful.
Packet loss in the underlay, when using these transports, generally translates to higher latencies due to retransmission, though it is interesting to see that packet loss is worse in the overlay.
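To give a rough feel for how underlay loss turns into latency on a reliable transport, here is a minimal sketch. The model is an assumption for illustration only and is not taken from the mycelium code: each transmission is lost independently with probability `p`, and every loss costs one fixed retransmission timeout `rto_ms`. Real TCP and QUIC stacks use fast retransmit, adaptive timers, and congestion control, so the numbers are order-of-magnitude at best.

```python
def expected_extra_delay_ms(p: float, rto_ms: float) -> float:
    """Mean added delay per packet when each loss costs one fixed timeout.

    The number of failed attempts before the first success is geometric
    with mean p / (1 - p), so the mean added delay is rto_ms * p / (1 - p).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return rto_ms * p / (1.0 - p)


def delay_after_losses_ms(base_rtt_ms: float, rto_ms: float, losses: int) -> float:
    """Delivery time for a packet that is lost `losses` times in a row."""
    if losses < 0:
        raise ValueError("losses must be non-negative")
    return base_rtt_ms + losses * rto_ms


if __name__ == "__main__":
    # Hypothetical values: 200 ms timeout, a few underlay loss rates.
    for p in (0.01, 0.05, 0.10):
        extra = expected_extra_delay_ms(p, rto_ms=200.0)
        print(f"loss={p:.0%} mean extra delay={extra:.1f} ms")
```

Because delivery is ordered, a packet waiting on a retransmission also holds back everything queued behind it, so the per-packet average understates the spikes a ping or SSH probe would see.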
Latency spikes are generally somewhat expected, since mycelium is a userspace process, whereas general network handling is done in the kernel (especially for ping). We are also currently doing continuous network throughput testing, which causes spikes in packets handled roughly every 5 minutes and translates to somewhat higher latency during these periods.
For SSH sessions, part of this is likely due to TCP meltdown (the negative effect of packet loss with TCP in TCP, where the overlay and underlay sessions each independently try to recover the lost packets and end up fighting each other).
In general, some more testing will have to be done in non-optimal network conditions (high latency with some minor packet loss) to see how the network behaves, but this is an interesting starting point.
I've been collecting some data using SmokePing to get a sense for Mycelium performance from my perspective at home. A description of the methodology and everything needed to reproduce my approach is on this repo.
In summary, I'm connecting to all public Mycelium nodes via IPv4 TCP and pinging them periodically both over Mycelium and over IPv4. This was meant mostly to be a benchmark to evaluate other Mycelium hosts against, but it's revealed some trends that I think are worth highlighting. For most of the public nodes, I observe large variations in latency over Mycelium versus regular IPv4.
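For anyone who wants to compare their own probes against these graphs, here is a minimal sketch of how one burst of pings can be reduced to a median and a loss fraction, which are the two quantities discussed throughout this thread. The function name and the input shape (a list of round-trip times in milliseconds, with `None` for a probe that got no reply) are assumptions for illustration, not SmokePing's internal format.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_round(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one burst of probes; None marks a probe with no reply."""
    received = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    # Loss is the fraction of probes that never got a reply.
    loss = (sent - len(received)) / sent if sent else 0.0
    return {
        "sent": sent,
        "loss": loss,
        "median_ms": median(received) if received else None,
        "min_ms": min(received) if received else None,
        "max_ms": max(received) if received else None,
    }


if __name__ == "__main__":
    # Hypothetical burst: four probes, one lost.
    print(summarize_round([10.0, None, 30.0, 20.0]))
```

Running the same summary on the Mycelium address and the IPv4 address of one node, at the same time, gives a like-for-like comparison of overlay and underlay.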
Here are some high level graphs, first showing the IPv4 ping performance to the public nodes. The line represents median ping time, with the "smoke" representing deviations from the median:
We can see that latency to these nodes is basically flatline with occasional minor deviations. This sample is representative of the data I've collected so far.
Here's the same view, but for pings sent over Mycelium:
Sometimes the behavior over Mycelium seems to be related to an issue also seen on regular IPv4, but sometimes not. Here's an example from the SG node of a rather substantial latency spike on Mycelium:
But IPv4 looks rather clean over the same time period:
Here's a case where the issue observed on IPv4 seems to be amplified over Mycelium. Here we see relatively high packet loss in purple and pink:
Versus relatively low packet loss over IPv4 at the same time:
Here's a longer window, showing the large latency swings and a period of packet loss on Mycelium:
Versus IPv4:
So what I see overall is that median latency over Mycelium can vary by 100% hour to hour for directly connected public peers, while the medians over IPv4 to the same nodes tend to vary by no more than 5-10%.
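One way to put a number on "hour to hour" variation is sketched below. It assumes the samples are available as `(unix_timestamp, rtt_ms)` pairs and reads the variation as the relative change between the medians of consecutive hours. Both the input format and that reading are assumptions for illustration; the figures above were read off the graphs, not computed this way.

```python
from statistics import median
from typing import Dict, Iterable, Tuple


def hourly_medians(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Group (unix_ts, rtt_ms) samples by hour and take each hour's median."""
    buckets: Dict[int, list] = {}
    for ts, rtt in samples:
        buckets.setdefault(int(ts // 3600), []).append(rtt)
    return {hour: median(vals) for hour, vals in sorted(buckets.items())}


def max_hourly_change(medians: Dict[int, float]) -> float:
    """Largest relative change between medians of consecutive hours."""
    hours = sorted(medians)
    changes = [
        abs(medians[b] - medians[a]) / medians[a]
        for a, b in zip(hours, hours[1:])
        if b - a == 1 and medians[a] > 0  # skip gaps in the data
    ]
    return max(changes, default=0.0)


if __name__ == "__main__":
    # Hypothetical samples: median doubles from one hour to the next.
    data = [(0, 100.0), (10, 100.0), (3600, 200.0), (3700, 200.0)]
    print(max_hourly_change(hourly_medians(data)))
```

A result of 1.0 corresponds to a 100% change between adjacent hours, and 0.05 to 0.10 corresponds to the 5-10% range.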
It also appears that small amounts of packet loss on the underlay network get amplified into larger packet loss over Mycelium.
However, the latency to my closest public peer, US West, is much more stable. That could of course be a coincidence, which could be cleared up by running the same test from different locations.
It's possible that latency is shifting along with load on the public nodes, and perhaps strain on the Mycelium process could cause these results. I don't have the visibility to say whether that's happening, but it doesn't seem likely given the project's current level of exposure and the lack of clear correlation between nodes for times of high latency.