Yggdrasil connectivity degradation

Since this month (Nov) we started to notice some serious Yggdrasil degradation again. Mostly these are connectivity issues, but the latency in general went up by a lot also. We saw these issues before and could temporary resolve them with our peer list but we knew in advance it would return.

Our monitoring is sending pings over yggdrasil to all public nodes / ygg peers in the peer list. A performance graph of the last month:

This is time vs latency. The unavailability of an yggdrasil endpoint also happens regularly without a clear pattern. You can visit the graph at: https://mon.grid.tf/d/000000016/blackbox-yggdrasil-icmp?orgId=1&refresh=10s&from=now-24h&to=now Keep in mind the above pings are done to the peer list nodes, so the monitoring server has these peers configured. Communication to ygg addresses not in this list is even worse.

I would like to state the urgency of this matter, since problems with yggdrasil reflect on all nets (main, dev, test and qa) and every service that uses yggdrasil. One could say yggdrasil is a backbone service of the grid, like the TFchain is. If it does not work, the Grid does not work .. This will result in mostly timeout issues like these ones:

https://github.com/threefoldtech/test_feedback/issues/321

Since the start of this month we also noticed these kinds of log statements in ygg client logs: https://github.com/threefoldtech/tf_operations/issues/1269#issuecomment-1317304896

Ygdrasil in general seems to degrade with each upgrades, as can be seen here: https://github.com/yggdrasil-network/yggdrasil-go/issues/978

We see no real progress in improving these issues: https://github.com/yggdrasil-network/yggdrasil-go/releases Although the last release might have improved the routing problems? -> https://github.com/yggdrasil-network/yggdrasil-go/releases/tag/v0.4.7 -> Buffers are now reused in the router for DHT and path traffic, which improves overall routing throughput and reduces memory allocations

A possible reason for all this could be point 8, 9 and 10 stated at this page: https://github.com/Arceliar/ironwood/#known-issues

Some essential protocol traffic requires round trips. That can be problematic when latency in the local network and latency of links in the global network differ significantly. In the extreme case, if we take the idea of a "world tree" literally, then a network with nodes on both earth and mars would be unusable even within one planet because of round trip protocol traffic that needs to go between planets. It's worth noting that the initial IP->key lookup and the crypto layer both require round trips, but that's a separate problem which is technically out of scope for the project (it's a research project on routing, not cryptography).

The current DHT is able to prevent traffic from unnecessarily leaving a subnet if there is exactly 1 gateway between the subnet and the rest of the network. Ideally, we would not want to exit a subnet just to re-enter via a different gateway. Going back to the mars example, even if the DHT was able to be constructed and kept consistent, having two gateways could case messages route from mars, to earth, and then back to mars. More realistically, if there's a local mesh network, we would like the network to be able to have multiple gateways to the internet overlay or bridges with other mesh networks, without those added links causing traffic within the local mesh to route outside and back in via a different gateway. This may turn out to be impossible (see e.g. Braess's paradox), but we can probably still do better than ignoring this completely.

Somewhat related to the above, there are some known network topologies where ironwood's stretch is terrible. Rings are the easiest example to point to: if two nodes are near each other, but on different side of the point that's directly opposite to the root, then their traffic will tend to take the long way (through the root) instead of going directly to each other. Ironwood also uses an unreasonable amount of memory on ring networks. We largely don't care about poor performance on rings, since they would lead to high path length even with a stretch-1 routing scheme. Unfortunately, ironwood performs poorly on spheres for largely the same, albeit with less memory use, which could become a major problem when building a large network on earth. There are ways to address this in the treespace routing scheme, but it's difficult to fix this in the DHT without introducing some security / denial-of-service vulnerabilities.

To add on to this: This also impact the freeflow.life demo environment. We have had several demos go wrong as a result of yggdrasil connection issues.

Over the last few days, we've seen this issue manifesting in reports from farmers and users of the grid, that Yggdrasil connections are timing out.

To reiterate the urgency brought above: this is a showstopper for the grid.

I wonder if, at least as a temporary solution, we can regain performance by forking Yggdrasil and running Planetary Network as a separate network.

Interoperability with the larger Yggdrasil network seems to be more of a bug than a feature right now, especially if topology changes (growth) outside the grid can degrade performance inside the grid. I'm not aware of any use cases that involve communicating with servers or users in the larger Yggdrasil network, at the moment.

I discussed with @despiegk, who supports forking Planetary Network into a separate network. This would require a coordinated effort among the development groups and would need to be considered against current high priority items like power management (#1303).

I'll invite input here from @muhamadazmy, @LeeSmet, and @xmonader for opinions on the suitability and feasibility of this plan or any alternatives that should be considered. Let's find consensus and outline the steps needed to execute so we can keep the grid running smoothly.

I'm very much not a fan of creating a fork.

Firstly, creating a fork means we are not fully compatible with the upstream. As a result, we can't just pull in upstream changes, meaning we need to actively maintain this fork, which is a continuous investment of development resources. Secondly, it doesn't aim to fix the root cause, but rather to work around it. Even if this works, there is absolutely no guarantee that it happens again later as the network gets bigger. Thirdly, and most importantly, we fundamentally don't know what the actual issue is. All we have are observations and guesses based on said observations. So before we even start doing anything, we need to either determine what the problem is, or analyze the behavior we see compared to what we know (and expect given experience in networking) and agree on at least a probable cause to try and improve/fix.

For arguments sake, the following section is an analysis of the current situation and my opinion on the matter (which is not backed by hard facts, hence opinion).

The latencies seen in the graph from the original post are pings from "a node" to the "public nodes", which are configured as peers. In other words, the node is actually directly connected to these nodes. Note that some of these spikes go up to about 10seconds. It is also not only "remote" peers, as even peers in the same DC occasionally have multi-second latency on a ping.
We know that our topology is a simple "everyone connects to the same peers", by virtue of the same peer list we use. Every node is connected to the set of public nodes (which is a reflective relationship, every public node is thus connected to every node). The result of this is that every node can reach every other node by using at most 1 hop.
yggdrasil (tries to) maintain a permanent connection to every configured peer (about 27 currently). Additionally it connects to other yggdrasil nodes in the same LAN and maintains a permanent connection to those.
The following is output from an yggdrasil node on devnet:

~ # yggdrasilctl getpeers
Port                               Public Key                                             IP Address                Uptime   RX      TX     Pr                       URI                      
1       da6fed8f902e2973f6b28addda010951f57834282ecf812bc53f43c42c241d71    200:4b20:24e0:dfa3:ad18:129a:ea44:4bfd  1m49s    1kb     1kb    0   tcp://gent04.grid.tf:9943                       
2       3dad15a03508a536ba649991eada818b204110e97ecf88eeb610f5d07ca3bc57    202:1297:52fe:57ba:d64a:2cdb:3370:a92b  1m49s    1kb     1kb    0   tcp://gent02.grid.tf:9943                       
3       f27809e920be99256654ca6f824eae8b3db5e71d325dd2ada3a753936f14b762    200:1b0f:ec2d:be82:cdb5:3356:6b20:fb62  1m49s    1kb     1kb    0   tcp://gent01.grid.tf:9943                       
4       3c4fc620f9f6d1b67949169cf19eb7cd1d1742f565638370bc2df2517152b0cc    202:1d81:cef8:3049:724c:35b7:4b18:730a  1m49s    1kb     1kb    0   tcp://gent03.grid.tf:9943                       
5       e8dd64e0da1e1f372f4fc17e77861270e1f699451174cbf670c5e7446e12a2a8    200:2e45:363e:4bc3:c191:a160:7d03:10f3  1m49s    1kb     1kb    0   tls://[fe80::ac4b:2aff:feda:af%npub6]:53663     
6       2b75d8d97dd26a74734f917469a9203634584d2dc595b91c2dc1172f7143604b    202:a451:3934:116c:ac5c:6583:745c:b2b6  1m49s    1kb     1kb    0   tls://[fe80::9831:41ff:fe9f:3860%npub6]:59662   
7       519d9da200370ffa5ba44efb25d456a6351ab9af434ed02450af4f818cd42e28    201:b989:8977:ff23:c016:916e:c413:68ae  1m49s   177kb   99kb    0   tls://[fe80::f82b:68ff:fea7:f306%npub6]:4627    
8       3452a169bbaabb1521ee831d613f5c6006315e42e51281e1c416d6aa5d8c0337    202:5d6a:f4b2:22aa:2756:f08b:e714:f605  1m49s    1kb     1kb    0   tls://[fe80::c09d:95ff:fe65:3108%npub6]:31001   
9       691b697304b86b285e1ee4ef749d56d4dcd1e2db5972f0205dd618a5aba51f6c    201:5b92:5a33:ed1e:535e:8784:6c42:2d8a  1m49s    1kb     1kb    0   tls://[fe80::d8ff:a0ff:fe12:9de1%npub6]:43934   
10      29dd4ac8128ffd6df85f15f73dcee3459731c790007cdda14f499a69d337b404    202:b115:a9bf:6b80:1490:3d07:5046:1188  1m49s    1kb     1kb    0   tls://[fe80::9c22:ff:fe6e:8fb8%npub6]:33671     
11      4478e6345aa35451ecc3c93069864b1e7415a402f4051f3af77ce280c3d0bbcb    201:ee1c:672e:9572:aeb8:4cf0:db3e:59e6  1m49s    1kb     1kb    0   tls://[fe80::d462:73ff:fef0:e6dd%npub6]:57791   
12      cb8311b01778df312f86811fa3465f6daef459e402dfacd20b03f701ce681754    200:68f9:dc9f:d10e:419d:a0f2:fdc0:b973  1m49s    1kb     1kb    0   tls://[fe80::415:f1ff:fe0f:c9b1%npub6]:47206    
13      55b1e0707aad8e7e610f994e3e4e89026bd5c28d1c757e318109f81dcdfd8046    201:a938:7e3e:1549:c606:7bc1:9ac7:6c5   1m49s    1kb     1kb    0   tls://[fe80::f823:1ff:fe0c:66f2%npub6]:46975    
14      14d906671ae880abe6c6d154cc5ac7b7a57caf903023c51fba0f76c9b3cc3ff9    203:b26f:998e:5177:f541:9392:eab3:3a53  1m49s    1kb     1kb    0   tls://[fe80::149b:c2ff:fe41:d0b9%npub6]:49839   
15      1047f53b798b16b33f5bf580a2c4350ad15050f4a3249a0ab8c39c68b2b938bb    203:fb80:ac48:674e:94cc:a40:a7f5:d3bc   1m49s    1kb     1kb    0   tls://[fe80::431:54ff:fe5d:62a3%npub6]:45135    
16      3aba23ca66c44e9a1433491536cf83640be391fdc4df55baf600afc003c22822    202:2a2e:e1ac:c9dd:8b2f:5e65:b756:4983  1m49s    1kb     1kb    0   tls://[fe80::d0e4:afff:feb0:7599%npub6]:2369    
17      c2e40dd266d04f119258fcd700976cf9b4803c948a62dec5890b991ad29eed23    200:7a37:e45b:325f:61dc:db4e:651:fed1   1m49s    1kb     1kb    0   tls://[fe80::8034:46ff:fe36:8588%npub6]:32851   
18      b6db80a4eaa43089fabf90e59191e9e49b3a4bf4bc886b84a1c80ffe3eb5eeed    200:9248:feb6:2ab7:9eec:a80:de34:dcdc   1m49s    1kb     1kb    0   tls://[fe80::9c68:58ff:fec5:f0d2%npub6]:38762   
19      fc7146bc6885be43f538780253068b4becb4ba9ee6af810bfaa2414cbf68fb03    200:71d:7287:2ef4:8378:158f:ffb:59f2    1m49s    1kb     1kb    0   tls://[fe80::c6f:2bff:fe63:6317%npub6]:30776    
20      dbc2c0af5258ddc6ec7040e0e1014a3561b72f3bfac8c7f471d91d082ba1e134    200:487a:7ea1:5b4e:4472:271f:7e3e:3dfd  1m49s    1kb     1kb    0   tls://[fe80::4ce9:edff:fe24:39c1%npub6]:6543    
21      7bf23a44770116431606d2efecb88c7cac7539942f6fcdab924dfb0a5f84208b    201:1037:16ee:23fb:a6f3:a7e4:b440:4d1d  1m49s    1kb     1kb    0   tls://[fe80::b868:4ff:fe50:ccf5%npub6]:30655    
22      f52123317578735b85448295d408872d2c4bb49b34851cc3947ebfbbc333ac50    200:15bd:b99d:150f:1948:f576:fad4:57ee  1m49s    1kb     1kb    0   tls://[fe80::ec18:f3ff:feb2:2b4c%npub6]:58064   
23      2de17c2b7497875efc73205a4f961463363c6cca6bb36a81ea6ef2855eece6fb    202:90f4:1ea4:5b43:c508:1c66:fd2d:834f  1m49s    1kb     1kb    0   tls://[fe80::a0a1:bfff:fe59:272d%npub6]:1785    
24      a12ef4e72a842d2be3b8c5095ff9e55777768554f98f2a577cbf8e70353bcd01    200:bda2:1631:aaf7:a5a8:388e:75ed:400c  1m49s    1kb     1kb    0   tls://[fe80::ccc1:81ff:fef4:c9ff%npub6]:18940   
25      b1daeb29c8fa2e0827c01119aba160b9ddc464026703fe5fd533319f3c15e441    200:9c4a:29ac:6e0b:a3ef:b07f:ddcc:a8bd  1m49s    1kb     1kb    0   tls://[fe80::c4e4:9eff:fefd:8513%npub6]:61754   
26      d365bf3a0dc60c1238ee270624d9d30ea526280d30a01207d8bd462187f6703f    200:5934:818b:e473:e7db:8e23:b1f3:b64c  1m49s    1kb     1kb    0   tls://[fe80::181b:cfff:fe20:63%npub6]:49725     
27      1104f819e655b0e4448ff0542aebe3a5547b9fc4d094cc1943d196b3d134474f    203:efb0:7e61:9aa4:f1bb:b700:fabd:5141  1m49s    1kb     1kb    0   tls://[fe80::8aa:abff:fe74:6ff5%npub6]:18301    
28      fbd5862cb73b02ab76e51851bb45c66b4dedf4eff5c862a6e5a798cf76b7486a    200:854:f3a6:9189:faa9:1235:cf5c:8974   1m49s    1kb     1kb    0   tls://[fe80::940f:40ff:fe98:768d%npub6]:11041   
29      9edf0e72efbd83d5feb5e47dc57afc8772e7f574b04c1126d471404a4655a4d1    200:c241:e31a:2084:f854:294:3704:750a   1m49s    1kb     1kb    0   tls://[fe80::5808:d0ff:fe5f:8b51%npub6]:14163   
30      1585bbcdeea49598958fb22ff3eb22d61c5152ffc98ec50abc809b3f437f2c4b    203:a7a4:4321:15b6:a676:a704:dd00:c14d  1m49s    1kb     1kb    0   tls://[fe80::10e1:d3ff:fe31:6d1e%npub6]:45914   
31      89c700a772b1733cbcc1bb2b777c88dbc9d0ef818f8681646c62f61283e292f6    200:ec71:feb1:1a9d:1986:867c:89a9:1106  1m49s    1kb     1kb    0   tls://[fe80::f8a7:edff:fe3a:75fb%npub6]:5827    
32      198ee8b6bf9b4266209145d340be012838223d6e20d31c96baa72f5709a86197    203:6711:7494:64b:d99d:f6eb:a2cb:f41f   1m49s    1kb     1kb    0   tls://[fe80::7433:5bff:fe6d:4a51%npub6]:56107   
33      529ade78b8400ae80ac8d7a03e5e099f1356ac5403f1031b8c7ffd824788c0d7    201:b594:861d:1eff:d45f:d4dc:a17f:687   1m49s    1kb     1kb    0   tls://[fe80::b40b:faff:fedb:7a7c%npub6]:54498   
34      5231fef6f699dacce7f0325b437f1f9453f24fc9ed4864b0d2a3e422ea538462    201:b738:424:2598:94cc:603f:3692:f203   1m49s    1kb     1kb    0   tls://[fe80::40f9:b5ff:fe38:6188%npub6]:26648   
35      96fd288d7170c29ca101fb1ef6db95f1f8d73342ede478358a3555073a795179    200:d205:aee5:1d1e:7ac6:bdfc:9c2:1248   1m49s    1kb     1kb    0   tls://[fe80::a832:1ff:fe2d:93c%npub6]:3001      
36      e2c64b37cd1f78c1f78eebb4d7ccf490342501feeefb9ec818ec17e0ef4333c5    200:3a73:6990:65c1:e7c:10e2:2896:5066   1m49s    1kb     1kb    0   tls://[fe80::d8fd:3fff:fee5:f485%npub6]:42031   
37      7009452e510f4a58b8ccd280747fb6e21720b66f9a4cca2b910a741907c78377    201:3fda:eb46:bbc2:d69d:1ccc:b5fe:2e01  1m49s    1kb     1kb    0   tls://[fe80::b402:79ff:fe7f:c01%npub6]:21702    
38      559099220c63bf504969dbc57be52194555b5a3828df2a05d1427f23cc8148e8    201:a9bd:9b77:ce71:2be:da58:90ea:106b   1m49s    1kb     1kb    0   tls://[fe80::bc70:e3ff:fe9a:18b7%npub6]:10527   
39      29ba131ad92b011f01ae0b67c8f0718f8d6688412f9d7e2020722c15f23e296f    202:b22f:6729:36a7:f707:f28f:a4c1:b87c  1m49s    1kb     1kb    0   tls://[fe80::c010:5eff:fe4b:d3b5%npub6]:43398   
40      345995ace65d1d282047528b2c9a4d25e190007d0ba4e52fa5f900416ebd5366    202:5d33:5298:cd17:16be:fdc5:6ba6:9b2d  1m49s    1kb     1kb    0   tls://[fe80::4880:4cff:fe59:7dda%npub6]:18369   
41      3f159975b2392521d1aa3611b015f0637e6cdd30bacc8bc137de761fcb38131a    202:753:3452:6e36:d6f1:72ae:4f72:7f50   1m49s    1kb     1kb    0   tls://[fe80::9c03:bdff:fe91:2d4%npub6]:10960    
42      99d53b539f000af0b93771b896ac42c809ec70ba77a36eb8095bbb65366d0bfa    200:cc55:8958:c1ff:ea1e:8d91:1c8e:d2a7  1m49s    1kb     1kb    0   tls://[fe80::48f6:3cff:fe42:4ac0%npub6]:2838    
43      77974766926d4d0d206218c1ad47d164fd4f36f64553e44c6d461e088b3b6237    201:21a2:e265:b64a:cbcb:7e77:9cf9:4ae0  1m49s    1kb     1kb    0   tls://[fe80::e4f6:aeff:fe5b:c9bd%npub6]:4105    
44      9ebec0675aad950eec0017c02a98148707df0c011db2b59bb076707c42eeaf0f    200:c282:7f31:4aa4:d5e2:27ff:d07f:aacf  1m49s    1kb     1kb    0   tls://[fe80::c48f:37ff:feff:e1bb%npub6]:9911    
45      7d4d676c4295306cbd47cf74c43d385134f6f900f3d09a1ab57a28581f31d080    201:aca:624e:f5ab:3e4d:ae0:c22c:ef0b    1m49s    1kb     1kb    0   tls://[fe80::4807:5eff:fe1d:7053%npub6]:46623   
46      3c29693a6e8b3857fef031b8b2bd832272f016873cc5b65834d2ccb87cd78ea8    202:1eb4:b62c:8ba6:3d40:87e:723a:6a13   1m49s    1kb     1kb    0   tls://[fe80::b408:ecff:febd:a192%npub6]:24629   
47      28f484fd9b7744177afb9787755d1aa7ca76860e15312e18cbbca32d6fad6a77    202:b85b:d813:2445:df44:2823:43c4:5517  1m49s    1kb     1kb    0   tls://[fe80::3443:a6ff:fe1d:2c8d%npub6]:20263   
48      73c7c5c3d04ed5e14182a1ba06ec5d4395fd26acc597dd40fa875d8352ed53c6    201:30e0:e8f0:bec4:a87a:f9f5:7917:e44e  1m49s    1kb     1kb    0   tls://[fe80::74c0:6bff:fe8d:36a8%npub6]:42095   
49      76b567e02c25b0f100325c4626da4f4690db1eedf2cf15ac74917797cee88145    201:252a:607f:4f69:3c3b:ff36:8ee7:6496  1m49s    1kb     1kb    0   tls://[fe80::b0dd:e2ff:fe0c:b557%npub6]:21492   
50      b02a6992c5b129a2b784ca8b2ef9b14c007dffcc31a70b3935432f314b33e883    200:9fab:2cda:749d:acba:90f6:6ae9:a20c  1m49s    1kb     1kb    0   tls://[fe80::746d:f1ff:fe66:d7f5%npub6]:1640    
51      62a9549aa9e4a4976fa9828a004172c0480229857ae36e7e891d2117b5709c0d    201:755a:ad95:586d:6da2:4159:f5d7:fefa  1m49s    1kb     1kb    0   tls://[fe80::ec59:e5ff:fefa:6ff7%npub6]:31259   
52      c4e2799b54a9ebbe1831ceb8e23376cfc46ea8c0a04b45280764b3afd8065a37    200:763b:cc9:56ac:2883:cf9c:628e:3b99   1m49s    1kb     1kb    0   tls://[fe80::cca9:feff:fe4f:65cd%npub6]:28546   
53      afe3ddd4922cacc07417a4031ae1fd708f4bda457172c1238cb76d1941aec4a4    200:a038:4456:dba6:a67f:17d0:b7f9:ca3c  1m49s    1kb     1kb    0   tls://[fe80::d0ee:7ff:fec4:c100%npub6]:21542    
54      799eaa9d9f9030c58722d11438dadf6578659e60a40f60702e8e9f9fc53d126a    201:1985:5589:81bf:3ce9:e374:bbaf:1c94  1m49s    1kb     1kb    0   tls://[fe80::b886:b6ff:fe01:173f%npub6]:25627   
55      882c5a6c783816a57a1bae0d6b11d26829ca4755f242304c46633f84711c41c7    200:efa7:4b27:f8f:d2b5:bc8:a3e5:29dc    1m49s    1kb     1kb    0   tls://[fe80::74c7:f5ff:fe16:6015%npub6]:41120   
56      f08cf3df720b579d58d1346999b2d4b642c6ba22669c8b4962cea73e192ec81d    200:1ee6:1841:1be9:50c5:4e5d:972c:cc9a  1m49s    1kb     1kb    0   tls://[fe80::e08a:6eff:fecf:e2f8%npub6]:39546   
57      7bee356386fe0891a41ac03897e87e3f7a5bbf0a3df70a82daa85c1389f275e7    201:1047:2a71:e407:ddb9:6f94:ff1d:a05e  1m49s    1kb     1kb    0   tcp://gw424.vienna2.greenedgecloud.com:9943     
58      da6ec3990a23b55edd3a8d7e4d09177ab0c422c0d5bd801cf317b957ab4aadde    200:4b22:78cd:ebb8:9542:458a:e503:65ed  1m49s    1kb     1kb    0   tcp://gw327.salzburg1.greenedgecloud.com:9943   
59      fe2dc4af3568e067a7a1013bc00ccf58bdb5270cec8210dfe0188d4841819327    200:3a4:76a1:952e:3f30:b0bd:fd88:7fe6   1m49s    1kb     1kb    0   tcp://gw298.vienna1.greenedgecloud.com:9943     
60      1f660bcc78a5a81c9441a9b04210e9ff158fbf784bbc88434b1a3f8868ec00b9    203:99f:4338:75a5:7e36:bbe5:64fb:def1   1m49s    1kb     1kb    0   tcp://gw294.vienna1.greenedgecloud.com:9943     
61      f585bfcd846bcebaae4da707b7093acc95225dbf171abb176aaa08b4319b2a58    200:14f4:8064:f728:628a:a364:b1f0:91ed  1m49s    1kb     1kb    0   tcp://gw306.vienna2.greenedgecloud.com:9943     
62      b02a7e7968ea50fb0071ee188e7958645737638a156a79250a93af5ace6a2ffe    200:9fab:30d:2e2b:5e09:ff1c:23ce:e30d   1m49s    1kb     1kb    0   tcp://gw422.vienna2.greenedgecloud.com:9943     
63      3f4227b05b4847aa3fbdfee1125e4c03141827af9ee7df59841764d6ada53b36    202:5ee:c27d:25bd:c2ae:210:8f7:6d0d     1m49s    1kb     1kb    0   tcp://gw293.vienna1.greenedgecloud.com:9943     
64      e97d55fa866f71db4a9a725e65e406dc1d6167054a7c04d1663b36c981d438eb    200:2d05:540a:f321:1c49:6acb:1b43:3437  1m49s    1kb     1kb    0   tcp://gw423.vienna2.greenedgecloud.com:9943     
65      a87607fa7039e1fffedfaaa11cc201e971fda1a0f131af950e3c5002b2154d7b    200:af13:f00b:1f8c:3c00:240:aabd:c67b   1m49s    1kb     1kb    0   tcp://gw307.vienna2.greenedgecloud.com:9943     
66      2c3ff177189cd1d05b050f2f04fc787d8f5b442db06e8bad03ccefbaf1d8f497    202:9e00:7447:3b19:717d:27d7:8687:d81c  1m49s    1kb     1kb    0   tcp://gw304.vienna2.greenedgecloud.com:9943     
67      398d132a0890beec1006dd9e67730533450959511f0e19f3d76307dc8095fafd    202:3397:66af:bb7a:89f:7fc9:130c:c467   1m49s    1kb     1kb    0   tcp://gw309.vienna2.greenedgecloud.com:9943     
68      3b37580f125ea77340230d3999894d5c10bde24347eeca9889a7a459c6926f0d    202:2645:3f87:6d0a:c465:fee7:9633:33b5  1m49s    1kb     1kb    0   tcp://gw425.vienna2.greenedgecloud.com:9943     
69      d66cf25bab853a6f3390dd554018c51c8f8837849f7ff00b2e7e18c3e094b1aa    200:5326:1b48:a8f5:8b21:98de:4555:7fce  1m49s    1kb     1kb    0   tcp://gw297.vienna1.greenedgecloud.com:9943     
70      216693abb54853ff9a76f0f135a8f7fc84edbe67a15342529b5486561c166013    202:f4cb:62a2:55bd:6003:2c48:7876:52b8  1m49s    1kb     1kb    0   tcp://gw300.vienna2.greenedgecloud.com:9943     
71      536fca6659493cd4d7f0aaf502fc91d492758a165b1f5b699a7d29b0bc032ea2    201:b240:d666:9adb:cac:a03d:542b:f40d   1m49s    1kb     1kb    0   tcp://gw328.salzburg1.greenedgecloud.com:9943   
72      62f4c62846bbfcabbb61929ea7b881ace710fd12fddc53deae98151f1913f3dd    201:742c:e75e:e510:d51:1279:b585:611d   1m49s    1kb     1kb    0   tcp://gw331.salzburg1.greenedgecloud.com:9943   
73      76c591e40504f5ca6bdada769ed7a150d7a4af0fe0879d9987af0ea71cd91822    201:24e9:b86f:ebec:28d6:5094:9625:84a1  1m49s    1kb     1kb    0   tcp://gw333.salzburg1.greenedgecloud.com:9943   
74      d50bb3c4ec2306b32421b605551545d794b200be4ff3a0e3afbd9b7fba313596    200:55e8:9876:27b9:f299:b7bc:93f5:55d5  1m49s    1kb     1kb    0   tcp://gw330.salzburg1.greenedgecloud.com:9943   
75      ac24a420c7d8d99ad9c8d0b85cb1a5eca5ed5e23c2ef15b0b58565126870a8c3    200:a7b6:b7be:704e:4cca:4c6e:5e8f:469c  1m49s    1kb     1kb    0   tcp://gw299.vienna2.greenedgecloud.com:9943     
76      934b67801fafe311def958e6e804706cdf7b3718741d07f5fe06cdc006f06c90    200:d969:30ff:c0a0:39dc:420d:4e32:2ff7  1m49s    1kb     1kb    0   tcp://gw324.salzburg1.greenedgecloud.com:9943   
77      b946c9fd97f7bd8b6d3000588a8b2d3c35324a870061ce79e449bc706abb6af0    200:8d72:6c04:d010:84e9:259f:ff4e:eae9  1m49s    1kb     1kb    0   tcp://gw326.salzburg1.greenedgecloud.com:9943   
78      f8eae7c7ebb498fcb1f211add4ff024ff7a8cb726b5d663d1d67715a3c672e32    200:e2a:3070:2896:ce06:9c1b:dca4:5601   1m18s    2kb     1kb    0   tcp://gw313.vienna2.greenedgecloud.com:9943     
79      ed9c38780c862abe03b6b3bfd0e174be5b4f7d8a3b3a5b2c7995c2636e921090    200:24c7:8f0f:e6f3:aa83:f892:9880:5e3d  1m49s    1kb     1kb    0   tcp://gw291.vienna1.greenedgecloud.com:9943 

~ # yggdrasilctl getdht
                           Public Key                                             IP Address                Port    Rest 
2c3390578917b583589a77b018a2fe68dedfe8dbdfa95a3234f2fc5024ab948d    202:9e63:7d43:b742:53e5:3b2c:427f:3ae8  0       66      
2c3ff177189cd1d05b050f2f04fc787d8f5b442db06e8bad03ccefbaf1d8f497    202:9e00:7447:3b19:717d:27d7:8687:d81c  66      0 

~ # yggdrasilctl getsessions
                           Public Key                                             IP Address                Uptime   RX      TX   
519d9da200370ffa5ba44efb25d456a6351ab9af434ed02450af4f818cd42e28    201:b989:8977:ff23:c016:916e:c413:68ae  25m29s   2mb    563kb

~ # yggdrasilctl getself
Build name:     yggdrasil                                                           
Build version:  0.4.7                                                               
IPv6 address:   202:9e63:7d43:b742:53e5:3b2c:427f:3ae8                              
IPv6 subnet:    302:9e63:7d43:b742::/64                                             
Coordinates:    [1 70 672 33 434]                                                   
Public key:     2c3390578917b583589a77b018a2fe68dedfe8dbdfa95a3234f2fc5024ab948d

Notice that we have 79 known peers, this corresponds roughly with the amount of connections kept open (there are slightly more open connections). There is 1 active session, and we only have 2 known DHT entries. This is odd, since we are connected to 79 known subnets directly, and 1 indirectly. In other words, a connection to a subnet hosted on a directly connected peer seems to require lookups through the DHT. Also important, the extended coordinates in the getself output seem to indicate that the node is pretty far from the root (issues on yggdrasil network repo seem to indicate the network is in fact rooted in its current shape).

The above would lead me to think (again, without having hard evidence) that the problem is fundamentally in the working of the DHT. And since this is the fundamental idea of yggdrasil itself, changing that would be a massive undertaking.

If we consider the current network topology sufficient, I think it is instead better to create a from scratch implementation. One where we don't have a DHT,. The main idea is simple: we connect to a static peer list. As explained above all nodes are connected to all public peers. When some node wants to reach some address, it asks all its peers if it is connected to the address, to filter the possible paths, and then picks one. It is trivial to implement a periodic ping as a kind of latency check, which can then be advertised by peers, to select the "shortest path". The initial peer sends a request to the pub node to connect it to the remote, and the pub node sends a request on the persistent connection with the remote for said remote to initiate a new connection with the public peer. The initial node also opens a new connection with the public node, and the public node simply splices both connections together to bridge traffic. This setup also reduces impact of malicious nodes in the network, since the public peers are configured statically and we only have 1 hop max over these public peers. Underlay encryption can be implemented with a simple self signed TLS certificate, which can optionally be extended such that the signature is signed by the private key of the keypair used to generate the address. This way we also embed authentication in the TLS certificate. This should be sufficient for a first version and can later be improved.

Once again I'd like to stress that the above is a suggestion based on observations and (presumably educated) guesses about the cause of the current situation, and if anyone disagrees or have evidence otherwise, please add this to the discussion. Hand rolling a new setup is also going to take development resources, has no guarantees of being better (again due to lack of certainty about the cause of these issues), and from scratching is rarely the correct solution.

Thanks @LeeSmet. Yes, my proposition was as a temporary workaround to avoid immediate fallout from issues with Yggdrasil, namely loss of confidence in the grid as a whole when users can't create deployments. If this can buy us some time to develop a proper solution, I think it's worth considering, as a low investment and easily reversible course of action that seems likely to return us to a better performing state. It can also help by reducing the overall complexity of the situation we're trying to analyze.

I agree that understanding the root cause is essential to finding the right long term solution. If anyone following this issue can tag in others who might be able to provide insights, please do so.

I agree with scott, we need a solution as soon as we can

After taking some time to investigate in a bit more detail, here are some findings:

The network itself is build primarily as a spanning tree. The root is arbitrary, though the code tries to select the node with highest ID (IP address) as root. Since IP is derived from public key and counts amount of leading 1's after flipping the key, the root is generally the node where the public key is the "smallest" (most amount of leading 0's).
Our network is indeed connected to the larger yggdrasil network through at least 1 peer. However this peer seems stable (hasn't switched during my observation of the network), meaning path wise, even if there is another peer connected to the largr network, all paths share a hop prefix. Basically this means that, without shortcuts, all packets are routed through this peer, at which point they should go back into our network. Now, according to comments in issues, yggdrasil should try to take shortcuts. Since this peer seems to be connected to our public nodes, and our public nodes are connected to one another, a packet from Node A connected to pub node B, going to node C connected to pub node D, would follow the path [A B P D C] where P is the peer connected to the larger network. However as stated shortcuts are considered, and B should detects its connection to D, sending the packet directly to D, causing the path to become [A B D C]. In theory.
As per the above, our network topology is essentially a subtree part of the larger global yggdrasil tree. It is in itself spanning in P. Due to P's connection to the larger network, it will always be the root of our sub network as long as it operates reliable.
The peer P is actively connecting to our public nodes and also the larger yggdrasil network. Note that the advertised peer list only includes our own nodes.
The coordinates of the node I checked changed multiple times, becoming shorter and longer throughout the day. Since a comment I found indicates that the coordinates are actually used for routing, this would mean that packets in transit through the network as the network reorganizes (thus the coordinates changing) might be lost. Hard to measure the impact here, since only the "best" coordinates are seen, but technically a node should have a whole set of coordinates, as coordinates are the index of the child slot a peer is assigned starting from the root, i.e. [1 3 5] means the node is the 5th child of the 3rd child of the first child of the root. The root is at [ ].
It would be interesting to monitor more nodes, specifically their coordinates, so we can detect coordinate changes, and then cross reference this with the latency graph from blackbox, to check if there is a correlation here.
It would also be interesting to see how many coordinate changes happen in general, to get an idea of how often the network "rebalances"
At some point during the day, the root changed to be very much not the highest ID.
There is a handshake between peers here it seems. It would be rather trivial to change this in a breaking way to prevent peering with regular yggdrasil nodes, and make our own network (assuming there is no rogue actor who modifies the code to accept both and run that of course).

Based on the above, here are some options:

Go through with forking yggdrasil and breaking the handshake, such that we are more isolated. In this scenario, it seems pretty important that our public peers generate "small" keys (leading 0's). Reasoning for this is that we want one of these to be the root, and the others to be stand by to become the root if something happens to the existing root. If we don't do this, a hidden node (possibly on a crap connection) could become the new root. Then, given node A pathing to public peer B wants to connect to node C with best public peer D, with root (on crap connection) R, the path would be [A B R D C]. In theory since we use our own peer list, C would be connected to B as well, so the path should be [A B C], though current behaviour makes me not optimistic that this (always) works. Conversely, if one of the public nodes is the root, then the path would be [A B C] by default, since it seems all peers prefer to have the lowest key as root, and since the network root is by default the lowest key, all peers connected to it will choose it as main root, hence not requiring to invoke any special shortcut behavior. This also prevents the network from rearanging when random nodes join, as random nodes with random keys could randomly have the smallest key, thus triggering a network reorg when the node joins. This is especially bad in case it is a temporary node (e.g. laptop) which "flickers" into and out of the network, as it would constantly cause a topology change. Lastly, I'll again iterate the point that I am not confident this approach will solve all our issues, especially not permanently.
Fork yggdrasil, throw out the DHT implementation, and replace it with a simple "single hop" implementation. Probably the worst option, as we aren't really sure what the actual dependency is on the DHT and we don't have anyone intimately familiar with the codebase. Also we'd probably be best to replace the handshake here to avoid weird stuff with the regular yggdrasil network.
From scratch something similar, as explained in my original comment. Here we have the advantage of not having the coordinate routing, and if a session is started to a peer (yggdrasil also has the concept of peer sessions), we can open an encrypted (tls?) session (through a peer which can efficiently splice both peer connections). Again, as stated I am generally not a fan of a from scratch implementation, and this is not scaling infinitely, but would give a reliable setup for our current configuration.

After some thining, I wonder if we can (for now) make at least our pub nodes connect directly to the root node. This would reduce the depth of our nodes in the tree, meaning we need less hops in general for packets to be routed in the worst case. Additionally, and more importantly, this would allow us to bypass the unknown nodes which are now bridging our network to the global yggdrasil network, and might be not optimal for that.

Considering connectivity is currently rather flaky already, it's doubtfull that it could get worse anyway.

The entire grid is currently reporting down In the explorer and by /status in the status bot, this appears to be planetary related. I’m showing no errors in node console.

This effect coincided with a farmer mass deploying 72 nodes across 6 racks within an hour.

https://github.com/threefoldtech/test_feedback/issues/363

Good idea there, if that could at least improve a little. For that, we would add the root node to the existing peer list? Or only have the root node as a peer for ZOS nodes?

It seems the root node is not publicly available, but instead a single node connected to it is. This is good enough as only a single node connection means this will be a common coordinate for the worst case. This node can be configured by adding

"tls://163.172.31.60:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net",
"tls://jorropo.net:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net"

to the peer list.

Can be done for gridproxy and blackbox already, for zos this should be added to our own pub nodes but preferably not every node (not sure how that will behave if that node suddenly gets bursted by 5K connections).

What about a set of permitter nodes?

could we overlay a reliable halo using a deployment co located with the public peers, with a public ip set and no planetary network interface, then run yggdrasil and

peer that workload between the public nodes and either this one peer,
or, reliable yggdrasil public peers local to ours found here (https://publicpeers.neilalexander.dev/)
i have some testing results from when i was hosting a yggdrasil public node alongside my nodes here ( https://forum.threefold.io/t/yggdrasil-nodes/2795) including a list of peers i tested have decent thoroughput
- to clairify, this hasnt been up in months because mad that box a 3node :)

looking at the overall size of the yggdrasil network, theres 5900 nodes showing online, could it be possible when our root moves, the entire networks root is impacted because of the size of our tree vs the entire network? if we were to run the gen keys commands on these perimeter nodes and get their addresses theoretically we should be able to get their address down so that if the root moves, it moves to a reliable node that is lan speed to the public node it lives with.

gen keys documentation here https://yggdrasil-network.github.io/configuration.html

if 72 nodes from a single farm behind one ip address all connect to the public peers simultaneously would it not create a situation where the public peers may only have connections open with that one farm, that currently has a very reliable but low bandwidth connection? temporarily disabling the network until new peers have naturally reestablished connections outside of that farm?

here is my thought process at 22:52 gmt-6 central us Saturday, Michael and Thangwook started reporting what would later be discovered to be that an indexer had fallen out of sync due to i/o timeout.

Michael deployed 70+ new nodes behind a single public address around this time,

at 23:00 I found that the explorer was showing all nodes down, /status was non functional in the bot, and i could not connect to the planetary network in the connect app, all Yggdrasil services were unresponsive,

i could deploy workloads on testnet on us servers / i could not deploy workloads on nodes in europe, outside of the foundation and greenedge, those deployments were taking two attempts an moving slow the errors were derivates of,

            Couldn't get free Wireguard ports for node 4406 due to Error: Request failed with status code 502 due to failed to submit message: Post "http://301:de3e:5fe6:f341:21bc:ce3d:7927:1ebc:8051/zbus-cmd": dial tcp [301:de3e:5fe6:f341:21bc:ce3d:7927:1ebc]:8051: i/o timeout

i was not able to deploy on mainet for a short period and then it began behaving as testnet
- Michaels nodes were randomly freezing at different percentages while booting during this time,
  - he was able to fix this by separating nodes into a different vlan than his other nodes
  - his nat table currently shows established connections to the public peers across all 123 nodes behind 1 ip
  - pre vlan separation, 123 nodes were multi-cast peer discovering each other

im theorizing this rare condition that Michael has created multiple times since july/august when began to have over 50 nodes, is the cause of what lee observed when the root shifted, Michael has i believe 200mbit fiber to the nodes currently but plans to bring in a bigger connection,

tl;dr

but in the past week we have had a condition where

123 nodes each were creating 20 connections each to the publics peers sometimes within 1 hour, that's 2460 new connections from a single ip address to the public peers.
123 nodes were creating creating multicast peering with each other, while possibly being the only tunnels open currently as there is only approximately 72 open at one time.

I'm theorizing that this storm of connections is overload the peering structure by creating short term feedback tunnels, where the public peers and redundant connections to the same farm, when that farm happens to have enough nodes to overwhelm the peers, and those nodes connect in fast enough succession, such that nodes outside of that farm have not also created new connections to provide alternative routes. if this is in fact what is happening, i believe it would temporarily cause all traffic to timeout in this loop.

Or simply, large farms cycling through the open peer connections, are disrupting the ability of planetary to maintain a healthy mesh add to this, we may have 1 node, attaching our subnet, which is potentially as large as the entire main network, i think this is why connections have been degrading ultimately michael and dany sing started deploying these 25-50+ node farms around the time we started having issues.

this my flow of how this farmers nodes, which represent a significant outlier in farm size may have taken down the gridproxy

the peers hold open about 72 connections at a time each, the open connections replace the oldest tunnels,
theirs currently 20 nodes on the peer list
20 peers x 72 connections 1,440 connections of capacity currently every node connects to ALL peers
true capacity is 72 mirrored connections
boots under 72 nodes could cause degradation by replacing some to most of the connections
boots over 72 nodes could possibly have the capacity to replace all of the public peers connections temporarily shutting down the grid proxy
Saturday night was the first time Michael went significantly over 72 nodes bootting in short succession (under 1 hour)....

supporting variable to Michaels farm being a possible cause

us node deployment was temporarily performing better then Europeans, not the normal condition for me
Europeans nodes co-located with the public gateways were performing better then those not.
the problem has improved with time but the degradation has worsened, there is still a string of connections happening from the nose on interval, this could be the cause of the timeouts disappearing for short periods as new nodes connect and those connection's restore the mesh.

this the current map of the yggdrasil network highlighted from the perspective of GENT04

this is our subnet, with gent 04 circled

this is what my plan of halos will make the map look like for gent04 (this was my public node, effectively peered with my nodes (6months ago)

this is what happens when one rogue node forces itself into the root of the network (this was my public node, not behaving(6 months ago)

also a coordinate of 1, seems to notate the node is accessible from the root this seems to happen without a peer entry for the root.

After some more investigation, it turns out there is actually a strong correlation between lag spikes on the chart of our pub node rtt times and data throughput on the public interface of these nodes.

If we isolate a node on the rtt graph and its public incoming traffic on the node, we can actually see a (near) perfect correlation:

bit unfortunate the graphs don't line up but timestamps should be sufficient to see the correlation.

I queried each of our public peers to see how many peer connections they report. This is using the remote debug version of yggdrasilctl getpeers as demonstrated by the crawler. My script can be found here.

Each node from our list reports that it has exactly 1657 peers, with 2983 unique entries among them. That's more than the number of live nodes according to stats.grid.tf, which is encouraging. After running the script again about an hour later, I see the unique entries have fluctuated by about 20 but each node still reports 1657 peers. 3Nodes come and go, and there are other Yggdrasil nodes connected to our peers. Still, that's a relatively large fluctuation that could suggest there is some "fighting" among nodes to connect to the public peers.

I also checked how many of the peers from our list appeared in the lists returned from each of them. Since there are 31 total nodes in our list and they are all configured to connect to each other, we might expect to see values at or close to 30. Instead, I see values between 5-20. Also of note, these values fluctuate between immediate subsequent executions of the script, suggesting that the connections between the public peers are rather dynamic.

Finally, I checked how many of the peers also reported my machine as a peer in their reply. Across a several executions in the course of a few minutes, I saw this figure start at 16, rise to 21, fall to 14, then rise back to 18.

If this data is accurate (ie, the remote debug feature actually retrieves a full peer list from each remote), then the topology even with our subnet is shifting rapidly, minute to minute. The most concerning part to me is that our public peers don't seem to be maintaining stable connections to each other. If Yggdrasil nodes are limited to 1657 peers, and connections start to churn when more peers are attempting to connect, then we could expect to see increasing performance degradation as the size of Grid 3 grows beyond that limit. A large farm joining all at once could indeed also make the situation temporarily much worse.

I think it could be worth opening an issue for the Yggdrasil team with these findings. It's very possible that no other nodes in the network have this many peers attempting inbound connections (I queried the node one hop away from the root that's listed above, and it had ~200 peers), so there could be a scaling issue that Yggdrasil will need to solve anyway and would also affect a fork.

Strictly observational, But I found when deploying my public nodes that connections to over about 50 peers would start rotating connections, and when I crossed about 75 my nodes would crash on boot when they attempted to connect to all of them at once.

Realistically yggdrasil isn’t designed to have a centralized exchange, the concept is for nodes to peer with other local nodes around them furthering the mesh by the peers being connected almost in a cascade across a geographical area.

I think we should aim for each of the public peers to be able to maintain about 50 connections by their peers list, to do this we have to consider both the peers we add to the list, and the ones that will be found over multi cast.

We could functionally make this happen in the short term by creating small cell interchanges.

ultimately I think having cells of 22 peer co located clusters is the answer to work with what we have until a solution is found to the limitations were hitting,

So,

we have 22 peers in gent isolated on a vlan from all other nodes running yggdrasil

we have 22 peers in Salzburg isolated on a vlan from all other nodes running yggdrasil

So we have 22 peers in st gallen isolated on a vlan from all other nodes running yggdrasil

each cluster of 22 peers will find its collocated partners over multi cast peering

Gent Peers->Salzburg St.Gallen peers to ->Salzburg

Salzburg peers to odd gent nodes Salzburg peers to even st Galen nodes

This would make Gents peer list Salzburg 1-22 configured Gent 1-22 by multicast

This would make the St Gallen peer list Salzburg 1-22 configured St gallen 1-22 by multicast

The Salzburg peer list would be 11 odd Gent nodes 11 even St Gallen nodes 22 Salzburg nodes by

this SHOULD make all of our public peers stay reliably connected, and provide enough capacity to support our current node structure, if it works the model can scale with the network as we add new gateway farms.

this appears to show adding the public peering effectively having put our subnet behind the public peers. the nodes that make up the wings are lose associated with node ids of the public peer list, and 3nodes are the large orb

a87607fa7039e1fffedfaaa11cc201e971fda1a0f131af950e3c5002b2154d7b 200:af13:f00b:1f8c:3c00:240:aabd:c67b 1m49s 1kb 1kb 0 tcp://gw307.vienna2.greenedgecloud.com:9943

vienna 2 was the root of our network at this point in time that things looked to be working well.

Checked some of our public nodes, they consistently have ~3K peers. While monitoring the peers of one of these, there were occasions were almost all of them got removed at once except for the newly added root peer and 1 other. They got added themselves again over a couple of seconds. There seems to be a correlations of this occurring and the latency spikes.

I've changed the binaries of the 2 devnet nodes with a local version with basic prometheus instrumentation tracking how many of each type of packet is sent/received, and how many peers are added/removed in general. Lets see if this has some info by tomorrow.

Graphs of the packets going through these nodes show a correlation between ping spikes to these nodes and "tree" packets coming in. This indicates the network is reorganizing. Also peer count was mostly steady (except for some instances yesterday), so this seems to be an unrelated issue. Considering that our public nodes have stable connections to eachother (at least they are always connected when I get the peer lists on these public nodes on devnet), this points to the network reorganization being triggered by "external" nodes (i.e. the yggdrasil network itself, which we inherently don't control).

After some deliberation, it was decided to not fork yggdrasil at this time, and remain part of the ecosystem. We hope that future developments will increase the stability of the network, but won't actively make changes to the current situation.

To solve the issues for the grid, which is the fact that RMB runs over yggdrasil and thus might cause temporal node instability, it was decided to introduce better proxy support, such that public nodes on the "regular" internet can proxy messages to "hidden" nodes. This removes the dependency on yggdrasil. See threefoldtech/home/issues/1373 for that.

threefoldtech / test_feedback

Yggdrasil connectivity degradation #367