Closed by coesensbert 1 year ago
To add on to this: this also impacts the freeflow.life demo environment. We have had several demos go wrong as a result of Yggdrasil connection issues.
Requested ZOS client upgrade: https://github.com/threefoldtech/zos/issues/1846
Over the last few days, we've seen this issue manifesting in reports from farmers and users of the grid: Yggdrasil connections are timing out.
To reiterate the urgency brought above: this is a showstopper for the grid.
I wonder if, at least as a temporary solution, we can regain performance by forking Yggdrasil and running Planetary Network as a separate network.
Interoperability with the larger Yggdrasil network seems to be more of a bug than a feature right now, especially if topology changes (growth) outside the grid can degrade performance inside the grid. I'm not aware of any use cases that involve communicating with servers or users in the larger Yggdrasil network, at the moment.
I discussed with @despiegk, who supports forking Planetary Network into a separate network. This would require a coordinated effort among the development groups and would need to be considered against current high priority items like power management (#1303).
I'll invite input here from @muhamadazmy, @LeeSmet, and @xmonader for opinions on the suitability and feasibility of this plan or any alternatives that should be considered. Let's find consensus and outline the steps needed to execute so we can keep the grid running smoothly.
I'm very much not a fan of creating a fork.
Firstly, creating a fork means we are not fully compatible with upstream. As a result, we can't just pull in upstream changes, meaning we need to actively maintain this fork, which is a continuous investment of development resources. Secondly, it doesn't aim to fix the root cause, but rather to work around it. Even if this works, there is absolutely no guarantee that the problem doesn't happen again later as the network gets bigger. Thirdly, and most importantly, we fundamentally don't know what the actual issue is. All we have are observations and guesses based on said observations. So before we even start doing anything, we need to either determine what the problem is, or analyze the behavior we see compared to what we know (and expect given experience in networking) and agree on at least a probable cause to try and improve/fix.
For argument's sake, the following section is an analysis of the current situation and my opinion on the matter (which is not backed by hard facts, hence opinion).
~ # yggdrasilctl getpeers
Port Public Key IP Address Uptime RX TX Pr URI
1 da6fed8f902e2973f6b28addda010951f57834282ecf812bc53f43c42c241d71 200:4b20:24e0:dfa3:ad18:129a:ea44:4bfd 1m49s 1kb 1kb 0 tcp://gent04.grid.tf:9943
2 3dad15a03508a536ba649991eada818b204110e97ecf88eeb610f5d07ca3bc57 202:1297:52fe:57ba:d64a:2cdb:3370:a92b 1m49s 1kb 1kb 0 tcp://gent02.grid.tf:9943
3 f27809e920be99256654ca6f824eae8b3db5e71d325dd2ada3a753936f14b762 200:1b0f:ec2d:be82:cdb5:3356:6b20:fb62 1m49s 1kb 1kb 0 tcp://gent01.grid.tf:9943
4 3c4fc620f9f6d1b67949169cf19eb7cd1d1742f565638370bc2df2517152b0cc 202:1d81:cef8:3049:724c:35b7:4b18:730a 1m49s 1kb 1kb 0 tcp://gent03.grid.tf:9943
5 e8dd64e0da1e1f372f4fc17e77861270e1f699451174cbf670c5e7446e12a2a8 200:2e45:363e:4bc3:c191:a160:7d03:10f3 1m49s 1kb 1kb 0 tls://[fe80::ac4b:2aff:feda:af%npub6]:53663
6 2b75d8d97dd26a74734f917469a9203634584d2dc595b91c2dc1172f7143604b 202:a451:3934:116c:ac5c:6583:745c:b2b6 1m49s 1kb 1kb 0 tls://[fe80::9831:41ff:fe9f:3860%npub6]:59662
7 519d9da200370ffa5ba44efb25d456a6351ab9af434ed02450af4f818cd42e28 201:b989:8977:ff23:c016:916e:c413:68ae 1m49s 177kb 99kb 0 tls://[fe80::f82b:68ff:fea7:f306%npub6]:4627
8 3452a169bbaabb1521ee831d613f5c6006315e42e51281e1c416d6aa5d8c0337 202:5d6a:f4b2:22aa:2756:f08b:e714:f605 1m49s 1kb 1kb 0 tls://[fe80::c09d:95ff:fe65:3108%npub6]:31001
9 691b697304b86b285e1ee4ef749d56d4dcd1e2db5972f0205dd618a5aba51f6c 201:5b92:5a33:ed1e:535e:8784:6c42:2d8a 1m49s 1kb 1kb 0 tls://[fe80::d8ff:a0ff:fe12:9de1%npub6]:43934
10 29dd4ac8128ffd6df85f15f73dcee3459731c790007cdda14f499a69d337b404 202:b115:a9bf:6b80:1490:3d07:5046:1188 1m49s 1kb 1kb 0 tls://[fe80::9c22:ff:fe6e:8fb8%npub6]:33671
11 4478e6345aa35451ecc3c93069864b1e7415a402f4051f3af77ce280c3d0bbcb 201:ee1c:672e:9572:aeb8:4cf0:db3e:59e6 1m49s 1kb 1kb 0 tls://[fe80::d462:73ff:fef0:e6dd%npub6]:57791
12 cb8311b01778df312f86811fa3465f6daef459e402dfacd20b03f701ce681754 200:68f9:dc9f:d10e:419d:a0f2:fdc0:b973 1m49s 1kb 1kb 0 tls://[fe80::415:f1ff:fe0f:c9b1%npub6]:47206
13 55b1e0707aad8e7e610f994e3e4e89026bd5c28d1c757e318109f81dcdfd8046 201:a938:7e3e:1549:c606:7bc1:9ac7:6c5 1m49s 1kb 1kb 0 tls://[fe80::f823:1ff:fe0c:66f2%npub6]:46975
14 14d906671ae880abe6c6d154cc5ac7b7a57caf903023c51fba0f76c9b3cc3ff9 203:b26f:998e:5177:f541:9392:eab3:3a53 1m49s 1kb 1kb 0 tls://[fe80::149b:c2ff:fe41:d0b9%npub6]:49839
15 1047f53b798b16b33f5bf580a2c4350ad15050f4a3249a0ab8c39c68b2b938bb 203:fb80:ac48:674e:94cc:a40:a7f5:d3bc 1m49s 1kb 1kb 0 tls://[fe80::431:54ff:fe5d:62a3%npub6]:45135
16 3aba23ca66c44e9a1433491536cf83640be391fdc4df55baf600afc003c22822 202:2a2e:e1ac:c9dd:8b2f:5e65:b756:4983 1m49s 1kb 1kb 0 tls://[fe80::d0e4:afff:feb0:7599%npub6]:2369
17 c2e40dd266d04f119258fcd700976cf9b4803c948a62dec5890b991ad29eed23 200:7a37:e45b:325f:61dc:db4e:651:fed1 1m49s 1kb 1kb 0 tls://[fe80::8034:46ff:fe36:8588%npub6]:32851
18 b6db80a4eaa43089fabf90e59191e9e49b3a4bf4bc886b84a1c80ffe3eb5eeed 200:9248:feb6:2ab7:9eec:a80:de34:dcdc 1m49s 1kb 1kb 0 tls://[fe80::9c68:58ff:fec5:f0d2%npub6]:38762
19 fc7146bc6885be43f538780253068b4becb4ba9ee6af810bfaa2414cbf68fb03 200:71d:7287:2ef4:8378:158f:ffb:59f2 1m49s 1kb 1kb 0 tls://[fe80::c6f:2bff:fe63:6317%npub6]:30776
20 dbc2c0af5258ddc6ec7040e0e1014a3561b72f3bfac8c7f471d91d082ba1e134 200:487a:7ea1:5b4e:4472:271f:7e3e:3dfd 1m49s 1kb 1kb 0 tls://[fe80::4ce9:edff:fe24:39c1%npub6]:6543
21 7bf23a44770116431606d2efecb88c7cac7539942f6fcdab924dfb0a5f84208b 201:1037:16ee:23fb:a6f3:a7e4:b440:4d1d 1m49s 1kb 1kb 0 tls://[fe80::b868:4ff:fe50:ccf5%npub6]:30655
22 f52123317578735b85448295d408872d2c4bb49b34851cc3947ebfbbc333ac50 200:15bd:b99d:150f:1948:f576:fad4:57ee 1m49s 1kb 1kb 0 tls://[fe80::ec18:f3ff:feb2:2b4c%npub6]:58064
23 2de17c2b7497875efc73205a4f961463363c6cca6bb36a81ea6ef2855eece6fb 202:90f4:1ea4:5b43:c508:1c66:fd2d:834f 1m49s 1kb 1kb 0 tls://[fe80::a0a1:bfff:fe59:272d%npub6]:1785
24 a12ef4e72a842d2be3b8c5095ff9e55777768554f98f2a577cbf8e70353bcd01 200:bda2:1631:aaf7:a5a8:388e:75ed:400c 1m49s 1kb 1kb 0 tls://[fe80::ccc1:81ff:fef4:c9ff%npub6]:18940
25 b1daeb29c8fa2e0827c01119aba160b9ddc464026703fe5fd533319f3c15e441 200:9c4a:29ac:6e0b:a3ef:b07f:ddcc:a8bd 1m49s 1kb 1kb 0 tls://[fe80::c4e4:9eff:fefd:8513%npub6]:61754
26 d365bf3a0dc60c1238ee270624d9d30ea526280d30a01207d8bd462187f6703f 200:5934:818b:e473:e7db:8e23:b1f3:b64c 1m49s 1kb 1kb 0 tls://[fe80::181b:cfff:fe20:63%npub6]:49725
27 1104f819e655b0e4448ff0542aebe3a5547b9fc4d094cc1943d196b3d134474f 203:efb0:7e61:9aa4:f1bb:b700:fabd:5141 1m49s 1kb 1kb 0 tls://[fe80::8aa:abff:fe74:6ff5%npub6]:18301
28 fbd5862cb73b02ab76e51851bb45c66b4dedf4eff5c862a6e5a798cf76b7486a 200:854:f3a6:9189:faa9:1235:cf5c:8974 1m49s 1kb 1kb 0 tls://[fe80::940f:40ff:fe98:768d%npub6]:11041
29 9edf0e72efbd83d5feb5e47dc57afc8772e7f574b04c1126d471404a4655a4d1 200:c241:e31a:2084:f854:294:3704:750a 1m49s 1kb 1kb 0 tls://[fe80::5808:d0ff:fe5f:8b51%npub6]:14163
30 1585bbcdeea49598958fb22ff3eb22d61c5152ffc98ec50abc809b3f437f2c4b 203:a7a4:4321:15b6:a676:a704:dd00:c14d 1m49s 1kb 1kb 0 tls://[fe80::10e1:d3ff:fe31:6d1e%npub6]:45914
31 89c700a772b1733cbcc1bb2b777c88dbc9d0ef818f8681646c62f61283e292f6 200:ec71:feb1:1a9d:1986:867c:89a9:1106 1m49s 1kb 1kb 0 tls://[fe80::f8a7:edff:fe3a:75fb%npub6]:5827
32 198ee8b6bf9b4266209145d340be012838223d6e20d31c96baa72f5709a86197 203:6711:7494:64b:d99d:f6eb:a2cb:f41f 1m49s 1kb 1kb 0 tls://[fe80::7433:5bff:fe6d:4a51%npub6]:56107
33 529ade78b8400ae80ac8d7a03e5e099f1356ac5403f1031b8c7ffd824788c0d7 201:b594:861d:1eff:d45f:d4dc:a17f:687 1m49s 1kb 1kb 0 tls://[fe80::b40b:faff:fedb:7a7c%npub6]:54498
34 5231fef6f699dacce7f0325b437f1f9453f24fc9ed4864b0d2a3e422ea538462 201:b738:424:2598:94cc:603f:3692:f203 1m49s 1kb 1kb 0 tls://[fe80::40f9:b5ff:fe38:6188%npub6]:26648
35 96fd288d7170c29ca101fb1ef6db95f1f8d73342ede478358a3555073a795179 200:d205:aee5:1d1e:7ac6:bdfc:9c2:1248 1m49s 1kb 1kb 0 tls://[fe80::a832:1ff:fe2d:93c%npub6]:3001
36 e2c64b37cd1f78c1f78eebb4d7ccf490342501feeefb9ec818ec17e0ef4333c5 200:3a73:6990:65c1:e7c:10e2:2896:5066 1m49s 1kb 1kb 0 tls://[fe80::d8fd:3fff:fee5:f485%npub6]:42031
37 7009452e510f4a58b8ccd280747fb6e21720b66f9a4cca2b910a741907c78377 201:3fda:eb46:bbc2:d69d:1ccc:b5fe:2e01 1m49s 1kb 1kb 0 tls://[fe80::b402:79ff:fe7f:c01%npub6]:21702
38 559099220c63bf504969dbc57be52194555b5a3828df2a05d1427f23cc8148e8 201:a9bd:9b77:ce71:2be:da58:90ea:106b 1m49s 1kb 1kb 0 tls://[fe80::bc70:e3ff:fe9a:18b7%npub6]:10527
39 29ba131ad92b011f01ae0b67c8f0718f8d6688412f9d7e2020722c15f23e296f 202:b22f:6729:36a7:f707:f28f:a4c1:b87c 1m49s 1kb 1kb 0 tls://[fe80::c010:5eff:fe4b:d3b5%npub6]:43398
40 345995ace65d1d282047528b2c9a4d25e190007d0ba4e52fa5f900416ebd5366 202:5d33:5298:cd17:16be:fdc5:6ba6:9b2d 1m49s 1kb 1kb 0 tls://[fe80::4880:4cff:fe59:7dda%npub6]:18369
41 3f159975b2392521d1aa3611b015f0637e6cdd30bacc8bc137de761fcb38131a 202:753:3452:6e36:d6f1:72ae:4f72:7f50 1m49s 1kb 1kb 0 tls://[fe80::9c03:bdff:fe91:2d4%npub6]:10960
42 99d53b539f000af0b93771b896ac42c809ec70ba77a36eb8095bbb65366d0bfa 200:cc55:8958:c1ff:ea1e:8d91:1c8e:d2a7 1m49s 1kb 1kb 0 tls://[fe80::48f6:3cff:fe42:4ac0%npub6]:2838
43 77974766926d4d0d206218c1ad47d164fd4f36f64553e44c6d461e088b3b6237 201:21a2:e265:b64a:cbcb:7e77:9cf9:4ae0 1m49s 1kb 1kb 0 tls://[fe80::e4f6:aeff:fe5b:c9bd%npub6]:4105
44 9ebec0675aad950eec0017c02a98148707df0c011db2b59bb076707c42eeaf0f 200:c282:7f31:4aa4:d5e2:27ff:d07f:aacf 1m49s 1kb 1kb 0 tls://[fe80::c48f:37ff:feff:e1bb%npub6]:9911
45 7d4d676c4295306cbd47cf74c43d385134f6f900f3d09a1ab57a28581f31d080 201:aca:624e:f5ab:3e4d:ae0:c22c:ef0b 1m49s 1kb 1kb 0 tls://[fe80::4807:5eff:fe1d:7053%npub6]:46623
46 3c29693a6e8b3857fef031b8b2bd832272f016873cc5b65834d2ccb87cd78ea8 202:1eb4:b62c:8ba6:3d40:87e:723a:6a13 1m49s 1kb 1kb 0 tls://[fe80::b408:ecff:febd:a192%npub6]:24629
47 28f484fd9b7744177afb9787755d1aa7ca76860e15312e18cbbca32d6fad6a77 202:b85b:d813:2445:df44:2823:43c4:5517 1m49s 1kb 1kb 0 tls://[fe80::3443:a6ff:fe1d:2c8d%npub6]:20263
48 73c7c5c3d04ed5e14182a1ba06ec5d4395fd26acc597dd40fa875d8352ed53c6 201:30e0:e8f0:bec4:a87a:f9f5:7917:e44e 1m49s 1kb 1kb 0 tls://[fe80::74c0:6bff:fe8d:36a8%npub6]:42095
49 76b567e02c25b0f100325c4626da4f4690db1eedf2cf15ac74917797cee88145 201:252a:607f:4f69:3c3b:ff36:8ee7:6496 1m49s 1kb 1kb 0 tls://[fe80::b0dd:e2ff:fe0c:b557%npub6]:21492
50 b02a6992c5b129a2b784ca8b2ef9b14c007dffcc31a70b3935432f314b33e883 200:9fab:2cda:749d:acba:90f6:6ae9:a20c 1m49s 1kb 1kb 0 tls://[fe80::746d:f1ff:fe66:d7f5%npub6]:1640
51 62a9549aa9e4a4976fa9828a004172c0480229857ae36e7e891d2117b5709c0d 201:755a:ad95:586d:6da2:4159:f5d7:fefa 1m49s 1kb 1kb 0 tls://[fe80::ec59:e5ff:fefa:6ff7%npub6]:31259
52 c4e2799b54a9ebbe1831ceb8e23376cfc46ea8c0a04b45280764b3afd8065a37 200:763b:cc9:56ac:2883:cf9c:628e:3b99 1m49s 1kb 1kb 0 tls://[fe80::cca9:feff:fe4f:65cd%npub6]:28546
53 afe3ddd4922cacc07417a4031ae1fd708f4bda457172c1238cb76d1941aec4a4 200:a038:4456:dba6:a67f:17d0:b7f9:ca3c 1m49s 1kb 1kb 0 tls://[fe80::d0ee:7ff:fec4:c100%npub6]:21542
54 799eaa9d9f9030c58722d11438dadf6578659e60a40f60702e8e9f9fc53d126a 201:1985:5589:81bf:3ce9:e374:bbaf:1c94 1m49s 1kb 1kb 0 tls://[fe80::b886:b6ff:fe01:173f%npub6]:25627
55 882c5a6c783816a57a1bae0d6b11d26829ca4755f242304c46633f84711c41c7 200:efa7:4b27:f8f:d2b5:bc8:a3e5:29dc 1m49s 1kb 1kb 0 tls://[fe80::74c7:f5ff:fe16:6015%npub6]:41120
56 f08cf3df720b579d58d1346999b2d4b642c6ba22669c8b4962cea73e192ec81d 200:1ee6:1841:1be9:50c5:4e5d:972c:cc9a 1m49s 1kb 1kb 0 tls://[fe80::e08a:6eff:fecf:e2f8%npub6]:39546
57 7bee356386fe0891a41ac03897e87e3f7a5bbf0a3df70a82daa85c1389f275e7 201:1047:2a71:e407:ddb9:6f94:ff1d:a05e 1m49s 1kb 1kb 0 tcp://gw424.vienna2.greenedgecloud.com:9943
58 da6ec3990a23b55edd3a8d7e4d09177ab0c422c0d5bd801cf317b957ab4aadde 200:4b22:78cd:ebb8:9542:458a:e503:65ed 1m49s 1kb 1kb 0 tcp://gw327.salzburg1.greenedgecloud.com:9943
59 fe2dc4af3568e067a7a1013bc00ccf58bdb5270cec8210dfe0188d4841819327 200:3a4:76a1:952e:3f30:b0bd:fd88:7fe6 1m49s 1kb 1kb 0 tcp://gw298.vienna1.greenedgecloud.com:9943
60 1f660bcc78a5a81c9441a9b04210e9ff158fbf784bbc88434b1a3f8868ec00b9 203:99f:4338:75a5:7e36:bbe5:64fb:def1 1m49s 1kb 1kb 0 tcp://gw294.vienna1.greenedgecloud.com:9943
61 f585bfcd846bcebaae4da707b7093acc95225dbf171abb176aaa08b4319b2a58 200:14f4:8064:f728:628a:a364:b1f0:91ed 1m49s 1kb 1kb 0 tcp://gw306.vienna2.greenedgecloud.com:9943
62 b02a7e7968ea50fb0071ee188e7958645737638a156a79250a93af5ace6a2ffe 200:9fab:30d:2e2b:5e09:ff1c:23ce:e30d 1m49s 1kb 1kb 0 tcp://gw422.vienna2.greenedgecloud.com:9943
63 3f4227b05b4847aa3fbdfee1125e4c03141827af9ee7df59841764d6ada53b36 202:5ee:c27d:25bd:c2ae:210:8f7:6d0d 1m49s 1kb 1kb 0 tcp://gw293.vienna1.greenedgecloud.com:9943
64 e97d55fa866f71db4a9a725e65e406dc1d6167054a7c04d1663b36c981d438eb 200:2d05:540a:f321:1c49:6acb:1b43:3437 1m49s 1kb 1kb 0 tcp://gw423.vienna2.greenedgecloud.com:9943
65 a87607fa7039e1fffedfaaa11cc201e971fda1a0f131af950e3c5002b2154d7b 200:af13:f00b:1f8c:3c00:240:aabd:c67b 1m49s 1kb 1kb 0 tcp://gw307.vienna2.greenedgecloud.com:9943
66 2c3ff177189cd1d05b050f2f04fc787d8f5b442db06e8bad03ccefbaf1d8f497 202:9e00:7447:3b19:717d:27d7:8687:d81c 1m49s 1kb 1kb 0 tcp://gw304.vienna2.greenedgecloud.com:9943
67 398d132a0890beec1006dd9e67730533450959511f0e19f3d76307dc8095fafd 202:3397:66af:bb7a:89f:7fc9:130c:c467 1m49s 1kb 1kb 0 tcp://gw309.vienna2.greenedgecloud.com:9943
68 3b37580f125ea77340230d3999894d5c10bde24347eeca9889a7a459c6926f0d 202:2645:3f87:6d0a:c465:fee7:9633:33b5 1m49s 1kb 1kb 0 tcp://gw425.vienna2.greenedgecloud.com:9943
69 d66cf25bab853a6f3390dd554018c51c8f8837849f7ff00b2e7e18c3e094b1aa 200:5326:1b48:a8f5:8b21:98de:4555:7fce 1m49s 1kb 1kb 0 tcp://gw297.vienna1.greenedgecloud.com:9943
70 216693abb54853ff9a76f0f135a8f7fc84edbe67a15342529b5486561c166013 202:f4cb:62a2:55bd:6003:2c48:7876:52b8 1m49s 1kb 1kb 0 tcp://gw300.vienna2.greenedgecloud.com:9943
71 536fca6659493cd4d7f0aaf502fc91d492758a165b1f5b699a7d29b0bc032ea2 201:b240:d666:9adb:cac:a03d:542b:f40d 1m49s 1kb 1kb 0 tcp://gw328.salzburg1.greenedgecloud.com:9943
72 62f4c62846bbfcabbb61929ea7b881ace710fd12fddc53deae98151f1913f3dd 201:742c:e75e:e510:d51:1279:b585:611d 1m49s 1kb 1kb 0 tcp://gw331.salzburg1.greenedgecloud.com:9943
73 76c591e40504f5ca6bdada769ed7a150d7a4af0fe0879d9987af0ea71cd91822 201:24e9:b86f:ebec:28d6:5094:9625:84a1 1m49s 1kb 1kb 0 tcp://gw333.salzburg1.greenedgecloud.com:9943
74 d50bb3c4ec2306b32421b605551545d794b200be4ff3a0e3afbd9b7fba313596 200:55e8:9876:27b9:f299:b7bc:93f5:55d5 1m49s 1kb 1kb 0 tcp://gw330.salzburg1.greenedgecloud.com:9943
75 ac24a420c7d8d99ad9c8d0b85cb1a5eca5ed5e23c2ef15b0b58565126870a8c3 200:a7b6:b7be:704e:4cca:4c6e:5e8f:469c 1m49s 1kb 1kb 0 tcp://gw299.vienna2.greenedgecloud.com:9943
76 934b67801fafe311def958e6e804706cdf7b3718741d07f5fe06cdc006f06c90 200:d969:30ff:c0a0:39dc:420d:4e32:2ff7 1m49s 1kb 1kb 0 tcp://gw324.salzburg1.greenedgecloud.com:9943
77 b946c9fd97f7bd8b6d3000588a8b2d3c35324a870061ce79e449bc706abb6af0 200:8d72:6c04:d010:84e9:259f:ff4e:eae9 1m49s 1kb 1kb 0 tcp://gw326.salzburg1.greenedgecloud.com:9943
78 f8eae7c7ebb498fcb1f211add4ff024ff7a8cb726b5d663d1d67715a3c672e32 200:e2a:3070:2896:ce06:9c1b:dca4:5601 1m18s 2kb 1kb 0 tcp://gw313.vienna2.greenedgecloud.com:9943
79 ed9c38780c862abe03b6b3bfd0e174be5b4f7d8a3b3a5b2c7995c2636e921090 200:24c7:8f0f:e6f3:aa83:f892:9880:5e3d 1m49s 1kb 1kb 0 tcp://gw291.vienna1.greenedgecloud.com:9943
~ # yggdrasilctl getdht
Public Key IP Address Port Rest
2c3390578917b583589a77b018a2fe68dedfe8dbdfa95a3234f2fc5024ab948d 202:9e63:7d43:b742:53e5:3b2c:427f:3ae8 0 66
2c3ff177189cd1d05b050f2f04fc787d8f5b442db06e8bad03ccefbaf1d8f497 202:9e00:7447:3b19:717d:27d7:8687:d81c 66 0
~ # yggdrasilctl getsessions
Public Key IP Address Uptime RX TX
519d9da200370ffa5ba44efb25d456a6351ab9af434ed02450af4f818cd42e28 201:b989:8977:ff23:c016:916e:c413:68ae 25m29s 2mb 563kb
~ # yggdrasilctl getself
Build name: yggdrasil
Build version: 0.4.7
IPv6 address: 202:9e63:7d43:b742:53e5:3b2c:427f:3ae8
IPv6 subnet: 302:9e63:7d43:b742::/64
Coordinates: [1 70 672 33 434]
Public key: 2c3390578917b583589a77b018a2fe68dedfe8dbdfa95a3234f2fc5024ab948d
The getself output seems to indicate that the node is pretty far from the root (issues on the yggdrasil network repo seem to indicate the network is in fact rooted in its current shape). The above would lead me to think (again, without hard evidence) that the problem is fundamentally in the working of the DHT. And since this is the fundamental idea of yggdrasil itself, changing that would be a massive undertaking.
If we consider the current network topology sufficient, I think it is instead better to create a from-scratch implementation, one where we don't have a DHT. The main idea is simple: we connect to a static peer list. As explained above, all nodes are connected to all public peers. When some node wants to reach some address, it asks all its peers whether they are connected to that address, to filter the possible paths, and then picks one. It is trivial to implement a periodic ping as a kind of latency check, which can then be advertised by peers, to select the "shortest path".

The initial peer sends a request to the pub node to connect it to the remote, and the pub node sends a request on the persistent connection with the remote for said remote to initiate a new connection with the public peer. The initial node also opens a new connection with the public node, and the public node simply splices both connections together to bridge traffic. This setup also reduces the impact of malicious nodes in the network, since the public peers are configured statically and we only have 1 hop max over these public peers. Underlay encryption can be implemented with a simple self-signed TLS certificate, which can optionally be extended such that the signature is signed by the private key of the keypair used to generate the address. This way we also embed authentication in the TLS certificate. This should be sufficient for a first version and can later be improved.
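To make the splicing step concrete, here is a minimal sketch of what the public node would do once both sides have connected to it. It is purely illustrative (no handshake, TLS, or session matching), not an existing implementation:

```go
// Sketch: a public peer bridging two connections as described above.
package main

import (
	"io"
	"log"
	"net"
	"sync"
)

// bridge splices two established connections together so the public node
// simply relays bytes between the initiating node and the remote node.
func bridge(a, b net.Conn) {
	var wg sync.WaitGroup
	wg.Add(2)
	pipe := func(dst, src net.Conn) {
		defer wg.Done()
		io.Copy(dst, src) // copy until one side closes
		dst.Close()
	}
	go pipe(a, b)
	go pipe(b, a)
	wg.Wait()
}

func main() {
	// Hypothetical listener on the public peer where both sides connect
	// after the relay request has been negotiated out of band.
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		first, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		second, err := ln.Accept() // in reality, matched by a session identifier
		if err != nil {
			log.Fatal(err)
		}
		go bridge(first, second)
	}
}
```

In a real version the two connections would be matched by a negotiated session identifier and wrapped in the TLS layer described above.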
Once again I'd like to stress that the above is a suggestion based on observations and (presumably educated) guesses about the cause of the current situation; if anyone disagrees or has evidence otherwise, please add it to the discussion. Hand-rolling a new setup is also going to take development resources, has no guarantee of being better (again due to the lack of certainty about the cause of these issues), and starting from scratch is rarely the correct solution.
Thanks @LeeSmet. Yes, my proposal was intended as a temporary workaround to avoid immediate fallout from issues with Yggdrasil, namely loss of confidence in the grid as a whole when users can't create deployments. If this can buy us some time to develop a proper solution, I think it's worth considering, as a low-investment and easily reversible course of action that seems likely to return us to a better performing state. It can also help by reducing the overall complexity of the situation we're trying to analyze.
I agree that understanding the root cause is essential to finding the right long term solution. If anyone following this issue can tag in others who might be able to provide insights, please do so.
I agree with Scott; we need a solution as soon as we can.
After taking some time to investigate in a bit more detail, here are some findings:

For example, a path could look like [A B P D C], where P is the peer connected to the larger network. However, as stated, shortcuts are considered, and B should detect its connection to D, sending the packet directly to D, causing the path to become [A B D C]. In theory.

A coordinate like [1 3 5] means the node is the 5th child of the 3rd child of the first child of the root. The root is at [ ].
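To make the coordinate notation concrete, here is a rough sketch (not yggdrasil's actual code) of how worst-case path length in the tree follows from two coordinate vectors; the peer coordinates below are made up for illustration:

```go
package main

import "fmt"

// treeDistance returns the number of hops between two nodes in a spanning
// tree, given their coordinates (the path of child indices from the root).
// This illustrates why deep coordinates like [1 70 672 33 434] imply long
// worst-case paths: traffic may have to travel up to the common ancestor
// and back down.
func treeDistance(a, b []int) int {
	common := 0
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	return (len(a) - common) + (len(b) - common)
}

func main() {
	self := []int{1, 70, 672, 33, 434} // coordinates reported by getself above
	peer := []int{1, 70, 12}           // hypothetical peer elsewhere in the tree
	fmt.Println(treeDistance(self, peer)) // 3 + 1 = 4 hops via the common ancestor
}
```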
Based on the above, here are some options:

One option would give a path like [A B R D C]. In theory, since we use our own peer list, C would be connected to B as well, so the path should be [A B C], though current behaviour makes me not optimistic that this (always) works. Conversely, if one of the public nodes is the root, then the path would be [A B C] by default, since it seems all peers prefer to have the lowest key as root; and since the network root is by default the lowest key, all peers connected to it will choose it as the main root, hence no special shortcut behavior needs to be invoked. This also prevents the network from rearranging when random nodes join, as random nodes with random keys could randomly have the smallest key, thus triggering a network reorg when the node joins. This is especially bad in case it is a temporary node (e.g. a laptop) which "flickers" into and out of the network, as it would constantly cause a topology change. Lastly, I'll again reiterate the point that I am not confident this approach will solve all our issues, especially not permanently.

After some thinking, I wonder if we can (for now) make at least our pub nodes connect directly to the root node. This would reduce the depth of our nodes in the tree, meaning we need fewer hops in general for packets to be routed in the worst case. Additionally, and more importantly, this would allow us to bypass the unknown nodes which are now bridging our network to the global yggdrasil network, and might not be optimal for that.

Considering connectivity is currently rather flaky already, it's doubtful that it could get worse anyway.
The entire grid is currently reporting down in the explorer and by /status in the status bot; this appears to be Planetary related. I'm seeing no errors in the node console.
This effect coincided with a farmer mass deploying 72 nodes across 6 racks within an hour.
Good idea, if that could at least improve things a little. For that, would we add the root node to the existing peer list, or only have the root node as a peer for ZOS nodes?
It seems the root node is not publicly available, but a single node connected to it is. This is good enough, as a single-node connection means this will be a common coordinate for the worst case. This node can be configured by adding
"tls://163.172.31.60:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net",
"tls://jorropo.net:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net"
to the peer list.
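For reference, a minimal sketch of where these entries would sit in a node's yggdrasil.conf Peers section; the existing entry shown is just one example from our current list:

```
{
  Peers: [
    # existing grid public peers stay in place, e.g.
    "tcp://gent01.grid.tf:9943",
    # the root-adjacent node suggested above
    "tls://163.172.31.60:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net",
    "tls://jorropo.net:12221?key=060f2d49c6a1a2066357ea06e58f5cff8c76a5c0cc513ceb2dab75c900fe183b&sni=jorropo.net"
  ]
}
```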
This can be done for gridproxy and blackbox already; for zos it should be added to our own pub nodes, but preferably not to every node (not sure how that node will behave if it suddenly gets hit by 5K connections).
What about a set of perimeter nodes?
Could we overlay a reliable halo using a deployment co-located with the public peers, with a public IP set and no planetary network interface, then run yggdrasil and
Looking at the overall size of the yggdrasil network, there are 5900 nodes showing online. Could it be possible that when our root moves, the entire network's root is impacted because of the size of our tree vs the entire network? If we were to run the genkeys command on these perimeter nodes and get their addresses, theoretically we should be able to get their addresses down so that if the root moves, it moves to a reliable node that is LAN speed to the public node it lives with.
genkeys documentation here: https://yggdrasil-network.github.io/configuration.html
If 72 nodes from a single farm behind one IP address all connect to the public peers simultaneously, would it not create a situation where the public peers may only have connections open with that one farm, which currently has a very reliable but low-bandwidth connection? Temporarily disabling the network until new peers have naturally re-established connections outside of that farm?
Here is my thought process. At 22:52 GMT-6 (US Central) on Saturday, Michael and Thangwook started reporting what would later be discovered to be an indexer that had fallen out of sync due to an i/o timeout.
At 23:00 I found that the explorer was showing all nodes down, /status was non-functional in the bot, and I could not connect to the planetary network in the Connect app; all Yggdrasil services were unresponsive.
I could deploy workloads on testnet on US servers, but I could not deploy workloads on nodes in Europe outside of the foundation and GreenEdge; those deployments were taking two attempts and moving slowly. The errors were derivatives of:
Couldn't get free Wireguard ports for node 4406 due to Error: Request failed with status code 502 due to failed to submit message: Post "http://301:de3e:5fe6:f341:21bc:ce3d:7927:1ebc:8051/zbus-cmd": dial tcp [301:de3e:5fe6:f341:21bc:ce3d:7927:1ebc]:8051: i/o timeout
I was not able to deploy on mainnet for a short period, and then it began behaving like testnet.
I'm theorizing that this rare condition, which Michael has created multiple times since July/August when he began to have over 50 nodes, is the cause of what Lee observed when the root shifted. Michael has, I believe, 200mbit fiber to the nodes currently but plans to bring in a bigger connection.
tl;dr: in the past week we have had a condition where, I'm theorizing, this storm of connections overloads the peering structure by creating short-term feedback tunnels: the public peers end up with redundant connections to the same farm when that farm happens to have enough nodes to overwhelm the peers, and those nodes connect in fast enough succession that nodes outside of that farm have not also created new connections to provide alternative routes. If this is in fact what is happening, I believe it would temporarily cause all traffic to time out in this loop.
Or simply: large farms cycling through the open peer connections are disrupting Planetary's ability to maintain a healthy mesh. Add to this that we may have one node attaching our subnet, which is potentially as large as the entire main network; I think this is why connections have been degrading. Ultimately, Michael and Dany Sing started deploying these 25-50+ node farms around the time we started having issues.
This is my flow of how this farmer's nodes, which represent a significant outlier in farm size, may have taken down the gridproxy:
The peers hold open about 72 connections at a time each; new open connections replace the oldest tunnels.
There are currently 20 nodes on the peer list.
20 peers x 72 connections = 1,440 connections of capacity; currently every node connects to ALL peers,
so the true capacity is 72 mirrored connections (see the back-of-the-envelope sketch below).
Boots under 72 nodes could cause degradation by replacing some to most of the connections.
Boots over 72 nodes could possibly have the capacity to replace all of the public peers' connections, temporarily shutting down the grid proxy.
Saturday night was the first time Michael went significantly over 72 nodes booting in short succession (under 1 hour)....
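A back-of-the-envelope sketch of the capacity argument above, taking the ~72-connection figure as the commenter's observation rather than a documented yggdrasil limit:

```go
package main

import "fmt"

func main() {
	const (
		publicPeers  = 20 // nodes on the current peer list
		connsPerPeer = 72 // observed connections held open per public peer (assumption)
	)
	// Raw slots across all public peers.
	totalSlots := publicPeers * connsPerPeer
	// Every 3Node connects to ALL public peers, so each node consumes one slot
	// on every peer; the effective capacity is therefore bounded by a single
	// peer's slot count, not by the total.
	effectiveCapacity := connsPerPeer
	fmt.Println("raw slots:", totalSlots)                 // 1440
	fmt.Println("effective capacity:", effectiveCapacity) // 72 nodes before churn starts
}
```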
Supporting variables pointing to Michael's farm as a possible cause:
US node deployment was temporarily performing better than European, which is not the normal condition for me.
European nodes co-located with the public gateways were performing better than those that are not.
The problem has improved with time but the degradation has worsened; there is still a string of connections happening from the nodes on an interval. This could be the cause of the timeouts disappearing for short periods, as new nodes connect and those connections restore the mesh.
This is the current map of the yggdrasil network, highlighted from the perspective of GENT04.
This is our subnet, with gent04 circled.
This is what my plan of halos would make the map look like for gent04 (this was my public node, effectively peered with my nodes, 6 months ago).
This is what happens when one rogue node forces itself into the root of the network (this was my public node, misbehaving, 6 months ago).
Also, a coordinate of 1 seems to indicate the node is reachable directly from the root; this seems to happen without a peer entry for the root.
After some more investigation, it turns out there is actually a strong correlation between lag spikes on the chart of our pub nodes' RTT times and data throughput on the public interface of these nodes.
If we isolate a node on the RTT graph and its public incoming traffic on the node, we can actually see a (near) perfect correlation.
A bit unfortunate that the graphs don't line up, but the timestamps should be sufficient to see the correlation.
I queried each of our public peers to see how many peer connections they report. This is using the remote debug version of yggdrasilctl getpeers, as demonstrated by the crawler. My script can be found here.
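For context, a rough sketch of the kind of query involved; the exact admin call (assumed here to be debug_remotegetpeers, as used by the crawler) and its output format differ between yggdrasil versions, so this simply extracts 64-hex-character keys from whatever comes back rather than parsing a specific JSON schema:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

var keyRe = regexp.MustCompile(`[0-9a-f]{64}`)

func main() {
	publicPeers := []string{
		// one of the public peers from the list above; the full list would go here
		"a87607fa7039e1fffedfaaa11cc201e971fda1a0f131af950e3c5002b2154d7b",
	}
	unique := map[string]struct{}{}
	for _, key := range publicPeers {
		out, err := exec.Command("yggdrasilctl", "-json", "debug_remotegetpeers", "key="+key).Output()
		if err != nil {
			fmt.Println("query failed for", key, err)
			continue
		}
		found := keyRe.FindAllString(string(out), -1)
		fmt.Printf("%s reports %d keys\n", key, len(found))
		for _, k := range found {
			unique[k] = struct{}{}
		}
	}
	fmt.Println("unique entries across all peers:", len(unique))
}
```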
Each node from our list reports that it has exactly 1657 peers, with 2983 unique entries among them. That's more than the number of live nodes according to stats.grid.tf, which is encouraging. After running the script again about an hour later, I see the unique entries have fluctuated by about 20 but each node still reports 1657 peers. 3Nodes come and go, and there are other Yggdrasil nodes connected to our peers. Still, that's a relatively large fluctuation that could suggest there is some "fighting" among nodes to connect to the public peers.
I also checked how many of the peers from our list appeared in the lists returned from each of them. Since there are 31 total nodes in our list and they are all configured to connect to each other, we might expect to see values at or close to 30. Instead, I see values between 5-20. Also of note, these values fluctuate between immediately subsequent executions of the script, suggesting that the connections between the public peers are rather dynamic.
Finally, I checked how many of the peers also reported my machine as a peer in their reply. Across several executions over the course of a few minutes, I saw this figure start at 16, rise to 21, fall to 14, then rise back to 18.
If this data is accurate (i.e. the remote debug feature actually retrieves a full peer list from each remote), then the topology even within our subnet is shifting rapidly, minute to minute. The most concerning part to me is that our public peers don't seem to be maintaining stable connections to each other. If Yggdrasil nodes are limited to 1657 peers, and connections start to churn when more peers are attempting to connect, then we could expect to see increasing performance degradation as the size of Grid 3 grows beyond that limit. A large farm joining all at once could indeed also make the situation temporarily much worse.
I think it could be worth opening an issue for the Yggdrasil team with these findings. It's very possible that no other nodes in the network have this many peers attempting inbound connections (I queried the node one hop away from the root that's listed above, and it had ~200 peers), so there could be a scaling issue that Yggdrasil will need to solve anyway and would also affect a fork.
Strictly observational, but I found when deploying my public nodes that connections to over about 50 peers would start rotating connections, and when I crossed about 75 my nodes would crash on boot when they attempted to connect to all of them at once.
Realistically, yggdrasil isn't designed to have a centralized exchange; the concept is for nodes to peer with other local nodes around them, furthering the mesh by the peers being connected almost in a cascade across a geographical area.
I think we should aim for each of the public peers to be able to maintain about 50 connections via their peer lists. To do this we have to consider both the peers we add to the list and the ones that will be found over multicast.
We could functionally make this happen in the short term by creating small cell interchanges.
Ultimately I think having cells of 22 co-located peer clusters is the answer, to work with what we have until a solution is found for the limitations we're hitting.
So:
We have 22 peers in Gent, isolated on a VLAN from all other nodes running yggdrasil.
We have 22 peers in Salzburg, isolated on a VLAN from all other nodes running yggdrasil.
We have 22 peers in St. Gallen, isolated on a VLAN from all other nodes running yggdrasil.
Each cluster of 22 peers will find its co-located partners over multicast peering.
Gent peers to Salzburg; St. Gallen peers to Salzburg.
Salzburg peers to the odd Gent nodes and the even St. Gallen nodes.
This would make Gent's peer list: Salzburg 1-22 configured, Gent 1-22 by multicast.
This would make the St. Gallen peer list: Salzburg 1-22 configured, St. Gallen 1-22 by multicast.
The Salzburg peer list would be: 11 odd Gent nodes and 11 even St. Gallen nodes configured, 22 Salzburg nodes by multicast.
This SHOULD make all of our public peers stay reliably connected and provide enough capacity to support our current node structure; if it works, the model can scale with the network as we add new gateway farms (see the sketch below for the resulting peer lists).
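To double-check the plan, here is a small sketch that generates the configured peer list per site according to the rules above; the host names and ports are placeholders for illustration:

```go
package main

import "fmt"

// configuredPeers returns the statically configured peer list for a site
// under the cell-interchange plan sketched above: Gent and St. Gallen peer
// with all 22 Salzburg nodes, while Salzburg peers with the 11 odd Gent
// nodes and the 11 even St. Gallen nodes. Local cluster members are found
// over multicast and are therefore not listed here.
func configuredPeers(site string) []string {
	peers := []string{}
	switch site {
	case "gent", "stgallen":
		for i := 1; i <= 22; i++ {
			peers = append(peers, fmt.Sprintf("tcp://salzburg%02d.example:9943", i))
		}
	case "salzburg":
		for i := 1; i <= 22; i += 2 { // odd Gent nodes
			peers = append(peers, fmt.Sprintf("tcp://gent%02d.example:9943", i))
		}
		for i := 2; i <= 22; i += 2 { // even St. Gallen nodes
			peers = append(peers, fmt.Sprintf("tcp://stgallen%02d.example:9943", i))
		}
	}
	return peers
}

func main() {
	for _, site := range []string{"gent", "salzburg", "stgallen"} {
		fmt.Println(site, "->", len(configuredPeers(site)), "configured peers")
	}
}
```

Each site ends up with 22 configured peers plus its 21 local multicast peers, which keeps every public peer around the ~50-connection target mentioned above.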
This appears to show that adding the public peering has effectively put our subnet behind the public peers. The nodes that make up the wings are loosely associated with node IDs from the public peer list, and the 3Nodes are the large orb.
a87607fa7039e1fffedfaaa11cc201e971fda1a0f131af950e3c5002b2154d7b 200:af13:f00b:1f8c:3c00:240:aabd:c67b 1m49s 1kb 1kb 0 tcp://gw307.vienna2.greenedgecloud.com:9943
vienna2 (gw307) was the root of our network at this point in time, when things looked to be working well.
Checked some of our public nodes; they consistently have ~3K peers. While monitoring the peers of one of these, there were occasions where almost all of them got removed at once, except for the newly added root peer and 1 other. They got added again over a couple of seconds. There seems to be a correlation between this occurring and the latency spikes.
I've changed the binaries of the 2 devnet nodes to a local version with basic prometheus instrumentation tracking how many of each type of packet is sent/received, and how many peers are added/removed in general. Let's see if this has some info by tomorrow.
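For reference, a minimal sketch of what such instrumentation could look like with prometheus/client_golang; the metric names, the label values, and the places where the counters would be incremented inside yggdrasil are assumptions for illustration, not the actual patch:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	packetsReceived = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "ygg_packets_received_total",
			Help: "Packets received, by protocol packet type.",
		},
		[]string{"type"}, // e.g. "tree", "dht", "path", "traffic"
	)
	peersAdded = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "ygg_peers_added_total",
		Help: "Peer connections added.",
	})
	peersRemoved = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "ygg_peers_removed_total",
		Help: "Peer connections removed.",
	})
)

func main() {
	prometheus.MustRegister(packetsReceived, peersAdded, peersRemoved)

	// Example: a tree announcement was handled somewhere in the router.
	packetsReceived.WithLabelValues("tree").Inc()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```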
Graphs of the packets going through these nodes show a correlation between ping spikes to these nodes and "tree" packets coming in. This indicates the network is reorganizing. Also, the peer count was mostly steady (except for some instances yesterday), so that seems to be an unrelated issue. Considering that our public nodes have stable connections to each other (at least they are always connected when I get the peer lists of these public nodes on devnet), this points to the network reorganization being triggered by "external" nodes (i.e. the yggdrasil network itself, which we inherently don't control).
After some deliberation, it was decided not to fork yggdrasil at this time and to remain part of the ecosystem. We hope that future developments will increase the stability of the network, but we won't actively make changes to the current situation.
To solve the issue for the grid, which is the fact that RMB runs over yggdrasil and thus might cause temporary node instability, it was decided to introduce better proxy support, such that public nodes on the "regular" internet can proxy messages to "hidden" nodes. This removes the dependency on yggdrasil. See threefoldtech/home/issues/1373 for that.
Since this month (Nov) we started to notice some serious Yggdrasil degradation again. Mostly these are connectivity issues, but latency in general also went up by a lot. We saw these issues before and could temporarily resolve them with our peer list, but we knew in advance it would return.
Our monitoring sends pings over yggdrasil to all public nodes / ygg peers in the peer list. A performance graph of the last month:
This is time vs latency. Unavailability of a yggdrasil endpoint also happens regularly, without a clear pattern. You can visit the graph at: https://mon.grid.tf/d/000000016/blackbox-yggdrasil-icmp?orgId=1&refresh=10s&from=now-24h&to=now Keep in mind the above pings are done to the peer list nodes, so the monitoring server has these peers configured. Communication to ygg addresses not in this list is even worse.
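For context, a minimal stand-in for the kind of probe our monitoring performs: send an ICMPv6 echo over the yggdrasil interface to a peer-list node and measure the round-trip time. The real setup uses the Prometheus blackbox exporter; this sketch needs raw-socket privileges and the golang.org/x/net module, and the target address is just one example taken from the peer list output above:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv6"
)

func main() {
	target := "200:4b20:24e0:dfa3:ad18:129a:ea44:4bfd" // e.g. the gent04 peer from the list above

	conn, err := icmp.ListenPacket("ip6:ipv6-icmp", "::")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	msg := icmp.Message{
		Type: ipv6.ICMPTypeEchoRequest,
		Code: 0,
		Body: &icmp.Echo{ID: os.Getpid() & 0xffff, Seq: 1, Data: []byte("ygg-probe")},
	}
	payload, err := msg.Marshal(nil)
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	if _, err := conn.WriteTo(payload, &net.IPAddr{IP: net.ParseIP(target)}); err != nil {
		log.Fatal(err)
	}
	conn.SetReadDeadline(time.Now().Add(3 * time.Second))
	reply := make([]byte, 1500)
	if _, _, err := conn.ReadFrom(reply); err != nil {
		log.Fatal(err) // a timeout here is exactly the failure mode described above
	}
	fmt.Println("rtt:", time.Since(start))
}
```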
I would like to stress the urgency of this matter, since problems with yggdrasil reflect on all nets (main, dev, test and qa) and every service that uses yggdrasil. One could say yggdrasil is a backbone service of the grid, like TFChain is. If it does not work, the Grid does not work. This will result in mostly timeout issues like these:
Since the start of this month we also noticed these kinds of log statements in ygg client logs: https://github.com/threefoldtech/tf_operations/issues/1269#issuecomment-1317304896
Yggdrasil in general seems to degrade with each upgrade, as can be seen here: https://github.com/yggdrasil-network/yggdrasil-go/issues/978
We see no real progress on improving these issues: https://github.com/yggdrasil-network/yggdrasil-go/releases Although the last release might have improved the routing problems? -> https://github.com/yggdrasil-network/yggdrasil-go/releases/tag/v0.4.7 :
Buffers are now reused in the router for DHT and path traffic, which improves overall routing throughput and reduces memory allocations
A possible reason for all this could be points 8, 9 and 10 listed on this page: https://github.com/Arceliar/ironwood/#known-issues