monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech

clusternet: lost peers #235

Closed: lorenz closed this issue 1 year ago

lorenz commented 1 year ago

At scale we see clusternet losing peers: no updated: log entries are created for them, and no connectivity to the other node is possible. The control plane seems happy, and so do the clients, so it looks like the update events get lost somewhere.

lorenz commented 1 year ago

Some logs from metropolis-01bed4ca52aa215dc936ae376ba394ce:

root.role.clusternet.pull        I0626 22:25:44.340853 clusternet.go:186] Node metropolis-9206b11773d725b9026131baa493ead3 updated: pk +f9V4SQ/IvUhnYtk6Q82J6FPDSmNODLPXa5+l3gLrEQ=, address 139.178.70.51, prefixes [10.194.37.0/24 139.178.70.51/32]
root.role.clusternet.pull        I0626 22:25:44.343201 clusternet.go:203] Successfully updated 1 out of 1 nodes
root.role.clusternet.pull        I0626 22:25:44.355429 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.355584 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.361057 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.365231 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.369086 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.373850 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.377682 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.380324 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.383977 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.387262 clusternet.go:186] Node metropolis-7af5a132d00c12254172dde47fda685f updated: pk l0D/DEzAhRitmYFmPxpVdrd34whP71UDE+n/iCMULCk=, address 145.40.87.239,  prefixes [10.194.36.0/24 145.40.87.239/32]
root.role.clusternet.pull        I0626 22:25:44.389824 clusternet.go:203] Successfully updated 1 out of 1 nodes
root.role.clusternet.pull        I0626 22:25:44.394312 clusternet.go:203] Successfully updated 0 out of 0 nodes
root.role.clusternet.pull        I0626 22:25:44.396030 clusternet.go:186] Node metropolis-c4d4c7984f7f9700424f7f7c3407432d updated: pk h1rwS0YPqlELrvM/VeI1DVrZvR+D7kwO8pLwJLA5RUk=, address 147.28.167.61, prefixes [147.28.167.61/32]

So it works, but only around 10% of the nodes showed up in the logs, and most pod-to-pod connectivity on the demo cluster is broken.

q3k commented 1 year ago

I haven't been able to replicate this locally yet. For now I've fixed some logging and spurious updates: https://review.monogon.dev/c/monogon/+/1876

I'll stare at the code for a bit longer and see what I can figure out.

q3k commented 1 year ago

New cluster deployed with the additional logging; let's see what we can figure out now.

$ for n in $(mctl node list | grep -v NODE); do echo -ne $n; mctl node logs $n --dn root.role.clusternet.pull --backlog 100 | grep Total | tail -n 1 | cut -d ':' -f 5; done | tee Work/issue/235/nodes.txt
q3k commented 1 year ago

The results are here: https://bin.monogon.dev/pasta/sloth-turtle-turtle

We can now analyze the number of nodes/prefixes seen by each node.

The distribution is unusual: 81% of the nodes see the correct number of nodes in clusternet, 17% all see the same incorrect number, and the remaining few see counts close to that second number, but not exactly equal to it.

Interestingly, the number of nodes that see that second count is itself very close to the count. But that might be a red herring.

Numbers: https://bin.monogon.dev/pasta/otter-shark-worm
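
For the record, the tally itself is just a histogram over the per-node counts in nodes.txt. Assuming the node count ends up as the second whitespace-separated field (this depends on the exact format of the added Total log line), something like:

$ awk '{ print $2 }' nodes.txt | sort -n | uniq -c | sort -rn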

q3k commented 1 year ago

Let's look at one particular node in the second group, metropolis-b5b35c1327cbb8e51758145013667b5a, and figure out what other nodes it sees in clusternet:

$ mctl node logs metropolis-b5b35c1327cbb8e51758145013667b5a --dn root.role.clusternet | grep Node | awk '{ print $6 }' > seen-by-b5b35c1327cbb8e51758145013667b5a.txt

$ for n in $(cat seen-by-b5b35c1327cbb8e51758145013667b5a.txt); do grep $n nodes.txt; done

https://bin.monogon.dev/pasta/eel-ape-sloth

It almost exclusively sees nodes from the second group! I'm very tempted to turn this into a connection graph and see if there's any hint there.
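
A rough sketch of how such a graph could be built, assuming one seen-by-<id>.txt file per node (only one exists so far) and Graphviz available locally; the file and output names here are just placeholders:

$ echo 'digraph clusternet {' > seen.dot
$ for f in seen-by-*.txt; do src=${f#seen-by-}; src=${src%.txt}; sed "s|.*|  \"$src\" -> \"&\";|" "$f" >> seen.dot; done
$ echo '}' >> seen.dot
$ dot -Tsvg seen.dot > seen.svg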

q3k commented 1 year ago

Hm, one pattern emerges if we look at node start times:

$ for n in $(mctl node list | grep -v NODE); do echo -ne "$n "; mctl node logs $n --dn panichandler | grep -v "Logs from" | head -n 1 | awk '{ print $3 }'; done | tee nodetimes.txt
$ head -n 1 nodetimes.txt 
metropolis-00022e30ebd89c5e9f458a7854727a7e 19:49:15.383178
$ for n in $(cat nodes-LARGE.txt | cut -d ' ' -f 1); do grep $n nodetimes.txt | cut -d ' ' -f 2; done | sort | tail -n 1
20:06:47.420462
$ for n in $(cat nodes-SMOL.txt | cut -d ' ' -f 1); do grep $n nodetimes.txt | cut -d ' ' -f 2; done | sort | head -n 1
20:10:56.352778
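
(nodes-LARGE.txt and nodes-SMOL.txt are the per-group splits of nodes.txt from earlier; presumably produced with something like the following, where N stands for the correct node count from the paste above and the count is assumed to be the second field:)

$ awk -v n=N '$2 == n' nodes.txt > nodes-LARGE.txt
$ awk -v n=N '$2 != n' nodes.txt > nodes-SMOL.txt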

All the nodes in the correct group were started no later than 20:06:47, and all the nodes in the incorrect group were started after 20:10:56. So what happened in between? Did the controller node restart, maybe?

q3k commented 1 year ago

The control plane node has been up since 19:42:23, so it's not a restart:

$ mctl node logs metropolis-7140ef89bd887f2179fa61496dd6846c | head -n 10
=== Logs from metropolis-7140ef89bd887f2179fa61496dd6846c (XXX):
panichandler                     E0704 19:42:23.640085 panichandler.go:81] Failed to open core runtime log file: read-only file system
panichandler                     W0704 19:42:23.640104 panichandler.go:82] Continuing without persistent panic storage.
panichandler                     I0704 19:42:23.640111 panichandler.go:91] Panic console: /dev/tty0
panichandler                     I0704 19:42:23.640118 panichandler.go:91] Panic console: /dev/ttyS0
panichandler                     I0704 19:42:23.640124 panichandler.go:91] Panic console: /dev/ttyS1
init                             I0704 19:42:23.640131 main.go:111] Starting Metropolis node init
supervisor                       I0704 19:42:23.643935 supervisor_processor.go:59] supervisor processor started

Unfortunately, we don't have control plane RPC logs to say what happened between 20:06 and 20:10. :/

q3k commented 1 year ago

Let's try some more measurements. This time, let's reboot a node from each group (correct/large and incorrect/small) and see what happens after they come back up.

I'm rebooting metropolis-d53839a458903311d92a3ce180cd5d8e which belongs to the incorrect/small group.

I'm rebooting metropolis-f6acd627c28f8ddfa6c93169b6cd25f0 which belongs to the correct/large group.

q3k commented 1 year ago

Both came back up, but now they only see each other.

Okay, I think I know where to start looking for the bug.

q3k commented 1 year ago

https://review.monogon.dev/c/monogon/+/1912

Found a bug which was likely the culprit. Added a regression test for it.

lorenz commented 1 year ago

Looks very good on the new deployment; I believe this has fixed it.