siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Control plane API stops responding #5072

Open gix opened 2 years ago

gix commented 2 years ago

Bug Report

Description

I've set up a control plane according to the docs for VMware. The only change in controlplane.yaml is a static network configuration. The node boots up correctly, but after some time it stops responding.
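
For reference, the static network change in controlplane.yaml was roughly of this shape (a sketch only: the interface name, gateway, and nameserver are hypothetical; 10.1.0.191 is the node address that appears later in the thread):

    machine:
      network:
        interfaces:
          - interface: eth0          # hypothetical interface name
            dhcp: false
            addresses:
              - 10.1.0.191/24
            routes:
              - network: 0.0.0.0/0
                gateway: 10.1.0.1    # hypothetical gateway
        nameservers:
          - 10.1.0.1                 # hypothetical nameserver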

talosctl just hangs for any command without any output. I can still ping the node, and kubectl get pods works and shows the default pods as running. The VM shows no errors in its console. Logs sent to a TCP endpoint also do not show any errors. After a reboot it seems to work again, but after running talosctl health a few times the node stops responding again within a few minutes. Even if left alone, this seems to happen after a day or so.

Logs

Not sure what I should attach here. A rebooted node doesn't seem to show logs from the previous run, and once this state is reached I cannot get any logs.

Environment

smira commented 2 years ago

It might be helpful to access console logs or video output of the VM once it is in this "hanging" state.

Talos can also stream the logs at least to the moment it hangs via talosctl dmesg -f.

My only guess is that the VM doesn't have enough resources to run the apid process, but this is a wild guess.

gix commented 2 years ago

I've doubled the resources for the test VM from the ones stated in the docs, to 8 GB memory and 20 GB disk space. After a reboot I ran a health check every 10 seconds. 13 succeeded; the 14th got stuck at 11:55:10 with:

discovered nodes: control plane: ["10.1.0.191"], worker: []
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The dmesg -f output continues afterwards, showing only NTP messages. Every newly submitted talosctl command hangs.

Output from dmesg -f: dmesg-f.log
Logs received by a TCP collector: received.log

smira commented 2 years ago

I don't see anything in the logs which might point towards the problem. My only guess is that the CNI messes up networking in some way, or some other privileged workload does.

I don't see any problem from Talos side right now.

gix commented 2 years ago

It looks like this depends on the number of connections. The VM can run unused overnight and still accept connections the next day. But a talosctl support after a reboot will hang midway through. Is there any way to get more debug output in the console (in addition to debug: true)? A debug build of Talos?

smira commented 2 years ago

We don't have any specific way to do more debugging. One way might be to schedule a privileged pod on the node via Kubernetes and try regular Linux troubleshooting to see if there's any sign of resource exhaustion. I'm not sure where to even look right now.
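
A sketch of such a privileged pod (the pod name, image, and node name are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: node-debug
      namespace: kube-system
    spec:
      nodeName: <affected-node>      # schedule onto the hanging node
      hostNetwork: true
      hostPID: true
      restartPolicy: Never
      tolerations:
        - operator: Exists           # tolerate control-plane taints
      containers:
        - name: debug
          image: alpine:3.19
          command: ["sleep", "infinity"]
          securityContext:
            privileged: true

Then kubectl exec -it -n kube-system node-debug -- sh and use the usual tools (top, free, ss, dmesg) to look for signs of resource exhaustion.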

In terms of resource usage, talosctl dashboard might help. As for the API connections, talosctl logs -f apid and talosctl logs -f machined are worth watching.
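
For example, targeting the node from the health output above (assuming a normally configured talosctl):

    talosctl -n 10.1.0.191 dashboard
    talosctl -n 10.1.0.191 logs -f apid
    talosctl -n 10.1.0.191 logs -f machined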

Filip7656 commented 9 months ago

I have a similar issue, described in #8049, with exactly the same conditions, and the Talos API also hangs. Overnight everything was fine; then the next day I ran talosctl health and it hung on the disk check. After that I couldn't run any other talosctl commands. I think it has something to do with VMware networking.

RvRuttenFS commented 5 months ago

We are also experiencing the same thing in VMware. Running talosctl services on a watch also ends up in the same state after a couple of minutes. We looked at https://github.com/siderolabs/talos/issues/8049 too and have the same log output (or lack thereof). We used OVA 1.6.6.

We tried E1000, E1000E and VMXNET3 NIC types to rule out issues there.

We managed to stop the apid process/container, and when it restarts the problem is "reset" the same way a reboot resets it... until the next day, or the next talosctl services or talosctl health.

Any other suggestions?

Filip7656 commented 5 months ago

@RvRuttenFS Have you tried older Talos versions, e.g. 1.6.0? Also try OVA 1.6.7 (released two days ago). This issue took me two weeks to resolve, and my solution was upgrading to a newer version, which had a newer Linux kernel.

RvRuttenFS commented 5 months ago

Thanks for your suggestion. Yes, we tried OVA 1.6.0, 1.6.5, 1.6.6 and 1.6.7. We also tried many settings: static IP and DHCP, the Flannel and Cilium CNIs, NTP on and off.

We think apid is somehow partially crashing (or maybe some other component behind it): https://www.talos.dev/v1.6/learn-more/components/#components

smira commented 5 months ago

If apid is crashing, you will see it in the logs.

A quick check for apid from outside is to run talosctl --endpoint IP --nodes IP version; this API should always respond as long as apid is still listening.

Filip7656 commented 5 months ago

So you have tried all the stuff I did. :/ What version of VMware are you running? Have you tried installing it from the ISO instead of the OVA? (Be sure to change the disk settings so the disks are seen by Talos.)

RvRuttenFS commented 5 months ago

@smira Ran the talosctl command and got back:

Client:
    Tag:         v1.6.2
    SHA:         26eee755
    Built:
    Go version:  go1.21.6 X:loopvar
    OS/Arch:     darwin/arm64
Server:

So no info/response there.

Looking at the log (talosctl logs apid --tail -1) shows only one older entry.

Any suggestions on where to look or what other kinds of logs might help here?

smira commented 5 months ago

I'm confused: how can you access apid logs if you can't access the API?

Look at the console/serial logs. If nothing there, I'd assume it's not Talos.

Talos does a health check on apid, so if it stops responding, it should print to the console.

RvRuttenFS commented 5 months ago

We asked ourselves the same question; that's why we said "partially" on purpose. Is there anything else we can check or confirm to make sense of this weird behavior?

After killing the apid process (through a privileged debug pod), a new apid process starts and that does reset something, as we can then run talosctl again. Same effect as a reboot of the node.
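
For anyone else hitting this, the workaround from inside a privileged hostPID pod was roughly (a sketch; pgrep comes from busybox/procps in the debug image):

    pgrep apid      # find the apid PID on the host
    kill <pid>      # a new apid process starts and talosctl responds again, until it recurs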

I also noticed 502 Bad Gateway errors sometimes when using talosctl.

Lastly, I now see I forgot to mention we use Omni SaaS - if that makes any difference.

smira commented 5 months ago

If using Omni, you have console logs available in the machine view.

And it'd be better to create an issue in the Omni repo. The next release of Omni should have an omnictl support command to generate a great support bundle.

RvRuttenFS commented 5 months ago

Seems we have found something. As I hijacked this issue for a bit, it seems fair to give an update on what caused it for us.

In our cluster patch YAML file in Omni we had debug: true set. But instead of giving us more logging, it stopped showing apid's logs (this is actually a bug!). After some time the logs that were no longer visible would fill up some buffer somewhere, and that caused apid to freeze, but only on VMware/ESXi. After we removed this debug key from the cluster patch, no more strange behavior was observed and the clusters worked again.
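
For clarity, the offending part of the cluster patch was essentially just this top-level machine-config option (a sketch of what we removed):

    debug: true     # removed from the Omni cluster patch; leaving it at the default (false) stopped the hangs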

Not sure if @Filip7656 used this debug key too, but if you did, now you know this is something that should not be used. If not, I hope you will figure out what is causing it for you.

Thanks everyone!