Really high CPU usage

antifuchs commented 11 months ago

I've been using the "tailscale with features" add-on for a few months, but recently I've noticed that my hassos VM is occupying a whole CPU core, with tailscaled pegged at 100% CPU usage.

I'm not sure where that comes from: the logs are quiet and the proxied hass VM responds well. I cannot reproduce this with the mainline "Tailscale" plugin; with it, the CPU usage stays under 5%.

Please let me know what else you need to debug this issue, I'll do the best I can to provide you with data (however, I switched to the mainline plugin as that supports configuring a proxy these days...)

antifuchs commented 11 months ago

Oops, this happens in the mainline Tailscale add-on too, if you disable user-mode networking: I'd left it at the default, which is "on", but the UI shows the setting as off. Turning it on and then off properly disables user-mode networking, and then both add-ons exhibit the enormous CPU usage behavior.

lmagyar commented 11 months ago

I really don't know; this seems to be a tailscale issue. Maybe some "strange" routing to the tailscale0 interface causes high load, but I'm just guessing. As far as I remember, the VM build of the OS has slightly different kernel features turned on or off.

If you set the add-on's log_level to debug, the tailscaled logs will not be cut off after the first 200 lines. Maybe you can then see what is going on.

Note: This forked add-on is currently nearly identical to the original (my changes finally got merged), so I planned to slowly abandon it, but the original add-on is lagging behind new tailscale releases again. This add-on can export the certs and has more proxy/funnel configuration options, but is otherwise identical.

antifuchs commented 11 months ago

Thanks for the quick response & sorry for assigning blame to your add-on; it really is a universal issue. I can't tell what's going on from the logs yet, but I'm setting up supervisor ssh access & will investigate the tailscaled process more closely from there.

lmagyar commented 11 months ago

No problem. :)

Note: You can access the tailscale add-on's container from the ssh add-on with docker exec -it `docker ps -q -f name=tailscale` /bin/bash

Note: To get access to the OS, see: https://developers.home-assistant.io/docs/operating-system/debugging/ The ssh add-on is only a docker container; you can see quite a lot from it, but access to the underlying OS is on a different port. On the OS you can see past logs (even after several restarts) with something like journalctl CONTAINER_NAME=addon_xxxxxxxx_tailscale | tail -n 1000, where xxxxxxxx is 09716aab for my add-on and a0d7b954 for the official add-on.
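The two notes above can be wrapped in small helpers. This is a minimal sketch, not part of either add-on: the function names are made up for illustration, while the commands, the addon_<prefix>_tailscale naming pattern, and the two prefixes are taken from the comment.

```shell
#!/bin/sh
# Hypothetical helpers; only the commands and the name pattern come from the thread.

# Build the journald container name, e.g. addon_09716aab_tailscale.
addon_container_name() {
    printf 'addon_%s_tailscale\n' "$1"
}

# Run on the HA OS host: print the last N (default 1000) log lines of the add-on.
addon_logs() {
    journalctl "CONTAINER_NAME=$(addon_container_name "$1")" | tail -n "${2:-1000}"
}

# Run from the ssh add-on: open a shell inside the Tailscale add-on container.
addon_shell() {
    docker exec -it "$(docker ps -q -f name=tailscale)" /bin/bash
}
```

For example, `addon_logs 09716aab` would show the fork's logs and `addon_logs a0d7b954 200` the last 200 lines of the official add-on's logs.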

antifuchs commented 11 months ago

I got somewhat further: it looks like tailscaled is trying to listen on :443 on the tailnet IPv6 address and getting EADDRNOTAVAIL back, while listening on :443 on the IPv4 tailnet address. So something is wonky here, and I think this is probably a tailscaled bug.
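One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: bind() generally returns EADDRNOTAVAIL when the requested address is not configured on a local interface, so it may help to check whether the tailnet IPv6 address is actually assigned. The helper name is hypothetical, and `tailscale ip -6` and `ip -o -6 addr show` are only the assumed sources of its input.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the IPv6 address $1 appears in the
# `ip -o -6 addr show` output supplied on stdin (matched as " <addr>/").
addr_is_assigned() {
    grep -F -q " $1/"
}

# Assumed usage inside the add-on container:
#   ip -o -6 addr show | addr_is_assigned "$(tailscale ip -6)" \
#       || echo "tailnet IPv6 address is not assigned to any interface"
```

If the check fails while tailscaled keeps retrying the listen, that would be consistent with the busy loop described here, though the thread does not confirm the cause.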

I'll report that to the upstream folks & hopefully they'll be able to diagnose and fix that (:

elcajon commented 11 months ago

I noticed the same thing. It seems to be caused by the combination of tailscale serve reset followed by another tailscale serve (--bg) localhost:8123.
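The sequence described above can be written down as a small reproduction script. This is a sketch under assumptions: the function name and the TAILSCALE override variable are made up for illustration, and localhost:8123 is the target mentioned in the comment.

```shell
#!/bin/sh
# TAILSCALE is a hypothetical override; set TAILSCALE=echo for a dry run.
TAILSCALE="${TAILSCALE:-tailscale}"

# Reset the serve config, then start a new background serve, as described above.
repro_serve_sequence() {
    $TAILSCALE serve reset
    $TAILSCALE serve --bg localhost:8123
}
```

After running it inside the add-on container, the tailscaled CPU usage can be watched with a tool such as top to see whether the load appears.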

At least for me there are no CPU problems when Proxy is deactivated. I have not tested the combination with Funnel.

For reference: https://github.com/tailscale/tailscale/issues/10320#issuecomment-1842983961

elcajon commented 11 months ago

Just out of interest @lmagyar, do you also use the proxy with the tailscale0 interface (without userspace networking) without any CPU problems between add-on restarts? If it works for you, what does your HA OS setup look like?

I find it quite interesting that the problem hasn't really been reported yet. I mentioned it a few weeks ago, but no new issue has come up since.

lmagyar commented 11 months ago

I'm using HA OS on real Raspberry Pi 3s. I ran some tests just now and can't reproduce this.

I've tested 32-bit and 64-bit HA OS on a Raspberry Pi 3, with and without userspace networking (proxy and funnel were always enabled) and across restarts, with no problem. With and without IPv6 there was no problem either (though the IPv6 wasn't a real network, only an fe80:: link-local address).

I still think this is somehow HA OS VM related.

elcajon commented 11 months ago

I'm afraid you're right... I'll experiment with other VM settings in my Proxmox later today. If that doesn't help, it's probably the qcow2 image. But let's wait and see what the tailscale devs say upstream first.

lmagyar commented 10 months ago

Tailscale v1.56.0 and a new add-on version are out. Do they change anything?

elcajon commented 10 months ago

Unfortunately, it doesn't look like the problem has been solved with the latest version.

However, there are a few nice updates regarding the more integrated Web UI. Apparently there are still a few problems there due to the NGINX proxy for the Ingress connection. https://tailscale.com/kb/1325/device-web-interface

elcajon commented 10 months ago

I looked at the problem again this evening. As a short-term workaround (and probably a sustainable solution), my suggestion would be to switch from a oneshot to a longrun s6 service for both proxy and funnel. This would make a reset unnecessary and thus solve the problem. What do you think @lmagyar? I can also create a PR for this.

lmagyar commented 10 months ago

For the readonly UI I've created a new issue: #97

How can a longrun service fix this CPU issue? UPDATE: I meant: how can not resetting the serve settings solve the CPU issue?

Do you agree with these steps?

- make proxy/funnel longrun, don't use --bg, and don't reset the serve state
- drop advanced_config; users can freely use command-line serve/funnel commands at any time
- ~only this fork:~ detect previously saved --bg proxy/funnel settings and delete them only once, on the first startup

Later, once the non-read-only UI works and there is a serve/proxy/funnel web UI:

- drop the proxy/funnel add-on configs and services altogether
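A rough sketch of what the first step could look like in an s6-rc source directory. Assumptions: s6-overlay v3 conventions (a type file containing longrun next to an executable run script); the function name, service name, and target are hypothetical, and this is not the add-on's actual code.

```shell
#!/bin/sh
# Hypothetical generator for a longrun service definition.
# $1: s6-rc source dir (e.g. /etc/s6-overlay/s6-rc.d), $2: service name,
# $3: local target to proxy (e.g. localhost:8123)
make_longrun_service() {
    svc="$1/$2"
    mkdir -p "$svc"
    echo longrun > "$svc/type"
    {
        echo '#!/bin/sh'
        echo '# Foreground serve without --bg, so no reset should be needed on restart.'
        echo "exec tailscale serve $3"
    } > "$svc/run"
    chmod +x "$svc/run"
}
```

Because the supervisor restarts a longrun service when its process exits, the serve command can stay in the foreground instead of being configured once with --bg and reset on the next start.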

elcajon commented 10 months ago

To answer your question: I'm not sure what exactly causes the high tailscaled CPU usage, but something seems to hang in the background when a reset is executed first and a new serve command follows. The problem was analyzed in more detail by @antifuchs and mentioned in the upstream issue linked above. However, there has not yet been any feedback from the Tailscale dev team.

I think in the long term the issue has to be solved upstream, but by eliminating the reset command, the "advanced" solution could also be provided out of the box without further add-on parameters. So changing the service to longrun should be a win-win in the short term.

I would agree with your suggested steps.

lmagyar commented 10 months ago

A new version is out: proxy and funnel are merged into a new longrun serve service (due to the new CLI, we can exec either tailscale serve or tailscale funnel, but not both if we don't use the --bg option). It looks better, and a lot of code has been removed. :)
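The either/or choice described above might be sketched like this. The function name and its boolean argument are hypothetical and do not reflect the add-on's real option names; only the two foreground commands come from the comment.

```shell
#!/bin/sh
# Hypothetical: print the single foreground command the merged service would exec.
# $1: "true" to expose the target via funnel, anything else for tailnet-only serve
# $2: local target, e.g. localhost:8123
serve_command() {
    if [ "$1" = "true" ]; then
        echo "tailscale funnel $2"
    else
        echo "tailscale serve $2"
    fi
}
```

A run script could then exec the printed command, so exactly one of the two runs in the foreground per service instance.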

Does this solve the CPU issue?

elcajon commented 10 months ago

Proxy works great, thank you very much!

lmagyar commented 10 months ago

I'm closing this. As I understand it, this high CPU issue is solved by the merged, longrun proxy+funnel=serve service.

lmagyar / homeassistant-addon-tailscale

Really high CPU usage #82