Oops, this happens in the mainline Tailscale add-on too, if you disable user-mode networking: I'd left it at the default, which is "on", but the UI shows the setting as off. Turning it on and then off properly disables user-mode networking, and then both add-ons exhibit the enormous CPU usage behavior.
I really don't know, this seems to be some tailscale issue. Maybe some "strange" routing to the `tailscale0` interface causes high load. I'm just guessing. As I remember, the VM OS has somewhat different kernel features turned on/off.
If you set the add-on's `log_level` to `debug`, the tailscaled logs will not end after the first 200 lines. Maybe you can see what is going on.
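For example, once `log_level` is set to `debug` in the add-on configuration, the tailscaled output can be followed live from the ssh add-on; the container name below is derived from the `addon_<slug>_tailscale` naming used in the notes further down (a sketch, not an exact recipe):

```sh
# follow the tailscaled debug output of this forked add-on live
# (container name assumed from the add-on slug mentioned below)
docker logs -f addon_09716aab_tailscale
```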
Note: This forked add-on is currently nearly identical to the original (my changes finally got merged), so I planned to slowly abandon it, but the original add-on lags behind new tailscale releases again. This add-on can export the certs and has more proxy/funnel configuration options, but is otherwise identical.
Thanks for the quick response & sorry for assigning blame to your add-on; it really is a universal issue. I can't tell what's going on from the logs yet, but I'm setting up supervisor ssh access & will investigate the tailscaled process more closely from there.
No problem. :)
Note: You can access the tailscale add-on's container from the ssh add-on with `` docker exec -it `docker ps -q -f name=tailscale` /bin/bash ``
Note: To get access to the OS, see: https://developers.home-assistant.io/docs/operating-system/debugging/ The ssh add-on is only a docker container; you can see quite a lot from it, but the underlying OS is accessible on a different port. On the OS you can see the past logs (even after several restarts) with something like `journalctl CONTAINER_NAME=addon_xxxxxxxx_tailscale | tail -n 1000`, where `xxxxxxxx` is `09716aab` in case of my add-on, and `a0d7b954` in case of the official add-on.
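For reference, assuming the debug SSH access on port 22222 described in the linked docs, the two concrete commands would look like this:

```sh
# run on the HA OS host shell (not inside the ssh add-on container);
# journald keeps these logs across add-on restarts
journalctl CONTAINER_NAME=addon_09716aab_tailscale | tail -n 1000   # this forked add-on
journalctl CONTAINER_NAME=addon_a0d7b954_tailscale | tail -n 1000   # official add-on
```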
I got somewhat further: it looks like tailscaled is trying to listen on `:443` on the tailnet IPv6 address and getting `EADDRNOTAVAIL` back, while it is listening on `:443` on the IPv4 tailnet address. So something is wonky here and I think this is probably a tailscaled bug.
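If it helps anyone else confirm this, with `log_level` at `debug` the failing bind should repeat in the add-on's persisted logs; one way to check from the HA OS shell (the exact grep pattern is an assumption):

```sh
# show recent occurrences of the failing IPv6 bind in the persisted logs
journalctl CONTAINER_NAME=addon_09716aab_tailscale | grep -i 'addrnotavail' | tail -n 5
```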
I'll report that to the upstream folks & hopefully they'll be able to diagnose and fix that (:
I noticed the same thing, it seems to be the combination of `tailscale serve reset` followed by another `tailscale serve (--bg) localhost:8123`.
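In other words, running the same pair of commands by hand inside the add-on's container should be enough to reproduce it on an affected setup (a reproduction sketch; note that `tailscale serve reset` really does wipe the current serve config):

```sh
# reproduce the trigger manually; CPU usage climbs after this pair
tailscale serve reset
tailscale serve --bg localhost:8123
```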
At least for me there are no CPU problems when `Proxy` is deactivated. I have not tested the combination with `Funnel`.
For reference: https://github.com/tailscale/tailscale/issues/10320#issuecomment-1842983961
Just out of interest @lmagyar, do you also use the proxy in conjunction with the `tailscale0` interface (without userspace-networking) without any CPU problems between add-on restarts? If it works for you, what does your HA OS setup look like?
I find it quite interesting that the problem hasn't really been reported yet; I had mentioned it a few weeks ago, but no new issue has come up since.
I'm using HA OS on real RPi3s. I've now run some tests and can't reproduce this.
I've tested HA OS 32-bit on RPi3 and HA OS 64-bit on RPi3, with and without userspace networking (proxy and funnel were always enabled), with restarts, but no problem. With and without IPv6, no problem (though the IPv6 wasn't a real network, only an `fe80::` address).
I still think this is somehow HA OS VM related.
I'm afraid you're right... I'll experiment with other VM settings in my Proxmox later today; if that doesn't help, it's probably the qcow2 image. But let's wait and see what the tailscale devs say upstream first.
New tailscale v1.56.0 and a new add-on version are out, does it show any change?
Unfortunately, it doesn't look like the problem has been solved with the latest version.
However, there are a few nice updates regarding the more integrated Web UI. Apparently there are still a few problems there due to the NGINX proxy for the Ingress connection. https://tailscale.com/kb/1325/device-web-interface
I looked at the problem again this evening. As a short-term workaround (and probably a sustainable solution) my suggestion would be to switch from a oneshot to a longrun s6 service for both proxy and funnel. This would make a reset unnecessary and thus solve the problem. What do you think @lmagyar? I can also create a PR for this.
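A rough before/after of the idea, with the service shape and flags assumed rather than taken from the actual add-on code:

```sh
# oneshot (current behaviour, sketched): runs once at startup, then exits
#   tailscale serve reset
#   tailscale serve --bg localhost:8123
#
# longrun (proposed, sketched): one supervised foreground process; the serve
# config is tied to the process, so no explicit reset is needed on restart
exec tailscale serve --https=443 localhost:8123
```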
For the readonly UI I've created a new issue: #97
How can a longrun service fix this CPU issue? UPDATE: I meant: how can not resetting the serve settings solve the CPU issue?
Do you agree with these steps?

- Use `--bg`, don't reset serve state
- Keep the `--bg` proxy/funnel settings and delete them only once on the first startup (see the sketch below)

Later, when non-read-only UI works and there is any serve/proxy/funnel web UI in the future:
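A minimal sketch of what "delete them only once on the first startup" could look like, using the add-on's persistent /data directory and a hypothetical flag file name:

```sh
# one-time cleanup when the new version starts for the first time:
# clear any serve state left over from the old behaviour, then never
# call `tailscale serve reset` again on later restarts
FLAG=/data/serve_state_reset_done   # hypothetical marker file
if [ ! -f "${FLAG}" ]; then
    tailscale serve reset
    touch "${FLAG}"
fi
```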
To answer your question: I'm not sure what exactly the source of the high tailscaled CPU usage is, but something seems to hang in the background when first a reset and then a new serve command is executed. The problem was analyzed in more detail by @antifuchs and mentioned in the upstream issue linked above. However, there has not yet been any feedback from the Tailscale dev team.
I think in the long term the issue has to be solved upstream, but by eliminating the reset command, the "advanced" solution could also be provided out of the box without further add-on parameters. So changing the service to longrun should be a win-win in the short term.
I would agree with your suggested steps.
New version is out; proxy and funnel are merged into a new longrun serve service (due to the new CLI, we can exec either `tailscale serve` or `tailscale funnel`, but not both if we don't use the `--bg` option). It looks better, and a lot of code has been removed. :)
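For illustration, the "either/or" selection could look roughly like this in the longrun service's run script; the shebang follows the usual HA add-on s6-overlay convention, while the option name and flags are assumptions, not the add-on's actual code:

```sh
#!/command/with-contenv bashio
# exactly one foreground process is exec'd: funnel (publicly exposed)
# or serve (tailnet only); without --bg they cannot be combined
if bashio::config.true 'funnel'; then
    exec tailscale funnel --https=443 localhost:8123
else
    exec tailscale serve --https=443 localhost:8123
fi
```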
Does this solve the CPU issue?
Proxy works great, thank you very much!
I'm closing this. As I understand it, this high CPU issue is solved with the longrun, merged proxy+funnel=serve service.
I've been using the "tailscale with features" add-on for a few months, but recently I've noticed that my hassos VM is occupying a whole CPU core, with `tailscaled` pegged at 100% CPU usage. I'm not sure where that comes from - the logs are quiet and the proxied hass VM responds well. I can not reproduce this with the mainline "Tailscale" plugin - with it, the CPU usage stays under 5%.
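A quick way to confirm this from the ssh add-on (assuming busybox `top` is available inside the container):

```sh
# per-container CPU usage, then the processes inside the tailscale container
docker stats --no-stream | grep tailscale
docker exec -it $(docker ps -q -f name=tailscale) top -b -n 1 | head -n 15
```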
Please let me know what else you need to debug this issue; I'll do the best I can to provide you with data (however, I switched to the mainline plugin as that supports configuring a proxy these days...)