tailscale / tailscale

The easiest, most secure way to use WireGuard and 2FA.
https://tailscale.com
BSD 3-Clause "New" or "Revised" License
18.44k stars 1.41k forks source link

Linux: portmapper still consumes CPU after tailscaled disconnects #4891

Closed Qubitium closed 8 months ago

Qubitium commented 2 years ago

What is the issue?

Did not expect tailscaled to have so many interrupt/events ~17-40/s even when tailscale status=off and even when entire system is airplane mode (wifi off).

Tailscaled daemon should be less "active" when it is on but with zero link states or with no active netlinks.

Steps to reproduce

  1. start tailscaled
  2. connect to tailscale (tailscale up)
  3. disconnect tailscale (tailscale down)
  4. monitor interrupts/events via powertop

Kernel: 5.18.5-200.fc36.x86_64

Attached are two random powertop captures

  1. Tailscale=off Wifi=on https://pasteboard.co/kZTLt99Oq9YN.png
  2. Tailscale=off Wifi=off https://pasteboard.co/P45QqpjRVKrk.png

Are there any recent changes that introduced the issue?

No response

OS

Linux

OS version

Fedora 36

Tailscale version

1.26.1

Bug report

No response

bradfitz commented 2 years ago

Can you include a "tailscale bugreport" output when it's in this state so we can see what it's doing?

Qubitium commented 2 years ago

Can you include a "tailscale bugreport" output when it's in this state so we can see what it's doing?

@bradfitz

  1. Enabled tailscale up
  2. Watched few seconds of 1080p youtube.
  3. Stopped tailscale down
  4. Closed chrome (thorium) tab for youtube.
  5. launch powertop + tailscale bugreport

FYI, in Shanghai, so youtube would be accessible only via tailscale + exitnode.

The battery reports a discharge rate of 24.8 W
The energy consumed was 541 J
The estimated remaining time is 2 hours, 18 minutes

Summary: 2097.2 wakeups/second,  0.0 GPU ops/seconds, 0.0 VFS ops/sec and 65.2% CPU use

Power est.              Usage       Events/s    Category       Description
  7.19 W      0.0 pkts/s                Device         nic:tailscale0
  1.69 W      4.5 ms/s     513.5        Timer          tick_sched_timer
  553 mW      6.1 ms/s     168.0        Interrupt      [10] AMDI0010:00
  487 mW     22.9 ms/s     148.0        Process        [PID 2320] /usr/bin/kwin_wayland --wayland-fd 5 --socket wayland-0 --xwayland-fd 6 --xwayland-fd 7 --xwayland-display :1 --xway
  351 mW      1.8 ms/s     106.7        Process        [PID 55495] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
  344 mW      2.3 ms/s     104.6        Process        [PID 497] [irq/49-ELAN0768]
  319 mW     24.8 ms/s      96.8        Process        [PID 43858] /usr/bin/konsole
  311 mW      4.1 ms/s      94.6        Interrupt      [7] sched(softirq)
  309 mW      4.6 ms/s      93.9        Interrupt      [70] amdgpu
  184 mW    566.4 µs/s      56.0        Process        [PID 16] [rcu_preempt]
  125 mW      8.2 ms/s      38.1        Process        [PID 2467] /usr/bin/plasmashell --no-respawn
  111 mW      1.0 ms/s      33.8        Interrupt      [7] pinctrl_amd
  108 mW      4.5 ms/s      32.9        Process        [PID 42572] thorium
  104 mW      1.2 ms/s      31.6        kWork          kcryptd_crypt

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220620144029Z-085d107f9bc8fea5

Qubitium commented 2 years ago

Another powertop state capture with tailscale down

The battery reports a discharge rate of 5.39 W
The energy consumed was 130 J
The estimated remaining time is 10 hours, 18 minutes

Summary: 2591.9 wakeups/second,  0.0 GPU ops/seconds, 0.0 VFS ops/sec and 24.6% CPU use

Power est.              Usage       Events/s    Category       Description
  2.32 W      8.4 ms/s     775.7        Timer          tick_sched_timer
  1.77 W      0.0 pkts/s                Device         nic:tailscale0
  669 mW      6.2 ms/s     223.4        Interrupt      [10] AMDI0010:00
  448 mW      1.5 pkts/s                Device         Network interface: wlp2s0 (ath11k_pci)
  442 mW      4.3 ms/s     147.6        Process        [PID 497] [irq/49-ELAN0768]
  382 mW    100.0%                      Device         Radio device: btusb
  350 mW     10.9 ms/s     116.9        Process        [PID 2376] /usr/bin/Xwayland :1 -auth /run/user/1000/xauth_saIPkL -listen 51 -listen 52 -displayfd 43 -rootless -wm 46
  325 mW     31.6 ms/s     108.4        Process        [PID 2320] /usr/bin/kwin_wayland --wayland-fd 5 --socket wayland-0 --xwayland-fd 6 --xwayland-fd 7 --xwayland-display :1 --xway
  318 mW      5.3 ms/s     106.1        Interrupt      [70] amdgpu
  301 mW     13.5 ms/s     100.6        Process        [PID 57038] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=renderer --enable-crashpad --crashpad-handler-pid=56617 --enable-cr
  290 mW      4.1 ms/s      96.9        Interrupt      [7] pinctrl_amd
  258 mW      7.8 ms/s      86.3        Interrupt      [7] sched(softirq)
  230 mW     10.0 ms/s      76.9        Process        [PID 56691] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=gpu-process --enable-crashpad --crashpad-handler-pid=56617 --enable
  226 mW     13.9 ms/s      75.5        Process        [PID 56644] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=gpu-process --enable-crashpad --crashpad-handler-pid=56617 --enable
  200 mW      5.0 ms/s      66.9        Process        [PID 57036] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=renderer --enable-crashpad --crashpad-handler-pid=56617 --enable-cr
  183 mW      0.8 ms/s      61.0        Process        [PID 16] [rcu_preempt]
  147 mW     14.7 ms/s      49.2        Process        [PID 57032] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=renderer --enable-crashpad --crashpad-handler-pid=56617 --enable-cr
  125 mW     19.0 ms/s      41.8        Process        [PID 56615] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --no-default-browser-check --use-gl=desktop --enable-experimental-web-plat
  123 mW      1.2 ms/s      41.1        Process        [PID 628] [gfx_0.0.0]
  114 mW      7.4 ms/s      37.9        Process        [PID 2371] /usr/bin/kwin_wayland --wayland-fd 5 --socket wayland-0 --xwayland-fd 6 --xwayland-fd 7 --xwayland-display :1 --xway
  103 mW      3.4 ms/s      34.5        Process        [PID 56689] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --type=gpu-process --enable-crashpad --crashpad-handler-pid=56617 --enable
 88.3 mW      3.8 ms/s      29.5        Process        [PID 56634] /tmp/.mount_ThoriuEIsKeZ/usr/bin/thorium --no-default-browser-check --use-gl=desktop --enable-experimental-web-plat
 83.1 mW     16.1 ms/s      27.8        Process        [PID 43858] /usr/bin/konsole
 83.1 mW    384.7 µs/s      27.8        Process        [PID 55495] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 62.8 mW      0.0 µs/s      21.0        kWork          commit_work
 56.6 mW     68.0 µs/s      18.9        kWork          kfree_rcu_monitor
 53.6 mW     80.5 µs/s      17.9        kWork          kfree_rcu_work
 49.7 mW      7.9 ms/s      16.6        Process        [PID 2559] /usr/bin/ksystemstats
 45.4 mW    164.7 µs/s      15.2        kWork          psi_avgs_work
 37.5 mW      6.3 ms/s      12.5        Process        [PID 2467] /usr/bin/plasmashell --no-respawn
 32.9 mW    139.5 µs/s      11.0        kWork          mix_interrupt_randomness
 28.6 mW     36.8 µs/s       9.5        kWork          toggle_allocation_gate
 27.5 mW      5.1%                      Device         Display backlight
 25.6 mW    476.7 µs/s       8.6        Process        [PID 158] [kworker/2:1H]
 18.3 mW      3.9 ms/s       6.1        Process        [PID 55503] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 17.4 mW    432.9 µs/s       5.8        kWork          kcryptd_crypt
 17.4 mW     34.3 µs/s       5.8        Process        [PID 8457] lxd --logfile /var/snap/lxd/common/lxd/logs/lxd.log --group lxd
 15.8 mW      2.8 ms/s       5.3        Process        [PID 55606] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 14.0 mW    673.5 µs/s       4.7        kWork          btrfs_work_helper
 13.3 mW    633.7 µs/s       4.4        Process        [PID 2477] /usr/bin/plasmashell --no-respawn
 12.1 mW      1.1 ms/s       4.0        Process        [PID 2984] /usr/bin/ibus-daemon -r --xim
 11.8 mW    131.5 µs/s       3.9        kWork          amd_sfh_work_buffer
 10.7 mW    141.6 µs/s       3.6        Process        [PID 3087] /usr/sbin/mysqld --defaults-file=/home/diego/.local/share/akonadi/mysql.conf --datadir=/home/diego/.local/share/akon
 10.6 mW    144.3 µs/s       3.5        Process        [PID 921] [dmcrypt_write/2]
 10.4 mW     42.5 µs/s       3.5        Timer          watchdog_timer_fn
 9.68 mW    148.7 µs/s       3.2        Interrupt      [6] tasklet(softirq)
 9.08 mW    197.3 µs/s       3.0        Process        [PID 1602] dbus-broker --log 4 --controller 9 --machine-id cd3c549134fc487ba4a0f7d3deb7f524 --max-bytes 536870912 --max-fds 409
bradfitz commented 2 years ago

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220620144029Z-085d107f9bc8fea5

You had just stopped Tailscale at that point.

If your complaint is that Tailscale is doing too much work when stopped, let's look at a long period of it being stopped and see what it's doing.

Please take one bug report right after stopping it, then wait a couple hours (or in the morning after sleeping) and take another bug report, and then we can look at the time period between those bug reports to see what it was doing.

So far the only thing I see is that we don't stop subscribing to netlink events about link changes.

But those powertop dumps don't really say anything useful. 0.0 pkts/s isn't very damning. 😄

Qubitium commented 2 years ago

I have to disagree. My expecation is that the daemon should do nothing to wake the cpu up when tailscale is offline state. Here is the capture when wifi is off and tailscaled is restarted. It actively wakes up the cpu more often than my UI konsole which is very strange for an offline state. Unless, tailscale client offline != daemon offline?

Another easy test is to just watch the tailscaled daemon chew up cpu time when state is down and laptop is airplane (wifi/net disabled) mode. What is it doing? It should be doing almost nothing, or nothing in the perfect world. At 400mhz (lowest idle) ryzen 3+ 6850hs cpu, tailscaled continues to chew up to 1% cpu time.

The battery reports a discharge rate of 6.00 W
The energy consumed was 113 J
The estimated remaining time is 8 hours, 40 minutes

Summary: 1062.6 wakeups/second,  0.0 GPU ops/seconds, 0.0 VFS ops/sec and 8.5% CPU use

Power est.              Usage       Events/s    Category       Description
  1.48 W      0.0 pkts/s                Device         nic:tailscale0
  575 mW      3.9 ms/s     355.5        Timer          tick_sched_timer
  332 mW      3.6 ms/s     204.8        Interrupt      [10] AMDI0010:00
 98.1 mW     10.2 ms/s      59.2        Process        [PID 2320] /usr/bin/kwin_wayland --wayland-fd 5 --socket wayland-0 --xwayland-fd 6 --xwayland-fd 7 --xwayland-display :1 --xway
 97.4 mW      4.2 ms/s      59.7        Interrupt      [70] amdgpu
 82.9 mW      1.4 ms/s      51.2        Process        [PID 497] [irq/49-ELAN0768]
 68.4 mW      8.0 ms/s      41.1        Process        [PID 43858] /usr/bin/konsole
 64.6 mW    429.8 µs/s      40.0        Process        [PID 63037] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 53.1 mW      3.6 ms/s      32.3        Interrupt      [7] sched(softirq)
 52.4 mW    459.1 µs/s      32.4        Process        [PID 16] [rcu_preempt]
 26.1 mW      8.3 ms/s      14.9        Process        [PID 2559] /usr/bin/ksystemstats
 25.1 mW    551.5 µs/s      15.4        Interrupt      [7] pinctrl_amd
 24.4 mW    166.6 µs/s      15.1        kWork          psi_avgs_work
 23.3 mW      5.7 ms/s      13.6        Process        [PID 2467] /usr/bin/plasmashell --no-respawn
 15.5 mW     46.4 µs/s       9.6        kWork          toggle_allocation_gate
 11.1 mW      0.0 µs/s       6.9        kWork          commit_work
 10.5 mW      4.2 ms/s       5.9        Process        [PID 63039] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 10.4 mW    435.1 µs/s       6.4        kWork          kcryptd_crypt
 9.46 mW     17.1 µs/s       5.9        kWork          mix_interrupt_randomness
 8.28 mW      3.3 ms/s       4.6        Process        [PID 63437] /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port 416
 7.66 mW    167.9 µs/s       4.7        kWork          amd_sfh_work_buffer
 7.54 mW      9.1 µs/s       4.7        kWork          flush_to_ldisc

EDIT: Ignore the mW values as they are inaccurate estimates. The important items are the event/s. The tailscaled daemons has several threads and add them all togehter it is a lot of events for an offline state on wifi disabled.

bradfitz commented 2 years ago

I agree with you that Tailscale shouldn't waste power/CPU when it's off.

I want to find out why, but I need the information I requested in order to help you.

The information that you think is helpful to repeatedly paste on this bug is not the information that will help us fix the issue.

Qubitium commented 2 years ago

Just had a very strange 100% tailscaled cpu spke, which lasted around 30s causing laptop to throttle from 6w to 30w. During this time, it was using 100% cpu. Thought it was caused by browser until I checked htop and found tailscaled as casue.

The bug report was taken about ~45s after the incidient. At the exact moment of the 100% cpu spike, the wifi network was only showing about 35KB/s in traffic.

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621112743Z-aff9d220888a0f92

UPDATE:

Caught tailscaled using 100% cpu on a fresh system reboot for a solid 3-4 minutes. The following two bugreports are captured during minute ~2 and ~4 of the super high cpu usage. There was minimal wifi traffic over that time (burts of 60KB/s max)

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621175611Z-d200d44e763ce2a9 BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621175708Z-a02b2c9873706b5a

Qubitium commented 2 years ago

@bradfitz

1) systemctl start tailscaled 2) tailscale down 3) systemctl restart tailscaled

tailscaled daemon is now active and tailscale is in offline state

--- about 40 minutes later ---

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621153807Z-5ecc090331c716b0

observations:

1) tailscaled spawns 22 threads but only about 3 are cpu active in offline state. 2 are very active. 1 is extremely active chewing up cpu non-stop at a rate of 0.6% with frequent spikes >1.2% and seldom spikes to >4% 2) tailscaled process is using 1% cpu resource on-avg according to atop 3) according to powertop, 2 threads of tailscaled most active threads is waking up cpu (events) at a rate of around ~3-6/s for each of the two threads. There is a 3rd tailscaled thread that has wakeup events at a high rate of 20-40/s but each event appears to be very brief as to consume almost no cpu. The 2 most active threads has tangible cpu usage. The high wakeup 3rd thead does not.

bradfitz commented 2 years ago

--- about 40 minutes later ---

BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621153807Z-5ecc090331c716b0

2022-06-21 23:09:25.401530149 +0800 +0800: Switching ipn state NoState -> Stopped (WantRunning=false, nm=false)
...
2022-06-21 23:09:29.597835433 +0800 +0800: control: cancelMapSafely: synced=false
2022-06-21 23:09:29.597840262 +0800 +0800: control: cancelMapSafely: channel was full
2022-06-21 23:38:07.356598408 +0800 +0800: user bugreport: BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621153807Z-5ecc090331c716b0

Yeah, I see the 30 minute gap there in the logs, but there was no activity after the ~4 seconds of stop activity.

bradfitz commented 2 years ago

With clientmetrics,

2022-06-21 23:09:29.597840262 +0800 +0800: control: cancelMapSafely: channel was full
2022-06-21 23:38:07.356598408 +0800 +0800: user bugreport: BUG-19e9beec282638e6a430b5a48c4f967a80172f66ac7ac73663ad2961ecf19a61-20220621153807Z-5ecc090331c716b0
        portmap_pxp_response: 2
        portmap_pcp_sent: 1
        portmap_pcp_ok: 1
        portmap_pmp_sent: 1
        portmap_pmp_ok: 1
        portmap_upnp_sent: 1
        portmap_upnp_response: 4
        portmap_upnp_ok: 1
        portmap_upnp_updated_meta: 1
        controlclient_map_requests: 1
        controlclient_map_requests_poll: 1
        controlclient_map_response_message: 1
        controlclient_map_response_map: 1
        tstun_out_to_wg: 12
        tstun_out_to_wg_drop: 12
        tstun_out_to_wg_drop_filter: 10
        netcheck_report: 1
        netcheck_report_full: 1
        netcheck_stun_send_ipv4: 12
        netcheck_stun_recv_ipv4: 8
        gauge_magicsock_netmap_num_peers: 107
        magicsock_restun_calls: 3
        magicsock_update_endpoints: 3
        wgengine_minor_changes: 1

So looks like the portmapper client isn't shutting down on stop.

Qubitium commented 2 years ago

Just tested booting system with ipv6 completedly disabled via ipv6.disable=1 kernel param and tailscaled cpu 1% leech issue appears to be reduced.

With ipv6 off, tailscaled (daemon running, status=off) cpu usage for tailscaled daemon is now less than 1%.

Qubitium commented 2 years ago

Does this commit resolve this issue where portmapper is not shutdown on stop? @bradfitz https://github.com/tailscale/tailscale/commit/9280d396789015cfb3dc68c57d1daa44eaee6572

bradfitz commented 2 years ago

@diegomontoya, no, that's unrelated.

JustinAzoff commented 8 months ago

FWIW, there seems to be at least one bug with PowerTOP version 2.15 on my laptop where it reports that nic:tailscale0 uses 4w even with tailscale down and 0.0 pkts/s on the interface. This is 100% nonsense, as completely stopping tailscale at this point doesn't change the overall usage by .1w, let alone 4w. It's almost like powertop is attributing all the remaining unaccounted for power usage to nic:tailscale0 for some reason.

So, I wouldn't necessarily trust anything powertop says.

Qubitium commented 8 months ago

@JustinAzoff "mW" power estimation is fairly bogus from my observation as well. But interrupts/events are accurate.