tailscale / tailscale

The easiest, most secure way to use WireGuard and 2FA.
https://tailscale.com
BSD 3-Clause "New" or "Revised" License

RAM completely fills over time on Raspberry Pi 3B with Fedora IoT on kernel 6.8.4 and later #11888

Open Procsiab opened 3 weeks ago

Procsiab commented 3 weeks ago

What is the issue?

Description

Leaving the system in an "idle" state results in the loss of roughly 100 MB of RAM every hour, until the system becomes unresponsive and reboots. Monitoring the situation with top does not show any process that stands out for memory usage, yet free memory keeps decreasing over time.
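
The growth can be confirmed with a small loop like the following (the one-minute interval and the log path are arbitrary):

# log MemAvailable (kB) once a minute; the value drops by roughly 100 MB per hour
while true; do
    printf '%s %s\n' "$(date -Is)" "$(awk '/MemAvailable/ {print $2}' /proc/meminfo)" >> /var/tmp/memavail.log
    sleep 60
done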

System info

I disabled every other service I had added to the "vanilla" deployment of Fedora IoT and I am able to reproduce this issue only when Tailscale is running, on two different Raspberry Pis. NOTE that I am no longer able to reproduce the issue if I roll back to the deployment 39.20240403.0, which bundles kernel 6.7.11; every deployment committed later includes kernel 6.8, hence my assumption that the kernel change is related.

For reference, here is the diff between the latest working Fedora IoT deployment and the first one that shows the issue (a sketch of the rollback commands follows the diff):

39.20240403.0   d3e4c26cd0e28da506cd60f99b7278e34fbbcd12cb015fc68f038d380b2a311f

39.20240407.0   094f9134bea14944f507c4bb881496a73b660bdbafba687ed5e90bfbf8c634df    fwupd 1.9.15-1.fc39 -> 1.9.16-1.fc39
                                                                                    fwupd-plugin-modem-manager 1.9.15-1.fc39 -> 1.9.16-1.fc39
                                                                                    fwupd-plugin-uefi-capsule-data 1.9.15-1.fc39 -> 1.9.16-1.fc39
                                                                                    kernel 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    kernel-core 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    kernel-modules 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    kernel-modules-core 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    kernel-tools 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    kernel-tools-libs 6.7.11-200.fc39 -> 6.8.4-200.fc39
                                                                                    libxmlb 0.3.15-1.fc39 -> 0.3.17-1.fc39
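
For anyone reproducing the rollback, this is roughly how an older deployment can be pinned on Fedora IoT, assuming rpm-ostree as the deployment tool and using the version string from the table above:

rpm-ostree status                 # list the deployments currently on disk
rpm-ostree deploy 39.20240403.0   # or: rpm-ostree rollback, to return to the previous deployment
systemctl reboot                  # the change takes effect on the next boot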

Steps to reproduce

NOTE: I am no longer able to reproduce the issue if I stop the Tailscale systemd service; however, stopping the client does not free the already-allocated RAM, which requires a reboot.
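
Concretely, the check amounts to stopping the unit and watching the counter (tailscaled.service is the upstream unit name):

sudo systemctl stop tailscaled

# growth stops, but the already-lost memory only comes back after a reboot
watch -n 60 'grep MemAvailable /proc/meminfo'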

Are there any recent changes that introduced the issue?

Deploy a Fedora IoT version for aarch64 that includes kernel 6.8: on x86_64 the issue is not reproducible (tested on both virtual and physical machines), even with the same Fedora IoT deployments that trigger it on aarch64.
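
To confirm which combination is in use on each machine, a quick check with standard tools:

uname -m   # aarch64 on the affected Raspberry Pis, x86_64 on the unaffected machines
uname -r   # 6.8.x on the deployments that leak, 6.7.11 on the last working one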

OS

Linux

OS version

Fedora IoT 39 and 40

Tailscale version

1.44.1 to 1.64.0

Other software

I am using a Headscale 0.22.3 server to connect the clients.

Bug report

No response

bradfitz commented 3 weeks ago

Can you run tailscale debug --mem-profile=tailscale.mem.pprof and attach that file here? You might have to zip it or tar.gz it to get GitHub to accept it.
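
For example (the archive name is arbitrary):

tailscale debug --mem-profile=tailscale.mem.pprof
tar czf tailscale.mem.pprof.tar.gz tailscale.mem.pprof    # or zip it instead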

Procsiab commented 3 weeks ago

After testing some permutations of software and system versions yesterday, I left the Tailscale client 1.48.1 installed, and gathered the requested information with it running (since it is a version that triggers the issue I am reporting).

I left one ARM64 system with the deployment 39.20240403.0 running overnight, and then I captured the debug info with the command @bradfitz provided; this first file is called tailscale_kernel6711.mem.pprof.

Afterwards, on the same system, I deployed the version 39.20240407.0 (you can find the package diff between the two in my first post) and let it run for four hours, during which the memory slowly filled up; finally, I captured the debug info again with the same command. The second file is called tailscale_kernel684.mem.pprof.

I am attaching a ZIP archive containing the two files: tailscale.mem.pprof.zip

Additional info

On the affected system, I started the client in the following way:

tailscale up --accept-dns=true --login-server=https://myheadscale.org

And the contents of /etc/defaults/tailscaled are the following:

PORT="41641"
FLAGS=""
TS_NO_LOGS_NO_SUPPORT=true
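
For what it's worth, the environment file path that tailscaled actually reads can vary between packagings; which file the unit uses can be confirmed with:

systemctl cat tailscaled                          # prints the unit, including any EnvironmentFile= lines
systemctl show tailscaled -p EnvironmentFiles     # just the environment file paths
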
bradfitz commented 3 weeks ago

I only see ~45 MB of memory in those pprof files, not hundreds.

% go tool pprof tailscale_kernel6711.mem.pprof
File: tailscaled
Type: inuse_space
Time: Apr 27, 2024 at 1:42am (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 46969.28kB, 97.87% of 47993.68kB total
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
32679.99kB 68.09% 68.09% 32679.99kB 68.09%  github.com/tailscale/wireguard-go/device.(*Device).PopulatePools.func3
    6536kB 13.62% 81.71%     6536kB 13.62%  tailscale.com/net/tstun.wrap
 2064.04kB  4.30% 86.01%  2064.04kB  4.30%  github.com/tailscale/wireguard-go/tun.newTCPGROTable
 1401.24kB  2.92% 88.93%  1401.24kB  2.92%  github.com/klauspost/compress/zstd.encoderOptions.encoder
 1184.27kB  2.47% 91.40%  1184.27kB  2.47%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist
 1024.05kB  2.13% 93.53%  1024.05kB  2.13%  tailscale.com/wgengine/magicsock.(*endpoint).handlePongConnLocked
  540.51kB  1.13% 94.66%   540.51kB  1.13%  github.com/tailscale/wireguard-go/device.newHandshakeQueue
  513.50kB  1.07% 95.73%   513.50kB  1.07%  bytes.growSlice
     513kB  1.07% 96.80%      513kB  1.07%  vendor/golang.org/x/net/http2/hpack.newInternalNode
  512.69kB  1.07% 97.87%   512.69kB  1.07%  encoding/pem.Decode
(pprof) %                                                                            
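
(For reference, the two captures can also be compared directly with pprof's base-profile subtraction, which highlights only the allocations that grew between the two kernels:)

% go tool pprof -base tailscale_kernel6711.mem.pprof tailscale_kernel684.mem.pprof
(pprof) top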

Sure it's the Tailscale process?

Procsiab commented 3 weeks ago

I am sure that, from visually inspecting the memory usage with top, nothing stands out, not even the Tailscale client; however, I am also sure that if I stop the Tailscale client's systemd unit the memory stops filling, and if I reboot in this state (with the client disabled and not starting at boot) the memory never fills up again. Moreover, I have tested this scenario on other virtual and physical machines, all of them running the same deployment of Fedora IoT and the same Tailscale version, and it seems reproducible only on aarch64 with kernel 6.8.
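
Since no userspace process accounts for the missing memory, one generic check (not something already tried in this thread) is whether the growth shows up in kernel slab allocations instead:

# unreclaimable kernel memory; if this climbs by ~100 MB per hour, the leak
# is on the kernel side rather than inside the tailscaled process
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo

# one-shot view of the largest slab caches (needs root to read /proc/slabinfo)
sudo slabtop -o | head -n 20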

wolf-yuan-6115 commented 2 weeks ago

I am sure that, from visually inspecting the memory usage with top, nothing stands out, not even the Tailscale client; however, I am also sure that if I stop the Tailscale client's systemd unit the memory stops filling, and if I reboot in this state (with the client disabled and not starting at boot) the memory never fills up again. Moreover, I have tested this scenario on other virtual and physical machines, all of them running the same deployment of Fedora IoT and the same Tailscale version, and it seems reproducible only on aarch64 with kernel 6.8.

Hello, in my case it's not Tailscale's fault; see https://discussion.fedoraproject.org/t/high-memory-usage-in-f40-on-rpi-4-unable-to-find-which-process-used-them/114598/ You are probably not the only one facing this issue.

Procsiab commented 2 weeks ago

Thanks @wolf-yuan-6115 for your reply: I did not notice the discussion you started on the Fedora forum. When I searched the internet a couple of weeks ago I dismissed it, because the title and first message discuss the Fedora 40 upgrade, while I can reproduce my issue on Fedora 39 (the IoT version) as well.

For the time being, I'll take the discussion over to the Fedora forum and close this issue, while we figure out whether this has anything to do with kernel 6.8 itself on ARM64.