Closed horihel closed 1 month ago
@horihel Thanks for reporting this, are you able to verify that your host machine still has access to the internet when you encounter this state?
Looking at the logs I see a lot of the following:
2024-01-12T14:53:53.265Z: Running connectivity test with timeout of 5000 ms
2024-01-12T14:53:53.265Z: Got error while checking connectivity: timed out after 5000 ms
> are you able to verify that your host machine still has access to the internet
Could this be related to VPN timeouts? Fortinet has both an auth-timeout
(default 8 hours) and an idle-timeout
(default 5 min) that could be interfering here. Actual timeout values may be configured differently.
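If the FortiGate side is worth checking, the two defaults mentioned above map (roughly; exact syntax varies by FortiOS version, and the `#` comments here are annotations, not valid CLI) to:

```
config vpn ssl settings
    set auth-timeout 28800   # 8 hours, in seconds
    set idle-timeout 300     # 5 minutes, in seconds
end
```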
@Nino-K We should still try to repro if simply disconnecting from the network causes the sluggish UI behaviour. Rancher Desktop should work properly in offline mode as long as all required assets are already cached.
> @horihel Thanks for reporting this, are you able to verify that your host machine still has access to the internet when you encounter this state?
> Looking at the logs I see a lot of the following:
I can confirm that connectivity to the internet is up the whole time. What does the "connectivity test" do in detail? Maybe I can find out why it logs errors on my machine.
I tried disconnecting from the network, which did lead to Rancher detecting itself as offline, but (according to the diagnostics log) not with a timeout (instead it reported `getaddrinfo ENOTFOUND docs.rancherdesktop.io`).
I also tried blocking internet access for Rancher Desktop only, by blocking its process with the Windows Firewall, but could not produce a timeout that way either: the firewall reports back immediately, so the connectivity check fails instantly with `reason: connect EACCES 185.199.109.153:80`.
So far, simply disconnecting does not reproduce the problem, as a plain disconnect does not lead to timeouts (assuming those timeouts are the cause of the problem).
> What does the "connectivity test" do in detail?
It fetches http://docs.rancherdesktop.io/ and confirms it gets a redirect (301) response:
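That check presumably boils down to an HTTP GET with a short timeout, treating the expected 301 as "online". Here is a hedged sketch (function names and structure are illustrative, not the actual Rancher Desktop code):

```typescript
import * as http from 'http';

// Treat the expected 301 redirect from the docs site as "online".
function isExpectedRedirect(statusCode: number | undefined): boolean {
  return statusCode === 301;
}

// Fetch the URL with a timeout; resolve true only on the expected redirect.
// Defaults mirror the values seen in the logs (5000 ms, docs.rancherdesktop.io).
function checkConnectivity(url = 'http://docs.rancherdesktop.io/', timeoutMs = 5000): Promise<boolean> {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // drain the body so the socket is released
      resolve(isExpectedRedirect(res.statusCode));
    });
    req.on('timeout', () => { req.destroy(); resolve(false); });
    req.on('error', () => resolve(false));
  });
}
```

If the request stalls inside the app (rather than failing fast at the OS level), that would explain `timed out after 5000 ms` entries even while `curl` on the same machine reaches the site fine.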
Waiting a little made the problem happen again. Network shows offline, GUI takes about 10 seconds to register a click. I tried curling http://docs.rancherdesktop.io and it looks fine.
# curl -f http://docs.rancherdesktop.io
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>
This is from the "rancher-desktop" wsl2 instance. Host instance is working fine as well and is able to contact the site too.
I tried disabling FortiVPN (with rancher desktop already hanging) and it did not recover.
I also noticed that "diagnostics.log" has not seen any more error messages for about 10 minutes now. It's 09:57 UTC and the last log line is
2024-01-16T09:46:13.853Z: Got error while checking connectivity: timed out after 5000 ms
So the network connectivity check might not even be called any more.
I also noticed that at the time of the hang, Rancher Desktop.exe (the parent executable) is using 2GB of RAM, while it sits at around 144MB working set when started fresh.
I suspect rancher-desktop is leaking memory, but I'm not sure if that's connected to the hang issue.
The time when memory shoots up roughly corresponds to me installing a few big Helm charts.
@horihel are you able to disable the VPN client "FortiVPN" and see if you can reproduce the hanging state? I suspect it is the VPN Client that might be interfering here.
@Nino-K not for very long, but I'll try. I cannot uninstall it, but disconnecting it should be possible for a time.
FYI Rancher Desktop updated to 1.12.1 today. According to the release notes I don't expect changes in behaviour
> @Nino-K not for very long, but I'll try. I cannot uninstall it, but disconnecting it should be possible for a time.
> FYI Rancher Desktop updated to 1.12.1 today. According to the release notes I don't expect changes in behaviour
Yeah, I just meant disabling it temporarily. Thanks
Okay - yesterday I started RD with VPN disconnected, and while it had trouble starting (wsl2 startup failure - a `wsl --update` fixed that), it ran without a problem (and without VPN) until about lunchtime, when I had to connect to VPN. RD stayed responsive and memory usage was stable at around 100-150MB (parent process only), despite VPN being active the second half of the day.
Today I'm in a different office with no need for VPN (but also no functional restrictions with outgoing traffic) - so VPN is permanently off, but after 90min RD is hung again and is already up to 1G of RAM and steadily growing.
To me this doesn't look like FortiVPN is the cause of the issue... at least its state (connected/disconnected) doesn't seem to be related to RD hanging and leaking memory.
I started my session with remote-debugger attached today and did a few things. Here's the network trace for a "page reload" with performance trace enabled:
It's a bit odd that just the index.html (which is static?) takes almost 6s to load
Same if you look at the performance tab...
Unfortunately I don't know where to dig from here.
Tried 1.13.0 today; same problem.
I haven't done extensive testing with RD in several months, I don't use a VPN, I am currently on 1.13.1, and I am experiencing this issue as well. It seems like I can deploy/destroy pods fine, but the GUI becomes very slow as stated by @horihel. WSL is working fine as well. This used to never be a problem; I'd have my cluster running with pods for days at a time, but now it seems like even without having pods running, I'll have to force quit and factory reset RD at least once a day.
I can confirm that it seems RD is leaking RAM very consistently:
@mikeseese & @horihel I can confirm I'm also experiencing this as of late with 1.13.1, plus the specific build I am currently running (Version: 1.13.1-490-g97aaae25), which addresses K3s cluster startup issues with antivirus `rmdir` problems, as discovered under #6839.
My personal experience is that when clicking the system tray icon to bring up the RD dashboard or cluster dashboard, left/right-clicks do nothing in the application, forcing me to kill it from Task Manager by PID. Usually this sluggishness/unresponsiveness comes from letting the cluster idle in the background, as @horihel described above. The Network status will show "Offline", and if the dashboard does happen to come up, the entire application is extremely sluggish and resource-intensive.
If this behavior occurs again, I'll capture system resource logs to check for memory leakage.
FYSA we are currently running a Sonicwall VPN appliance but are in the process of migrating to a Fortinet VPN appliance within the next month. I suspect we'd see similar behavior with RD acting sluggish running in a "Tunnel-All" mode.
Still happens on 1.14.2, even without any active VPN.
@Nino-K I believe this is an outstanding bug that needs to be addressed.
I'm continuing to see this unresponsiveness from Rancher Desktop when uptime exceeds ~24 hours, though I have not nailed down a specific uptime threshold where the behavior occurs. Overall, when this happens, I cannot access my RD app, cluster dashboard, or any other items related to RD until I kill the application.
Is there anything @horihel and I can test on our machines to provide verbose logging details to assist with moving this bug into higher priority review by the team?
I concur that I'm also experiencing this unresponsiveness while OFF our corporate VPN. When the RD application becomes unresponsive like this, it requires a full Task Manager kill of the Rancher Desktop application, then a full restart of RD, which requires rebuilding my cluster.
Here's another capture of logs: logs.zip
I've provided times in local time below as it helps paint a picture, but note that the log timestamps are 7 hours ahead of local time (12AM local => 7AM UTC log time).
Some notable considerations:
- `diagnostics.log` shows connectivity test failures for ~2 hours between 5:25AM and 7:19AM local time
- `diagnostics.log` never shows the connectivity test passing
- The Windows power plan had "Turn off hard disk after 20 minutes" set (I don't have any spinning hard disks; my C:\ drive is a 1TB NVMe SSD and I have a data D:\ drive that's an 8TB NVMe SSD RAID0 of 4x2TB drives), and Link State Power Management (aka Active State Power Management, ASPM) was set to "Moderate power savings". I'm not sure if either of these would have impacted anything; perhaps the latter could have turned off the ethernet NIC on the motherboard? Note that RAM was still climbing when I signed in around 10AM, despite the connectivity outage.
- `images.log` had quite a bit of logging (~50MB); it has many consistent logs from 1:30AM to 7:19AM (the last log here was 4 seconds after the last connectivity log in `diagnostics.log`)
- `k3s.log` has a periodic COMPACT set of logs every 5 minutes from 1PM the day before (shortly after bootstrapping, it seems) and doesn't stop (it's still going at 5-minute intervals despite the application being slow). I'm guessing this is expected behavior.

My gut feeling is that the 5:25AM network connectivity "outage" is a symptom of the memory leak, and the 7:19AM logging outage is a further symptom of the memory leak causing systems to hang. Perhaps `images.log` is a sign pointing to the root cause. I will continue to investigate, but hope to get some insight from the core devs.
I was able to capture a memory heap snapshot. I can upload the 2GB file if it's helpful, just let me know. It seems that a big chunk of RAM is due to creating `FSWatcher` instances that don't appear to be cleaned up.

There are two `watch` functions: one in the `Tray` class in `watchForChanges` that directly creates an `FSWatcher`, and another in `pkg/rancher-desktop/backend/kube/client.ts` under `WrappedWatch::watch`, but the underlying dependency function there doesn't seem to use FS watching.

`Tray::watchForChanges` seems like a viable candidate, as randomly pausing the application with the inspector does break there. CC @mook-as, as you seem to be the author of the "new" logic 19 months ago and seemingly earmarked a change 2 weeks ago.

Theoretically, the use of the `AbortController` here should be closing out the cyclic watching, and FWIW the heap snapshot mentions the watcher might be garbage collected (not sure how to really read this though):

Regardless, there are 600k+ open handles to this logic, and it seems the attempt to prevent the memory leak isn't always working. Any thoughts @mook-as?
I was able to verify there is a leak in `Tray::watchForChanges` by taking a heap snapshot on a fresh start, seeing that there were 2 `FSWatcher` instances, changing `~/.kube/config`, and seeing that there were now 5 `FSWatcher` instances (for whatever reason, there are 3 events for each file). The number keeps climbing each time the kubeconfig is changed.

I put together a simple fix that prevents this leak with a different approach to the intent of the cyclic nature of `watchForChanges` ("let's not rebuild the tray config all the time") and will submit a PR shortly.

Edit/fun note: I was a little perplexed why there would be 200k (600k / 3 events per change) changes to my kubeconfig in a 24-hour window, but I had the inkling that the watchers don't stop watching at all and still trigger a rewatch. Turns out this is the case. Every "change" (using VS Code, make a change, save once) triggers ~4x the previous number of watchers (it varies slightly; perhaps some do get GC'd). Fewer than 10 saves gets you to 600k instances on 4^n.
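The growth described above can be modeled with trivial arithmetic (the ~4x factor per save is taken from the observation above; this is an illustration, not the actual code):

```typescript
// Illustrative arithmetic only: if every save multiplies the number of
// live FSWatcher instances by ~4 (because nothing is ever closed),
// growth is initial * factor^saves.
function watcherCountAfterSaves(initial: number, factorPerSave: number, saves: number): number {
  let count = initial;
  for (let i = 0; i < saves; i += 1) {
    count *= factorPerSave; // every live watcher re-arms; none are aborted
  }
  return count;
}
```

With the observed starting point of 2 instances and a ~4x factor, about ten saves is already enough to exceed the ~600k handles seen in the heap snapshot.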
Actual Behavior
Rancher Desktop silently freezes after a few hours of operation. The GUI is not exactly frozen - it is just extremely slow (draws/clicks take about 10 seconds each).
The only visible indicator is "Network" being "offline" at some point.
K8s might still be reachable through the API (k9s still works), but many bigger operations (helm) will just time out.
This might be related to #6057, but in my case, a `taskkill /F /T /IM "rancher desktop.exe"` followed by a `wsl --shutdown` allows me to restart Rancher Desktop and continue working.

Steps to Reproduce
Run Rancher Desktop for a few hours and observe "Network: offline" - after a few minutes, you'll wait 10-20 seconds for any click on the GUI or taskbar icon. Shutdown can be initiated, but will never complete.
Result
Logs are unremarkable (to me): no errors that correspond to anything that might have happened after actually starting up the cluster.
Expected Behavior
no freeze
Additional Information
No response
Rancher Desktop Version
1.12.0 (but also happened on 1.11.x)
Rancher Desktop K8s Version
1.28.x
Which container engine are you using?
containerd (nerdctl)
What operating system are you using?
Windows
Operating System / Build Version
Windows 11 Pro 23H2 22631.3007
What CPU architecture are you using?
x64
Linux only: what package format did you use to install Rancher Desktop?
None
Windows User Only
FortiClient VPN is active during all of this. Split tunnel is active, Internet access is unrestricted.