Freeze after a few hours

horihel commented 8 months ago

Actual Behavior

Rancher Desktop silently freezes after a few hours of operation. The GUI is not exactly frozen - it is just extremely slow (draws/klicks take about 10 seconds each).

The only visible indicator is "Network" being "offline" at some point.

K8S might be still reachable through the API (k9s still works), but many bigger operations (helm) will just time out.

This might be related to #6057 , but in my case, a taskkill /F /T /IM "rancher desktop.exe" followed by a wsl --shutdown allows me to restart Rancher Desktop and continue working.

Steps to Reproduce

run Rancher Desktop for a few hours, observe "network: offline" - after a few minutes, you'll wait for 10-20 seconds for any click on the GUI or Taskbar icon. Shutdown can be initiated, but will never complete.

Result

logs are unremarkable (to me). No errors that correspond to anything that might have happened after actually starting up the cluster.

Expected Behavior

no freeze

Additional Information

No response

Rancher Desktop Version

1.12.0 (but also happened on 1.11.x)

Rancher Desktop K8s Version

1.28.x

Which container engine are you using?

containerd (nerdctl)

What operating system are you using?

Windows

Operating System / Build Version

Windows 11 Pro 23H2 22631.3007

What CPU architecture are you using?

x64

Linux only: what package format did you use to install Rancher Desktop?

None

Windows User Only

FortiClient VPN is active during all of this. Split tunnel is active, Internet access is unrestricted.

horihel commented 8 months ago

logs.zip

Nino-K commented 8 months ago

@horihel Thanks for reporting this, are you able to verify that your host machine still has access to the internet when you encounter this state?

Looking at the logs I see a lot of the following:

2024-01-12T14:53:53.265Z: Running connectivity test with timeout of 5000 ms
2024-01-12T14:53:53.265Z: Got error while checking connectivity: timed out after 5000 ms

jandubois commented 8 months ago

are you able to verify that your host machine still has access to the internet

Could this be related to VPN timeouts? Fortinet has both an auth-timeout (default 8 hours) and an idle-timeout (default 5 min) that could be interfering here. Actual timeout values may be configured differently.

jandubois commented 8 months ago

@Nino-K We should still try to repro if simply disconnecting from the network causes the sluggish UI behaviour. Rancher Desktop should work properly in offline mode as long as all required assets are already cached.

horihel commented 8 months ago

@horihel Thanks for reporting this, are you able to verify that your host machine still has access to the internet when you encounter this state?

Looking at the logs I see a lot of the following:

I can confirm that connectivity to the internet is up the whole time. What does the "connectivity test" do in detail? Maybe I can find out why it logs errors on my machine.

horihel commented 8 months ago

I tried disconnecting from the network, which did lead to rancher detecting itself as offline, but (according to diagnostics log) not with a timeout (instead it did a getaddrinfo ENOTFOUND docs.rancherdesktop.io). I also tried blocking Internet access for rancher desktop only by blocking its process using the Windows Firewall, but could also not produce a timeout, as the firewall directly reports back and the connectivity check fails instantly with reason: connect EACCES 185.199.109.153:80

so far simply disconnecting does not reproduce the problem, as a simple disconnect will not lead to timeouts (assuming those timeouts are the cause of the problem)

jandubois commented 8 months ago

What does the "connectivity test" do in detail?

It fetches http://docs.rancherdesktop.io/ and confirms it gets a redirect (301) response:

https://github.com/rancher-sandbox/rancher-desktop/blob/fcffd3cc071a9414dcf03c895792b5142116ffd4/pkg/rancher-desktop/main/diagnostics/connectedToInternet.ts#L34-L61

horihel commented 8 months ago

Waiting a little made the problem happen again. Network shows offline, GUI takes about 10 seconds to register a click. I tried curling http://docs.rancherdesktop.io and it looks fine.

# curl -f http://docs.rancherdesktop.io
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

This is from the "rancher-desktop" wsl2 instance. Host instance is working fine as well and is able to contact the site too.

I tried disabling FortiVPN (with rancher desktop already hanging) and it did not recover.

I also noticed that "diagnostics.log" is not seeing any more error messages since about 10 minutes now. It's 09:57 UTC and the last log line is

2024-01-16T09:46:13.853Z: Got error while checking connectivity: timed out after 5000 ms

So the network connectivity check might not even be called any more.

horihel commented 8 months ago

I also noticed that at the time of the hang, Rancher Desktop.exe (the parent executable) is using 2GB of RAM, while it sits at around 144MB working set when started fresh.

horihel commented 8 months ago

I suspect rancher-desktop is leaking memory, but I'm not sure if that's connected to the hang issue. grafik

the time where memory shoots up roughly corresponds to me installing a few big helm charts.

Nino-K commented 8 months ago

@horihel are you able to disable the VPN client "FortiVPN" and see if you can reproduce the hanging state? I suspect it is the VPN Client that might be interfering here.

horihel commented 8 months ago

@Nino-K not for very long, but I'll try. I cannot uninstall it, but disconnecting it should be possible for a time.

FYI Rancher Desktop updated to 1.12.1 today. According to the release notes I don't expect changes in behaviour

Nino-K commented 8 months ago

@Nino-K not for very long, but I'll try. I cannot uninstall it, but disconnecting it should be possible for a time.

FYI Rancher Desktop updated to 1.12.1 today. According to the release notes I don't expect changes in behaviour

Yeah, I just meant disabling it temporarily. Thanks

horihel commented 7 months ago

Okay - yesterday I started RD with VPN disconnected and while it had trouble starting (wsl2 startup failure - a wsl --update fixed that) it ran without a problem (and without VPN) till about lunchtime when I had to connect to VPN. RD stayed responsive and memory usage was stable at around 100-150MB (parent process only) - despite VPN being active the second half of the day.

Today I'm in a different office with no need for VPN (but also no functional restrictions with outgoing traffic) - so VPN is permanently off, but after 90min RD is hung again and is already up to 1G of RAM and steadily growing.

To me this doesn't look like FortiVPN is the cause of the issue... at least its state (connected/disconnected) doesn't seem to be related to RD hanging and leaking memory.

horihel commented 7 months ago

I started my session with remote-debugger attached today and did a few things. Here's the network trace for a "page reload" with performance trace enabled: grafik

It's a bit odd that just the index.html (which is static?) takes almost 6s to load

Same if you look at the performance tab... grafik

Unfortunately I don't know where to dig from here.

horihel commented 6 months ago

tried 1.13.0 today, same problem.

mikeseese commented 3 months ago

I haven't done extensive testing with RD in several months, I don't use a VPN, I am currently on 1.13.1, and I am experiencing this issue as well. It seems like I can deploy/destroy pods fine, but the GUI becomes very slow as stated by @horihel. WSL is working fine as well. This used to never be a problem; I'd have my cluster running with pods for days at a time, but now it seems like even without having pods running, I'll have to force quit and factory reset RD at least once a day.

I can confirm that it seems RD is leaking RAM very consistently:

srrich commented 3 months ago

@mikeseese & @horihel Can confirm I'm also experiencing this as of late with 1.13.1 plus the specific build version I am currently running (Version: 1.13.1-490-g97aaae25) to address K3S cluster startup issues with AntiVirus rmdir issues; as discovered under #6839.

My personal experience is when clicking the system tray icon to get the RD dashboard OR cluster dashboard to appear, the left/right-clicks do nothing with the application; forcing me to force kill the application from Task Manager/PID. Usually this sluggish/unresponsiveness is from letting the cluster idle in the background as @horihel described above. The Network status will show "Offline" and if there's the chance the dashboard does come up, the entire application is extremely sluggish/resource intensive.

If this behavior occurs again, I'll capture system resource logs to check for memory leakage.

FYSA we are currently running a Sonicwall VPN appliance but are in the process of migrating to a Fortinet VPN appliance within the next month. I suspect we'd see similar behavior with RD acting sluggish running in a "Tunnel-All" mode.

horihel commented 2 months ago

still happens on 1.14.2 even without any active VPN.

srrich commented 2 months ago

@Nino-K I believe this is an outstanding bug that needs addressed.

I'm continuing to see this unresponsiveness from Rancher Desktop when uptime exceeds ~24 hours; though I have not nailed down a specific uptime threshold where this behavior specifically occurs. Overall when this occurs, I cannot access my RD app, cluster dashboard or any other items related to RD until I kill the application.

Is there anything @horihel and I can test on our machines to provide verbose logging details to assist with moving this bug into higher priority review by the team?

I concur that I'm also experiencing this unresponsiveness behavior while OFF our corporate VPN. When the RD application becomes unresponsive like this, requires full Task Manager kill of the Rancher Desktop application....then full restart of RD requiring rebuild of my cluster.

mikeseese commented 1 month ago

Here's another capture of logs: logs.zip

I've provided things in local time below as it helps paint a picture, but note that the logs are 7 hours after local time (12AM local => 7AM UTC log time)

Some notable considerations:

The computer is set to never sleep
The computer was locked overnight while I was sleeping
I don't use a VPN or proxy (don't have any installed)
It wasn't having issues when I went to bed around 12AM local time and didn't notice issues then
I got to my PC in the morning around 10AM when I noticed the high, changing RAM usage (~3.5GB) and the electron app was extremely slow
diagnostics.log shows connectivity test failures for ~2 hours between 5:25AM and 7:19AM local time
diagnostics.log never shows the connectivity test passing
The Windows Power Plan I was using had Turn off hard disk after 20 minutes (I don't have any spinning hard disks. My C:\ drive is a 1TB NVMe SSD and I have a data D:\ drive that's an 8TB NVMe SSD RAID0 of 4x2TB drives) and Link State Power Management (aka Active State Power Management ASPM was set to Moderate power savings. I'm not sure if either of these would have impacted anything; perhaps the latter could have turned of the ethernet NIC on the motherboard? Note that RAM was still climbing when I signed in around 10AM despite the connectivity outage was around 10AM
I checked the Windows Event Viewer System logs and found no indicators of the power plan turning anything off or on
images.log had quite a bit of logging (~50MB); it has many, consistent logs from 1:30AM to 7:19AM (the last log here was 4 seconds after the last connectivity log in `diagnostics.log)
k3s.log has a periodic COMPACT set of logs every 5 minutes from 1PM the day before (shortly after bootstrapping it seems) and doesn't stop (and is still going at 5m intervals despite the application being slow). I'm guessing this is expected behavior.

My gut feeling is that the 5:25AM network connectivity "outage" is a symptom of the memory leak and the 7:19AM logging outage is a further symptom of the memory leak causing systems to hang. Perhaps the images.log is a sign to the root cause. I will continue to investigate, but hope to get some insight from the core devs.

mikeseese commented 1 month ago

I was able to capture a memory heap snapshot:

I can upload the 2GB file if it's helpful, just let me know. It seems that a big chunk of RAM is just due to creating an FSWatcher that doesn't seem to be cleaning up.

There are two watch functions, one in the Tray class in watchForChanges that directly creates an FSWatcher. There's another one in pkg/rancher-desktop/backend/kube/client.ts under WrappedWatch::watch, but the underlying dependency function doesn't seem to use FS watching.

The Tray::watchForChanges seems like a viable candidate as randomly pausing the application with the inspector does break there. CC @mook-as as you seem to be the author of the "new" logic 19 months ago and seemingly earmarked a change 2 weeks ago

https://github.com/rancher-sandbox/rancher-desktop/blob/9c6a9bd222c62740d0e3460f9618a47a903e2818/pkg/rancher-desktop/main/tray.ts#L132-L159

Theoretically the use of the AbortController here should be closing out the cyclic watching, and FWIW in the heap snapshot, it mentions it might be garbage collect the watcher (not sure how to really read this though):

Regardless, there's 600k+ open handles to this logic and it seems the attempt to prevent the memory leak isn't always working. Any thoughts @mook-as?

mikeseese commented 1 month ago

I was able to verify there is a leak in Tray::watchForChanges by taking a heap snapshot on a fresh start, see that there were 2 FSWatcher instances, change ~/.kube/config, see that there were now 5 FSWatcher instances (for whatever reason, there are 3 events for each file). The number keeps climbing each time the kubeconfig is changed.

I put together a simple fix that prevents this leak with a different approach to the intent of the cyclic nature of watchForChanges of "let's not rebuild the tray config all the time" and will submit a PR shortly.

Edit/fun note: I was a little perplexed why there would be 200k (600k/3 events per change) changes to my kubeconfig in a 24 hour window, but I had the inkling that the watchers don't stop watching at all and still trigger a rewatch. Turns out this is the case. Every "change" (using vscode, make a change save once) will trigger ~4x (it varies slightly; perhaps some do get GCd) the previous number. Less than 10 saves gets you to 600k instances on 4^n.

rancher-sandbox / rancher-desktop