Open ryancurrah opened 1 year ago
Attached are the logs we captured when we reproduced the issue.
I can't reproduce this on macOS 13.1 on M1 either. I've done a factory reset, rebooted the host, did another factory reset, and the command always worked fine.
I've looked at the logs, and can't spot anything in there either.
On the "reproducible laptop" does this also happen after a factory reset? Or after rebooting the host?
Are there any errors in any of the networking logs at ~/Library/Application Support/rancher-desktop/lima/_networks?
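Something like this should surface obvious failures in those logs (the exact file names vary between installs, so treat it as a sketch):
# search the Lima networking logs recursively for error-ish lines
grep -riE 'error|fail' ~/Library/Application\ Support/rancher-desktop/lima/_networks/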
I am getting our IT team to send me an M1 Macbook so I can try to reproduce this issue. Another dev reported the same issue this morning. Not sure what they were doing to cause it though.
On the "reproducible laptop" it happens even after a factory reset, reboot, and fresh re-install.
The dev with the reproducible laptop needs to get some work done, so they have uninstalled it for now. ~I am going to get our devs to post here when they get a freezing issue~. Meanwhile, I will try to get that laptop and reproduce it.
Thank you so much; this will be really helpful, as I've been unable to repro this myself.
Maybe also take a look at any anti-malware technology installed on your machines; maybe that is interfering with the virtualization code?
I have the same problem. I have tried a factory reset, reinstall, reboot everything, but rancher still hangs.
My colleagues who have the same anti-virus software installed did not have the problem.
Hi, I'm able to reproduce this frequently on my M1 running Monterey 12.6.1 / RD 1.7.0 / k8s 1.25.4 / Traefik disabled. What logs can I provide from ~/Library/Logs/rancher-desktop to help debug this? Currently the RD UI shows Kubernetes is running, but kubectl commands time out with Unable to connect to the server: net/http: TLS handshake timeout
Tried quitting Rancher Desktop and restarting a couple of times, but same problem. I could restart the laptop and the problem might go away; I may need to do that to avoid being blocked with my work and/or look at minikube (which doesn't have a nice UI). But I'm happy to provide logs and keep the laptop in this reproducible state for the next 24 hours or so if it helps.
Tailed logs from the time it started to the time it stopped working:
1. steve.log
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=RoleBinding"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apiregistration.k8s.io/v1, Kind=APIService"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for /v1, Kind=Pod"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apps/v1, Kind=Deployment"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for events.k8s.io/v1, Kind=Event"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for /v1, Kind=PodTemplate"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for apps/v1, Kind=StatefulSet"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for batch/v1, Kind=CronJob"
time="2023-01-16T11:09:37-08:00" level=info msg="Watching metadata for acme.cert-manager.io/v1, Kind=Order"
…
….. first sign of trouble ….
….
2023-01-16T19:10:04.881Z: stderr: time="2023-01-16T11:10:04-08:00" level=error msg="Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]"
2023-01-16T19:13:01.329Z: stderr: W0116 11:13:01.327098 46860 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0116 11:13:01.327114 46860 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
….
…. many of these …..
….
W0116 11:13:01.328829 46860 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0116 11:13:01.328880 46860 reflector.go:443] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: watch of *summary.SummarizedObject ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
….
…. TLS handshake timeouts. Roughly after this point, kubectl stops working …..
….
2023-01-16T19:13:12.133Z: stderr: W0116 11:13:12.132748 46860 reflector.go:325] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/cert-manager.io/v1/certificates?resourceVersion=160294": net/http: TLS handshake timeout
W0116 11:13:12.132851 46860 reflector.go:325] pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?resourceVersion=160231": net/http: TLS handshake timeout
I0116 11:13:12.132905 46860 trace.go:205] Trace[631373749]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.24.0-rancher1/tools/cache/reflector.go:168 (16-Jan-2023 11:13:02.130) (total time: 10002ms):
Trace[631373749]: ---"Objects listed" error:Get "https://127.0.0.1:6443/apis/node.k8s.io/v1/runtimeclasses?resourceVersion=160231": net/http: TLS handshake timeout 10002ms (11:13:12.132)
Trace[631373749]: [10.002143209s] [10.002143209s] END
2. k3s.log
E0117 04:26:35.226050 4290 reflector.go:140] k8s.io/client-go@v1.25.4-k3s1/tools/cache/reflector.go:169: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
W0117 04:26:36.046392 4290 reflector.go:424] k8s.io/client-go@v1.25.4-k3s1/tools/cache/reflector.go:169: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
E0117 04:26:36.046516 4290 reflector.go:140] k8s.io/client-go@v1.25.4-k3s1/tools/cache/reflector.go:169: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
{"level":"warn","ts":"2023-01-17T04:26:36.183Z","logger":"etcd-client","caller":"v3@v3.5.3-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x400167d880/kine.sock","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
E0117 04:26:36.183408 4290 controller.go:187] failed to update lease, error: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0117 04:26:36.183651 4290 writers.go:118] apiserver was unable to write a JSON response: http: Handler timeout
E0117 04:26:36.185775 4290 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
I0117 04:26:36.185091 4290 trace.go:205] Trace[333656479]: "GuaranteedUpdate etcd3" audit-id:0a94d052-49c1-40c2-a1f3-8bdacccbd6e9,key:/leases/kube-node-lease/lima-rancher-desktop,type:*coordination.Lease (17-Jan-2023 04:26:26.184) (total time: 10000ms):
Trace[333656479]: ---"Txn call finished" err:context deadline exceeded 9999ms (04:26:36.185)
Trace[333656479]: [10.000193713s] [10.000193713s] END
E0117 04:26:36.197602 4290 finisher.go:175] FinishRequest: post-timeout activity - time-elapsed: 13.941958ms, panicked: false, err: context deadline exceeded, panic-reason: <nil>
E0117 04:26:36.196928 4290 writers.go:131] apiserver was unable to write a fallback JSON response: http: Handler timeout
I0117 04:26:36.199085 4290 trace.go:205] Trace[1183966381]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop,user-agent:k3s/v1.25.4+k3s1 (linux/arm64) kubernetes/0dc6333,audit-id:0a94d052-49c1-40c2-a1f3-8bdacccbd6e9,client:127.0.0.1,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (17-Jan-2023 04:26:26.183) (total time: 10015ms):
Trace[1183966381]: ---"Write to database call finished" len:509,err:Timeout: request did not complete within requested timeout - context deadline exceeded 9998ms (04:26:36.183)
Trace[1183966381]: [10.015928213s] [10.015928213s] END
E0117 04:26:36.199699 4290 timeout.go:141] post-timeout activity - time-elapsed: 16.136125ms, PUT "/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/lima-rancher-desktop" result: <nil>
Note: we have been able to avoid this hanging issue by switching to the 9p mount type in Lima. I'm not sure if it completely fixes it or just makes it occur less often; time will tell from our users. But my suggestion to others affected by this is to try the 9p mount. One caveat, though: the 9p mount does not support symlinks in volumes.
@ryancurrah how do you enable 9p? I read about it here, i.e.:
On macOS an alternative file sharing mechanism using 9p instead of reverse-sshfs has been implemented. It is disabled by default. Talk to us on Slack if you want to help us testing it.
But I wasn't able to find the specifics on how to enable it.
I have the same problem.
In detail, a co-worker and I upgraded macOS to 13.0 and the issue started appearing. We upgraded to 13.1; his machine recovered, but mine did not.
I finally recovered by switching mountType to 9p.
Docker containers ran normally with pure Lima installed via Homebrew, even though its mountType is null.
@lakamsani edit this file and add a top-level mountType entry:
~/Library/Application Support/rancher-desktop/lima/_config/override.yaml
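As later comments in this thread confirm, the file just needs a single top-level key, e.g.:
# ~/Library/Application Support/rancher-desktop/lima/_config/override.yaml
---
mountType: 9p
(Presumably Rancher Desktop / the Lima VM has to be restarted for the override to take effect.)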
I ran into the same issue too, when doing a "pnpm install" in a Docker container after mounting a custom workdir into Lima, on my macOS 13.1 (Intel). So I think this is not related to Intel or M1. I can reproduce this issue exactly, every time, using the same steps. I also checked the logs under Rancher Desktop; no errors seem to be logged.
For me, the hang only occurs when using the default mountType (which should be null, from ~/Library/Application Support/rancher-desktop/lima/0/lima.yaml) and running some npm install command inside a Docker container with a -v custom volume mount. I also wrote a Dockerfile to do almost the same thing to test, but the problem disappeared. Finally I changed the Lima mountType to 9p and everything seems to be OK now.
This happened after upgrading to Ventura 13.2, coming from 12.x. I never ran into this problem on 12.x.
I'm running into the same issue. I'm doing a massive amount of file activity along with network inside a container. The IO gets hung, and then docker ps becomes unresponsive. I try to quit the desktop, which hangs; to get it to quit properly:
ps auxww |grep rancher | grep ssh |awk '{print $2}' | xargs kill
On restart, qemu looks like it comes up properly, but the docker socket is still unresponsive. A second quit and restart works fine. I guess I'll try the 9p thing. I don't have an override.yaml, so I'm assuming it should look like:
---
mountType: 9p
Answered my own question:
cat ~/"Library/Application Support/rancher-desktop/lima/_config/override.yaml"
---
mountType: 9p
ps auxww |grep rancher | grep ssh shows nothing now while using disk io
Hello, experiencing same issue, but on intel CPU and macOS Ventura....FYI
I should have clarified that I'm on Intel as well. The 9p change made a huge difference.
Unfortunately the 9p mount caused other issues, so it's unusable for me.
update: upgraded to Ventura 13.2 and don't have the "freezing" problem anymore without any override...
Met the same hang problem on 13.2 on an Intel Mac: docker freezing, can't quit Rancher Desktop.
In a terminal, do a ps and grep for rancher. You will see a bunch of ssh sessions; kill them off and Rancher will become responsive. Once I made the change to 9p, all these hang issues went away.
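For reference, the one-liner from the earlier comment does exactly that:
# find the ssh sessions spawned for the rancher mounts and kill them
ps auxww | grep rancher | grep ssh | awk '{print $2}' | xargs kill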
Thanks, after adding a new override.yaml it works for me!
cat ~/Library/Application\ Support/rancher-desktop/lima/_config/override.yaml
---
mountType: 9p
I have been experiencing a similar problem on and off for the past month or two. Was originally discussing in the rancher-desktop slack channel, but after finding this issue I believe it's the same as what I'm experiencing.
I find the bug to be easily reproducible in my case:
Rancher Desktop: 1.8.1
macOS: Ventura 13.1
Container runtime: dockerd (moby) [I have not tested recently with containerd/nerdctl - will try this]
Rancher Kubernetes: disabled (doesn't matter; I've seen this issue with k8s enabled as well)
I get the same behavior as described above: existing containers freeze and virtually all commands hang (docker ps, docker image ls, rdctl shell; nothing works except simple stuff like docker version).
Here is what I can note about reproducing the problem (at least in my case): a container run with docker run -it and a few env vars passed in (probably not relevant).
About the suggested workaround: the mountType: 9p workaround did successfully prevent the container runtime from hanging; however, it caused my terraform provider to fatally crash (every time), so this method is unusable for me.
Same here: Rancher Version: 1.9.1, Ventura 13.4.1 (c)
Likewise, Rancher Desktop randomly freezes for me, more often than not after I leave it running without use for a while, and neither nerdctl nor rdctl commands will respond until I restart the application (tearing down the VM, etc.).
I'm currently on Rancher Desktop 1.9.1 & on macOS Ventura 13.5.1, running on Apple silicon (M2 Pro). I don't have Kubernetes enabled, and I'm using the containerd runtime, with VZ emulation (Rosetta support enabled) & virtiofs mounting (I did have other types of problems before when using 9p, mostly related to user mappings & permissions, so I'd like to avoid going back to that, and reverse-sshfs was unbearably slow!).
Let me know if you'd like me to gather any information when RD hangs, for debugging purposes. Thanks!
Same issue here. Exactly same environment as @juanpalaciosascend (but M1 pro)
Same for me, factory reset did fix it for me though.
A factory reset fixes it because it probably reverts to QEMU, reverse-sshfs, etc., but if you re-apply the settings mentioned (VZ, virtiofs, ...), the problem will probably come back.
Since switching back to the dockerd (moby) runtime, away from containerd, I've seen most of the problems I've been experiencing go away... I want to say entirely, but it might still be a little too early for that.
All other settings (e.g. VZ framework, Rosetta support enabled, virtiofs volumes, Kubernetes disabled, etc.) remain the same, so that leads me to believe the problem that's causing Rancher Desktop to freeze revolves around the use of containerd.
Same here
Rancher 1.10.0 M1 Ventura 13.5.2
same issue (1.10.0 / 13.5.2 / M1 Pro)
same issue here 1.10.0/ m1 pro/ sonoma 14.0
same issue 1.10.0/ 13.5.1 / m1 pro
same issue Ventura 13.6 / M1 Pro / 1.10.0 / VZ. However, I hit the same problems in lima/colima, so the problem is not in Rancher itself
same issue 13.4.1 / M1 Pro / 1.11.0
same issue with Ventura 13.5.2
rancher desktop 1.11.1, M1 Ventura 13.6.3
The process kill suggested in https://github.com/rancher-sandbox/rancher-desktop/issues/3777#issuecomment-1428673201 helps: a newly started Rancher Desktop does not hang for some time (a few days).
After the Rancher Desktop start, this is the disk usage, so the hang is probably not due to some limit being exceeded.
docker system df
TYPE TOTAL ACTIVE SIZE RECLAIMABLE
Images 23 2 21.46GB 19.95GB (92%)
Containers 4 0 986B 986B (100%)
Local Volumes 4 2 0B 0B
Build Cache 1178 0 18.44GB 18.44GB
It seems like tracing docker with dtruss is not feasible without disabling SIP (system integrity protection) https://www.deepanseeralan.com/tech/fun-with-dtruss-macOS/
sudo dtruss $(which docker) system df
Password:
dtrace: system integrity protection is on, some features will not be available
dtrace: failed to execute /Users/user/.rd/bin/docker: Operation not permitted
cp -v $(which docker) /tmp
/Users/user/.rd/bin/docker -> /tmp/docker
codesign --remove-signature /tmp/docker
codesign --display --verbose=4 /tmp/docker
/tmp/docker: code object is not signed at all
sudo dtruss /tmp/docker system df
Password:
dtrace: system integrity protection is on, some features will not be available
dtrace: failed to execute /tmp/docker: Could not create symbolicator for task
Maybe someone manages to trace docker to get some more information about the hang; otherwise I'm afraid we won't make progress on this issue.
I've had Rancher Desktop 1.12.0 installed since yesterday and haven't encountered the issue again (on MacOS Ventura 13.6.3).
With 1.11.1, I was encountering this issue pretty much immediately when using VSCode dev containers and the only "fix" was setting the mountType to 9p, which broke dev containers in other ways and made them equally unusable.
I'm experiencing the hanging issue with 1.12.1.
Still an issue with rancher-desktop 1.12.2.
An additional piece of information: hanging possibly happens more often when emulating amd64 using export DOCKER_DEFAULT_PLATFORM=linux/amd64
Rancher 1.12.3, macOS Sonoma 14.3.1, and this is still hanging.
I have already tried several configurations, such as VZ emulation with Rosetta support enabled and virtiofs volumes, but no luck...
Any luck on this? Experienced it in Sonoma.
I am experiencing this as well. OS: Sonoma, Apple silicon. Rancher: 1.13.1
By the way, I fixed mine by setting emulation to VZ in Sonoma. (forgot to post it 😅)
@vaniiiiiiii 's fix worked for me as well.
VZ worked for me. M1 Sonoma
Let's keep this issue open for a little longer; if it doesn't work with QEMU then it is still a bug.
I still experience this, though more intermittently due to my attempted workarounds. It feels like a memory issue because the repro is hard to predict. After a fresh restart of Rancher (i.e. rdctl shutdown, rdctl start) it seems to work fine, but after some indeterminate amount of time, it will hang again.
Rancher 1.15.1
Default Hardware config: 5 GB, 2 CPUs
Emulation: VZ w/ Rosetta enabled
Container Engine: dockerd (moby)
M3 Mac, Sonoma 14.6.1
I regularly pull multi-layer images like https://hub.docker.com/r/rstudio/rstudio-workbench with the --platform linux/amd64 flag. I thought I found a workaround by adding "max-concurrent-downloads": 1 to /etc/docker/daemon.json via rdctl shell, but that eventually failed as well.
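For reference, a minimal sketch of what /etc/docker/daemon.json looks like with that setting (edited inside the VM via rdctl shell; dockerd presumably has to be restarted to pick it up):
{
  "max-concurrent-downloads": 1
}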
I wrote this script to pull several images and then prune them.
#!/bin/bash
for version in 1.4.1717-3 \
2021.09.0-351.pro6 \
2021.09.1-372.pro1 \
2021.09.2-382.pro1 \
2022.02.0-443.pro2 \
2022.02.1-461.pro1 \
2022.02.2-485.pro2 \
2022.02.3-492.pro3 \
2022.07.0-548.pro5 \
2022.07.1-554.pro3 \
bionic-2022.07.2 \
bionic-2022.12.0 \
bionic-2023.03.0 \
jammy-2023.03.2 \
jammy-2023.03.1 \
jammy-2023.06.2 \
jammy-2023.06.1 \
jammy-2023.06.0 \
jammy-2023.09.1 \
jammy-2023.09.0 \
jammy-2023.12.1 \
jammy-2023.12.0 \
jammy-2024.04.2 \
jammy-2024.04.1 \
jammy-2024.04.0
do
docker pull --platform linux/amd64 rstudio/rstudio-workbench:$version
done
docker image prune -af
(If this script succeeds, try waiting several hours before re-running. I had to wait a day for it to repro.)
When the script fails, it will display the following:
Cannot connect to the Docker daemon at unix:///Users/jay/.rd/docker.sock. Is the docker daemon running?
Subsequent docker commands will fail with the same message.
To recover Rancher, run rdctl shutdown, wait for it to quit entirely, then run rdctl start (or open it via /Applications).
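In script form, the recovery sequence described above is just:
rdctl shutdown
# wait until the VM and the UI have exited completely, then
rdctl start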
(Note: Activity Monitor will show "Virtual Machine Service for limactl.ventura" has consumed all the memory allotted to it in the Hardware Configuration.)
The above Cannot connect to the Docker daemon at unix:///Users/user/.rd/docker.sock. Is the docker daemon running? may be a different issue. This one is about any docker command hanging. In my case even the "Quit Rancher Desktop" UI option was unresponsive - it would not quit Rancher Desktop; I waited more than 5 minutes.
On my side, I've not experienced hanging since March (with QEMU), and realized I'm still using Rancher Desktop 1.12.3, with a regularly up-to-date macOS Sonoma (now 14.6.1). There were other problems with docker build, like Debian amd64 emulation being very slow (the next build taking 20 minutes compared to less than a minute on native aarch64), which made me eventually increase Virtual Machine -> Hardware -> Memory (GB) to 16.
Since then, the hanging has not reappeared yet, and the build takes only a few minutes. I kept the other options as before: 4 CPUs, Volumes -> Mount Type -> reverse-sshfs, Emulation -> Virtual Machine Type -> QEMU, and Container Engine -> dockerd (moby).
Actual Behavior
When running a docker command it will hang forever. Any subsequent commands to docker in another shell hang as well. Rebooting the laptop is required as Rancher Desktop becomes unusable.
Steps to Reproduce
One dev on an M1 Mac running Ventura 13.1 can reproduce this issue consistently by building a Dockerfile in docker. We, however, are unable to reproduce the same issue on our laptops consistently. One of our team members reproducing it is using an M1 Mac as well.
1. Create a Dockerfile
2. Build the Dockerfile in docker (see the hypothetical sketch below)
Result
The terminal just hangs.
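A purely hypothetical sketch of these steps (the actual Dockerfile contents were not shared in this issue, so this is only illustrative):
# hypothetical repro - the real Dockerfile was not posted here
cat > Dockerfile <<'EOF'
FROM alpine:3.17
RUN apk add --no-cache curl
EOF
docker build -t hang-repro .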
Expected Behavior
Docker commands not to hang.
Additional Information
Our developers started using Rancher Desktop in November 2022. It was working well; no hanging issues were reported. Once people started updating to Ventura at the beginning of the month (January), they started reporting these issues. We have one developer who is able to consistently reproduce the issue; some of us can only reproduce it intermittently. It seems to be most reproducible on M1 Macs, though. We were also able to reproduce it with our security tools disabled.
We enabled debug logging from the Rancher Desktop Troubleshooting page and looked at all the logs, lima and rancher, and did not see any glaring errors or warnings.
If there is anything else we can provide to help with this, let me know.
Rancher Desktop Version
1.7.0
Rancher Desktop K8s Version
Disabled
Which container engine are you using?
moby (docker cli)
What operating system are you using?
macOS
Operating System / Build Version
Ventura 13.1
What CPU architecture are you using?
arm64 (Apple Silicon)
Linux only: what package format did you use to install Rancher Desktop?
None
Windows User Only
No response