Open ReasonableGoose opened 2 years ago
@ReasonableGoose Thanks for bringing this to our attention. When you say:
This behavior is not consistent, sometimes rancher installs and opens without issues regardless of macOS version. Is that within 1.4.x/1.5.0? You said that this worked fine within 1.2.x so I just want to clarify that statement.
Otherwise, with 1.5.1, I suspect that we didn't touch the underlying code there but if you have time, can you confirm if you get into this state with our newest version?
I am also seeing this same issue on a "security hardened via corporate policies" macOS Monterey 12.5 (intel) using the latest Rancher Desktop 1.5.1.
lima.log (with debugging turned on)
I'll add that I also use Rancher Desktop 1.5.1 on a more out of the box macOS Monterey 12.5 (arm) without any issues at all.
@gaktive Yes, this is within 1.4.x/1.5.0. Some of us had issues with 1.4.x/1.5.0 and some of us didn't. 1.2.x worked for us in cases where 1.4.x/1.5.0 versions weren't working. We are noticing the same issues with 1.5.1 after testing again.
The common thread we are noticing seems to be related to what @amartin120 posted. The systems where this issue occurs are also security hardened.
@gaktive @ReasonableGoose It appears, at least for me, that the issue is caused by a security hardening script that changes a setting in /etc/ssh/sshd_config to: PermitRootLogin no
Doing so seems to break Rancher Desktop on a factory reset or fresh install, with the errors described above. Reverting or commenting out that setting, followed by a factory reset of Rancher Desktop, allowed the startup to complete normally again.
Is it possible to work around this in cases where someone is not allowed to simply re-adjust that setting in /etc/ssh/sshd_config?
I am also experiencing this issue - at least I am seeing very similar behaviour. My system does not have the "PermitRootLogin no" in sshd_config though.
I have built from source and have been spending time trying to debug, but it is proving difficult to figure out why this is occurring. Which commands fail is not consistent from run to run: several different limactl commands will fail, but many (most) succeed. I am also able to run the failing commands from the command line, where they succeed. I have seen even very simple commands like limactl list --json fail with EACCES from time to time.
I unfortunately cannot run dtruss on my machine since I am unable to disable SIP in my environment, which would be the logical next step to debug.
I have managed to fix this issue on my own machine by making two changes to lima.ts, which remove the use of Promise.all in a couple of places and effectively ensure serial execution of the limactl commands. If anyone else with this issue who is building from source could try this change to see whether it resolves the issue, that would be helpful.
I am not completely clear why the concurrent execution of limactl is a problem - I can see that the commands should in theory be able to be run in parallel since they don't really interfere with each other. I checked that I am not running out of file descriptors and that sshd on the VM can accept plenty of sessions (I configured 50 to be sure).
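To illustrate the kind of change described above, here is a minimal sketch of replacing concurrent execution with serial execution. It is not the actual patch to lima.ts: the helper names runSerially and runConcurrently are hypothetical, and each task stands in for one limactl invocation.

```typescript
// A task is any function that starts one asynchronous operation,
// e.g. a single limactl invocation (hypothetical stand-in).
type Task<T> = () => Promise<T>;

// Concurrent form: every task is started immediately, as with Promise.all.
async function runConcurrently<T>(tasks: Task<T>[]): Promise<T[]> {
  return Promise.all(tasks.map(task => task()));
}

// Serial form: the next task is only started once the previous one has settled,
// so at most one child process would be spawned at a time.
async function runSerially<T>(tasks: Task<T>[]): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
  }
  return results;
}
```

Both helpers return results in the order of the input tasks, so swapping one for the other should not change what callers see, only how many operations are in flight at once.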
We were able to resolve our issues with the same solution that @amartin120 suggested for others using RD on our team. We did have one RD user who did not have "PermitRootLogin no" in sshd_config and was still having problems. We noticed that they had an old macos profile that others on the team didn't have, so we removed it. Unfortunately, we didn't review the settings applied by that profile before removing it, but it did resolve the issues with initializing RD.
OK, that is useful to know. I presume that for those who had the issue with sshd_config, the symptom was that no command in the installation that used sudo on the VM would work?
I am trying to reproduce the behaviour I am seeing by writing a simple node app that invokes limactl in similar ways to rancher, but I haven't managed to get that to break yet.
I cannot reproduce this problem by just adding "PermitRootLogin no" to sshd_config. I also don't understand why this would affect ssh connections to the VM; that should only affect ssh connections to the host.
Furthermore, all limactl commands connect using a regular user account; the root user inside the VM doesn't even have a ~/.ssh directory (and therefore no authorized_keys):
$ rdctl shell sudo -i ls -a
. .. .ash_history .docker
Yes, the behaviour I am seeing is definitely not related to ssh or indeed to acquiring root on the VM via sudo. To recap what I see:
I can't spot anything wrong with the code in lima.ts using Promise.all (although I don't think it achieves any material time saving on startup) - my guess is that there is a race condition in a lower level dependency that is being provoked on certain machines. I don't have any other theories that would explain it. I cannot run dtruss on the machine that exhibits the behaviour which is making it hard to diagnose further.
I am trying to reproduce in a standalone node.js app that takes a lot of the code from lima.ts. My hope is that if I can do an intensive test on another machine and reproduce it I will be able to get some dtruss output to narrow down what could be going on.
Other suggestions most welcome.
I can confirm that version 1.4.x and 1.5.x fail with this error for me on Mac 12.5.1 (M1). From the log, all sudo commands fail (example below)
2022-08-29T18:14:26.306Z: Using 192.168.178.51 on bridged network rd0
2022-08-29T18:14:27.798Z: + limactl shell --workdir=. 0 sudo mv ./trivy /usr/local/bin/trivy
2022-08-29T18:14:27.798Z: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
2022-08-29T18:14:27.798Z: Error starting lima: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
at Process.ChildProcess._handle.onexit (node:internal/child_process:282:19)
at onErrorNT (node:internal/child_process:477:16)
at processTicksAndRejections (node:internal/process/task_queues:83:21) {
errno: -13,
code: 'EACCES',
syscall: 'spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
path: '/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
spawnargs: [
'shell',
'--workdir=.',
'0',
'sudo',
'mv',
'./trivy',
'/usr/local/bin/trivy'
]
}
Downgrading to 1.2.x works.
@jibiabraham-dh are you able to build from source? Just wondering if the change I made fixes it for you.
I am on Intel Mac - core i9. Also 12.5.1.
@rgreig Absolutely unbelievable, but this solved my problem as well. I don't really have experience in the codebase but are there processes in the Promise.all that are dependent on one another?
@soundweaverz I am now quite familiar with the codebase (!) at least all the stuff in lima.ts and they are not dependent on each other - i.e. as far as I can see await is used correctly where ordering needs to be enforced.
I have been building a standalone node app that takes the code and dependencies from lima.ts in an attempt to reproduce this issue. Right now I have not managed to reproduce it and I have got quite a lot of the VM setup process running (including obviously the stuff inside Promise.all which was where I started). I will continue doing this since I hate not having a clear root cause! But I guess we could consider merging this change if it fixes what is a blocking issue for some people?
@rgreig Thank you so much for all your effort into tracking this down! I too hate applying a "fix" without understanding why the fix works. Always makes you wonder if similar issues are lurking in other areas of the codebase.
I agree that we should merge your fix if the actual root cause remains elusive. We are currently not planning another release until the end of the month, so there is still some time.
Please create a regular PR for your change when you give up debugging, or let us know if we should do it!
Thanks again!
@jandubois No problem, I am now intrigued by this issue. I will bear in mind the timeline for the next release and continue to debug for a bit longer. I am happy to raise a PR and I will update progress here.
Just to give a quick update on progress in case anyone has any suggestions. I had built a small node app that did everything that lima.ts did in terms of initialising and configuring the VM. Even in a soak test, that app did not reproduce the issue. I then decided to run the same app - unmodified - using the electron cli (in the same way that the rancher app does). That immediately reproduced the issue, and even creating a much simplified version of the functionality I was able to reproduce the problem. Removing Promise.all also "fixed" the issue. I tried updating to the latest version of Electron (20.x) and it made no difference.
This is good up to a point - it shows that the issue lies within electron's implementation of the node process API. I am now building electron from source so that I can instrument it further - but I am confident that this is ultimately an electron issue on macOS.
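For anyone who wants to try a similar experiment, the following is a rough sketch of a concurrent-spawn stress test. It is not the app described above: stressConcurrently and listInstances are hypothetical names, and the sketch assumes that starting many invocations at once and counting error codes is enough to surface the intermittent EACCES. Per the report above, it would presumably need to run under Electron rather than plain node to show the failure.

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Hypothetical helper: start `count` invocations at once and tally the outcomes,
// keyed by 'ok' or by the error code (e.g. 'EACCES') of each failed attempt.
async function stressConcurrently(
  run: () => Promise<unknown>,
  count: number,
): Promise<Record<string, number>> {
  const outcomes = await Promise.all(
    Array.from({ length: count }, () =>
      run().then(
        () => 'ok',
        (err: { code?: unknown }) => String(err?.code ?? 'unknown'),
      ),
    ),
  );
  const tally: Record<string, number> = {};
  for (const key of outcomes) {
    tally[key] = (tally[key] ?? 0) + 1;
  }
  return tally;
}

// `limactl list --json` is one of the commands reported in this thread as
// failing intermittently; the caller supplies the path to the limactl binary.
function listInstances(limactlPath: string): Promise<unknown> {
  return execFileAsync(limactlPath, ['list', '--json']);
}
```

A call such as stressConcurrently(() => listInstances(pathToLimactl), 20) would return a tally of outcomes; any key other than ok indicates failed spawns.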
Another user on Slack seems to have run into the same problem. Here's the thread - https://rancher-users.slack.com/archives/C0200L1N1MM/p1674003141290749
Here's the log as shared on the thread
2023-01-18T00:59:41.070Z: mainEvents settings-update: {"version":4,"kubernetes":{"version":"1.24.4","memoryInGB":24,"numberCPUs":8,"port":6443,"containerEngine":"moby","checkForExistingKimBuilder":false,"enabled":true,"WSLIntegrations":{},"options":{"traefik":false,"flannel":true},"suppressSudo":false,"hostResolver":true,"experimental":{"socketVMNet":false}},"portForwarding":{"includeKubernetesServices":true},"images":{"showAll":true,"namespace":"k8s.io"},"telemetry":false,"updater":true,"debug":true,"pathManagementStrategy":"rcfiles","containerEngine":{"imageAllowList":{"enabled":false,"locked":false,"patterns":[]}},"diagnostics":{"showMuted":false,"mutedChecks":{}}}
2023-01-18T00:59:41.210Z: openMain() webRoot: app://.
2023-01-18T00:59:41.289Z: createWindow() name: main url: app://./index.html
2023-01-18T00:59:42.008Z: Checking if credential helper osxkeychain is working...
2023-01-18T00:59:42.016Z: Credential helper "docker-credential-osxkeychain" is not functional: Error: spawn docker-credential-osxkeychain EACCES
2023-01-18T00:59:43.601Z: ipcMain: "k8s-state" triggered with arguments:
2023-01-18T00:59:44.218Z: ipcMain: "settings-read" triggered with arguments:
2023-01-18T00:59:44.218Z: ipcMain: "settings-read" triggered with arguments:
2023-01-18T00:59:44.218Z: event settings-read in main: {"version":4,"kubernetes":{"version":"1.24.4","memoryInGB":24,"numberCPUs":8,"port":6443,"containerEngine":"moby","checkForExistingKimBuilder":false,"enabled":true,"WSLIntegrations":{},"options":{"traefik":false,"flannel":true},"suppressSudo":false,"hostResolver":true,"experimental":{"socketVMNet":false}},"portForwarding":{"includeKubernetesServices":true},"images":{"showAll":true,"namespace":"k8s.io"},"telemetry":false,"updater":true,"debug":true,"pathManagementStrategy":"rcfiles","containerEngine":{"imageAllowList":{"enabled":false,"locked":false,"patterns":[]}},"diagnostics":{"showMuted":false,"mutedChecks":{}}}
2023-01-18T00:59:44.219Z: ipcMain: "get-app-version" triggered with arguments:
2023-01-18T01:00:08.090Z: Kubernetes was unable to start: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
at Process.ChildProcess._handle.onexit (node:internal/child_process:282:19)
at onErrorNT (node:internal/child_process:477:16)
at processTicksAndRejections (node:internal/process/task_queues:83:21) {
errno: -13,
code: 'EACCES',
syscall: 'spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
path: '/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
spawnargs: [
'--debug',
'shell',
'--workdir=.',
'0',
'sudo',
'rm',
'-f',
'/tmp/rd-nginx.conf-Jo5AdT.nginx.conf'
]
}
2023-01-18T01:00:09.099Z: openDialog() id: KubernetesError
2023-01-18T01:00:09.285Z: createWindow() name: KubernetesError url: app://./index.html#KubernetesError
2023-01-18T01:00:10.216Z: ipcMain: "k8s-state" triggered with arguments:
2023-01-18T01:00:10.410Z: ipcMain: "get-app-version" triggered with arguments:
2023-01-18T01:36:30.804Z: openMain() webRoot: app://.
2023-01-18T01:36:32.615Z: ipcMain: "show-logs" triggered with arguments:
Actual Behavior
While starting Rancher desktop with kubernetes enabled, we're seeing the following error:
We're also noticing that this issue exists for both 1.5.0 and 1.4.1. We're able to run 1.2.1 without issues.
This behavior is not consistent, sometimes rancher installs and opens without issues regardless of macOS version.
Steps to Reproduce
Install Rancher Desktop 1.5.0 or 1.4.1 on macOS Catalina, Big Sur, or Monterey and open it.
Result
lima.log
Expected Behavior
Rancher desktop starts without errors
Additional Information
Attempts to resolve:
Update file permissions
Update file ownership in the VM
Upgrade from Rancher Desktop 1.2.1 to Rancher Desktop 1.5.0
Update from Rancher Desktop 1.2.1 to Rancher Desktop 1.4.1
Attempt to restart kubernetes after failure occurs
Attempt to reset kubernetes and restart after failure occurs
Attempt to restart Rancher Desktop 1.5.0 after failure occurs
Attempt to factory reset Rancher Desktop 1.5.0 and restart
Attempt to factory reset Rancher Desktop 1.5.0 and reinstall 1.5.0
Turn off and on again -- suddenly worked and then failed after restart
Attempt to install Rancher Desktop 1.5.0 on a new user profile
Rancher Desktop Version
1.5.0
Rancher Desktop K8s Version
1.24.3
Which container engine are you using?
moby (docker cli)
What operating system are you using?
macOS
Operating System / Build Version
macOS Monterey, macOS Big Sur, macOS Catalina
What CPU architecture are you using?
x64
Linux only: what package format did you use to install Rancher Desktop?
No response
Windows User Only
No response