Issues initializing Rancher Desktop due to permission issues

ReasonableGoose commented 2 years ago

Actual Behavior

While starting Rancher Desktop with Kubernetes enabled, we're seeing the following error:

Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES

    Installing trivy & CA certs

    spawnargs: [
        'shell',
        '--workdir=.',
        '0',
        'sudo',
        'mv',
        './trivy',
        '/usr/local/bin/trivy'
    ]
    }

We're also noticing that this issue exists for both 1.5.0 and 1.4.1. We're able to run 1.2.1 without issues.

This behavior is not consistent, sometimes rancher installs and opens without issues regardless of macOS version.

Steps to Reproduce

Install Rancher Desktop 1.5.0 or 1.4.1 on macOS Catalina, Big Sur, or Monterey and open it.

Result

lima.log

Expected Behavior

Rancher Desktop starts without errors

Additional Information

Attempts to resolve:

1. Check for quarantined files for Rancher\ Desktop.app

```sh
> xattr /Applications/Rancher\ Desktop.app/
com.apple.quarantine
> xattr -r -d com.apple.quarantine /Applications/Rancher\ Desktop.app/
> xattr /Applications/Rancher\ Desktop.app/
```

- Restart Kubernetes

2. Make the directory that complains about permissions issues with

```sh
> rdctl shell
> sudo mkdir -p /etc/rancher/desktop
```

- Restart Kubernetes

3. Update file permissions

```sh
> chmod -R 755 /Applications/Rancher\ Desktop.app
```

4. Update file ownership in the VM

```sh
chown <host_user>:admin <path>/<to>/trivy
```

- Restart Kubernetes

We didn't get the same error after attempting this, but did get a different error which I cannot find now.

5. Upgrade from Rancher Desktop 1.2.1 to Rancher Desktop 1.5.0
- Install RD 1.2.1, let automatic upgrade run
6. Update from Rancher Desktop 1.2.1 to Rancher Desktop 1.4.1
- Install RD 1.2.1, manually install 1.4.1
7. Attempt to restart Kubernetes after failure occurs
8. Attempt to reset Kubernetes and restart after failure occurs
9. Attempt to restart Rancher Desktop 1.5.0 after failure occurs
10. Attempt to factory reset Rancher Desktop 1.5.0 and restart
11. Attempt to factory reset Rancher Desktop 1.5.0 and reinstall 1.5.0
12. Turn off and on again: suddenly worked, and then failed after restart
13. Attempt to install Rancher Desktop 1.5.0 on a new user profile
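
The permission fixes tried above (chmod, removing the quarantine attribute) assume that the current user cannot execute the limactl binary. A minimal TypeScript sketch for checking that assumption is shown here. `isExecutable` is a hypothetical helper, not part of Rancher Desktop, and the path is the one from the error message.

```typescript
import { accessSync, constants } from "node:fs";

/**
 * Returns true when the current user is allowed to execute the file at
 * `path`. Hypothetical helper: it only asks the OS about access rights
 * and cannot detect a spawn that fails for some other reason.
 */
function isExecutable(path: string): boolean {
  try {
    accessSync(path, constants.X_OK);
    return true;
  } catch {
    // Missing file, missing execute permission, or unreadable parent directory.
    return false;
  }
}

// Path copied from the error message in this report.
const limactlPath =
  "/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl";
console.log(`limactl executable: ${isExecutable(limactlPath)}`);
```

If this reports true while spawning the same path still fails with EACCES, the file mode itself is unlikely to be the cause.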

Rancher Desktop Version

1.5.0

Rancher Desktop K8s Version

1.24.3

Which container engine are you using?

moby (docker cli)

What operating system are you using?

macOS

Operating System / Build Version

macOS Monterey, macOS Big Sur, macOS Catalina

What CPU architecture are you using?

x64

Linux only: what package format did you use to install Rancher Desktop?

No response

Windows User Only

No response

gaktive commented 2 years ago

@ReasonableGoose Thanks for bringing this to our attention. When you say:

"This behavior is not consistent, sometimes rancher installs and opens without issues regardless of macOS version."

Is that within 1.4.x/1.5.0? You said that this worked fine within 1.2.x so I just want to clarify that statement.

Otherwise, with 1.5.1, I suspect that we didn't touch the underlying code there but if you have time, can you confirm if you get into this state with our newest version?

amartin120 commented 2 years ago

I am also seeing this same issue on a "security hardened via corporate policies" macOS Monterey 12.5 (intel) using the latest Rancher Desktop 1.5.1.

lima.log (with debugging turned on)

I'll add that I also use Rancher Desktop 1.5.1 on a more out of the box macOS Monterey 12.5 (arm) without any issues at all.

ReasonableGoose commented 2 years ago

@gaktive Yes, this is within 1.4.x/1.5.0. Some of us had issues with 1.4.x/1.5.0 and some of us didn't. 1.2.x worked for us in cases where 1.4.x/1.5.0 versions weren't working. We are noticing the same issues with 1.5.1 after testing again.

The common thread we are noticing seems to be related to what @amartin120 posted. The systems where this issue occurs are also security hardened.

amartin120 commented 2 years ago

@gaktive @ReasonableGoose It appears that, at least for me, the issue is caused by a security hardening script that modifies a setting in /etc/ssh/sshd_config to:

PermitRootLogin no

Doing so seems to break Rancher Desktop on a factory reset or fresh install, with the errors described above. Reverting or commenting out that setting, along with a factory reset of Rancher Desktop, allowed the startup to complete normally again.

Is it possible to work around this in the instances where someone is not allowed to simply re-adjust that setting in /etc/ssh/sshd_config?
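
To check whether a machine has this setting, the global section of sshd_config can be scanned for the keyword. This is a minimal sketch: `permitRootLogin` is a hypothetical helper, it ignores Include files and anything inside Match blocks, and it only reports the setting without saying whether the setting is the cause.

```typescript
/**
 * Returns the PermitRootLogin value from the global section of an
 * sshd_config text, or undefined when the keyword is absent or commented
 * out. sshd uses the first value it obtains for a keyword, so the first
 * match wins. Hypothetical helper for illustration only.
 */
function permitRootLogin(configText: string): string | undefined {
  for (const rawLine of configText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank line or comment
    const parts = line.split(/[\s=]+/);
    const keyword = (parts[0] ?? "").toLowerCase(); // keywords are case-insensitive
    if (keyword === "match") break; // conditional blocks are out of scope here
    if (keyword === "permitrootlogin") return parts[1];
  }
  return undefined;
}

const sample = ["# PermitRootLogin prohibit-password", "PermitRootLogin no"].join("\n");
console.log(permitRootLogin(sample)); // prints "no"
```

On a real machine the input would be the contents of /etc/ssh/sshd_config, read with whatever privileges that file requires.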

rgreig commented 2 years ago

I am also experiencing this issue - at least I am seeing very similar behaviour. My system does not have the "PermitRootLogin no" in sshd_config though.

I have built from source and have been spending time trying to debug, but it is proving difficult to figure out why this is occurring. Which commands fail is not consistent from run to run: several different limactl commands fail, but many (most) succeed. I am also able to run the failing commands from the command line, where they succeed. I have even seen very simple commands like limactl list --json fail with EACCES from time to time.

I unfortunately cannot run dtruss on my machine since I am unable to disable SIP in my environment, which would be the logical next step to debug.

rgreig commented 2 years ago

I have managed to fix this issue on my own machine by making two changes to lima.ts, which remove the use of Promise.all in a couple of places and effectively ensure serial execution of the limactl commands. If anyone else with this issue is building from source, it would be helpful if you could try this change to see whether it resolves the issue.

I am not completely clear why the concurrent execution of limactl is a problem - I can see that the commands should in theory be able to be run in parallel since they don't really interfere with each other. I checked that I am not running out of file descriptors and that sshd on the VM can accept plenty of sessions (I configured 50 to be sure).

lima.ts.patch.txt
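
For readers without the attachment, the general shape of such a change can be sketched as follows. This is not the attached patch: `runConcurrently`, `runSerially` and `makeTasks` are hypothetical names, and each task stands in for one limactl invocation.

```typescript
type Task<T> = () => Promise<T>;

/** Starts every task immediately and waits for all of them (the Promise.all pattern). */
function runConcurrently<T>(tasks: Task<T>[]): Promise<T[]> {
  return Promise.all(tasks.map((task) => task()));
}

/** Awaits each task before starting the next, so at most one is in flight. */
async function runSerially<T>(tasks: Task<T>[]): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task());
  }
  return results;
}

/** Builds dummy tasks that record how many of them are running at once. */
function makeTasks(count: number, stats: { running: number; peak: number }): Task<number>[] {
  return Array.from({ length: count }, (_, i) => async () => {
    stats.running += 1;
    stats.peak = Math.max(stats.peak, stats.running);
    await new Promise<void>((resolve) => setTimeout(resolve, 5)); // simulated work
    stats.running -= 1;
    return i;
  });
}

const serialStats = { running: 0, peak: 0 };
runSerially(makeTasks(4, serialStats)).then(() => {
  console.log(`serial peak in flight: ${serialStats.peak}`);
});
```

Both helpers return results in task order; the only difference is how many tasks overlap in time.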

ReasonableGoose commented 2 years ago

We were able to resolve our issues with the same solution that @amartin120 suggested for others using RD on our team. We did have one RD user who did not have "PermitRootLogin no" in sshd_config and was still having problems. We noticed that they had an old macOS profile that others on the team didn't have, so we removed it. Unfortunately, we didn't review the settings applied by that profile before removing it, but removing it did resolve the issues with initializing RD.

rgreig commented 2 years ago

OK, that is useful to know. I presume that for those who had the issue with sshd_config, the symptom was that no command in the installation that used sudo on the VM would work?

I am trying to reproduce the behaviour I am seeing by writing a simple Node app that invokes limactl in similar ways to Rancher Desktop, but I haven't managed to get that to break yet.

jandubois commented 2 years ago

I cannot reproduce this problem by just adding "PermitRootLogin no" to sshd_config. I also don't understand why this would affect ssh connections to the VM; that should only affect ssh connections to the host.

Furthermore, all limactl commands connect using a regular user account; the root user inside the VM doesn't even have a ~/.ssh directory (and therefore no authorized_keys):

$ rdctl shell sudo -i ls -a
.             ..            .ash_history  .docker

rgreig commented 2 years ago

Yes, the behaviour I am seeing is definitely not related to ssh or indeed to acquiring root on the VM via sudo. To recap what I see:

- Random failures of commands that are executed within Promise.all in lima.ts. All fail with EACCES errors, as per the original poster's log attachment.
- It fails roughly 8 or 9 times out of 10 on my machine.
- Eliminating the Promise.all (and just running all the commands serially), using the patch file I shared in an earlier comment, means success 100% of the time.
- Executing the failed commands immediately from an interactive shell always succeeds. There is definitely no permission error or issue with the files that have been created.
- I have tried on other machines, but not one with exactly the same spec as mine, and have not been able to reproduce it yet.

I can't spot anything wrong with the code in lima.ts using Promise.all (although I don't think it achieves any material time saving on startup) - my guess is that there is a race condition in a lower level dependency that is being provoked on certain machines. I don't have any other theories that would explain it. I cannot run dtruss on the machine that exhibits the behaviour which is making it hard to diagnose further.

I am trying to reproduce in a standalone node.js app that takes a lot of the code from lima.ts. My hope is that if I can do an intensive test on another machine and reproduce it I will be able to get some dtruss output to narrow down what could be going on.
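
A standalone harness of this kind could look roughly like the sketch below. It is not the app described in this comment: `spawnOnce` and `soak` are hypothetical names, and the only real API used is Node's `child_process.spawn`.

```typescript
import { spawn } from "node:child_process";

/**
 * Spawns `command` once and resolves with "ok", "exit <code>", or the
 * error code of a failed spawn (for example "EACCES" or "ENOENT").
 */
function spawnOnce(command: string, args: string[]): Promise<string> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: "ignore" });
    // Emitted when the process could not be started at all.
    child.once("error", (err: NodeJS.ErrnoException) => resolve(err.code ?? "UNKNOWN"));
    // Emitted when a started process ends.
    child.once("exit", (code) => resolve(code === 0 ? "ok" : `exit ${code}`));
  });
}

/** Launches `count` copies of the command at once and tallies the outcomes. */
async function soak(command: string, args: string[], count: number): Promise<Map<string, number>> {
  const outcomes = await Promise.all(
    Array.from({ length: count }, () => spawnOnce(command, args)),
  );
  const tally = new Map<string, number>();
  for (const outcome of outcomes) {
    tally.set(outcome, (tally.get(outcome) ?? 0) + 1);
  }
  return tally;
}
```

For example, `soak(limactlPath, ["list", "--json"], 20).then(console.log)` would show how many of twenty concurrent `limactl list --json` calls failed to spawn, where `limactlPath` is the binary path from the error message.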

Other suggestions most welcome.

jibiabraham-dh commented 2 years ago

I can confirm that versions 1.4.x and 1.5.x fail with this error for me on macOS 12.5.1 (M1). From the log, all sudo commands fail (example below):

2022-08-29T18:14:26.306Z: Using 192.168.178.51 on bridged network rd0
2022-08-29T18:14:27.798Z: + limactl shell --workdir=. 0 sudo mv ./trivy /usr/local/bin/trivy
2022-08-29T18:14:27.798Z: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
2022-08-29T18:14:27.798Z: Error starting lima: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
    at Process.ChildProcess._handle.onexit (node:internal/child_process:282:19)
    at onErrorNT (node:internal/child_process:477:16)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  errno: -13,
  code: 'EACCES',
  syscall: 'spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
  path: '/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
  spawnargs: [
    'shell',
    '--workdir=.',
    '0',
    'sudo',
    'mv',
    './trivy',
    '/usr/local/bin/trivy'
  ]
}

Downgrading to 1.2.x works.
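
Note that the `syscall` field in the log above begins with `spawn`, which means the host could not start limactl at all, so the `sudo mv` inside the VM never ran. A minimal sketch of how such errors could be told apart in code, assuming the error shape shown in the log; `SpawnError` and `isSpawnPermissionError` are hypothetical names:

```typescript
/** Shape of the error object printed in the log (fields as shown there). */
interface SpawnError extends Error {
  errno?: number;
  code?: string;
  syscall?: string;
  path?: string;
  spawnargs?: string[];
}

/** True when the executable itself could not be started because of EACCES. */
function isSpawnPermissionError(err: unknown): err is SpawnError {
  if (!(err instanceof Error)) return false;
  const e = err as SpawnError;
  return (
    e.code === "EACCES" &&
    typeof e.syscall === "string" &&
    e.syscall.startsWith("spawn ")
  );
}
```

A command that starts but fails inside the VM would instead end with a non-zero exit status, so it would not match this check.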

rgreig commented 2 years ago

@jibiabraham-dh are you able to build from source? Just wondering if the change I made fixes it for you.

I am on an Intel Mac (Core i9), also on 12.5.1.

soundweaverz commented 2 years ago

@rgreig Absolutely unbelievable, but this solved my problem as well. I don't really have experience in the codebase but are there processes in the Promise.all that are dependent on one another?

rgreig commented 2 years ago

@soundweaverz I am now quite familiar with the codebase (!), or at least with everything in lima.ts. The commands are not dependent on each other; as far as I can see, await is used correctly where ordering needs to be enforced.

I have been building a standalone node app that takes the code and dependencies from lima.ts in an attempt to reproduce this issue. Right now I have not managed to reproduce it and I have got quite a lot of the VM setup process running (including obviously the stuff inside Promise.all which was where I started). I will continue doing this since I hate not having a clear root cause! But I guess we could consider merging this change if it fixes what is a blocking issue for some people?

jandubois commented 2 years ago

@rgreig Thank you so much for all your effort into tracking this down! I too hate applying a "fix" without understanding why the fix works. Always makes you wonder if similar issues are lurking in other areas of the codebase.

I agree that we should merge your fix if the actual root cause remains elusive. We are currently not planning another release until the end of the month, so there is still some time.

Please create a regular PR for your change when you give up debugging, or let us know if we should do it!

Thanks again!

rgreig commented 2 years ago

@jandubois No problem, I am now intrigued by this issue. I will bear in mind the timeline for the next release and continue to debug for a bit longer. I am happy to raise a PR and I will update progress here.

rgreig commented 2 years ago

Just to give a quick update on progress in case anyone has any suggestions. I had built a small Node app that did everything that lima.ts did in terms of initialising and configuring the VM. Even in a soak test, that app did not reproduce the issue. I then decided to run the same app, unmodified, using the Electron CLI (in the same way that the Rancher Desktop app does). That immediately reproduced the issue, and I was able to reproduce it even with a much simplified version of the functionality. Removing Promise.all also "fixed" the issue there. I tried updating to the latest version of Electron (20.x) and it made no difference.

This is good up to a point: it shows that the issue lies within Electron's implementation of the Node process API. I am now building Electron from source so that I can instrument it further, but I am confident that this is ultimately an Electron issue on macOS.

gunamata commented 1 year ago

Another user on Slack seems to have run into the same problem. Here's the thread - https://rancher-users.slack.com/archives/C0200L1N1MM/p1674003141290749

Here's the log as shared on the thread

2023-01-18T00:59:41.070Z: mainEvents settings-update: {"version":4,"kubernetes":{"version":"1.24.4","memoryInGB":24,"numberCPUs":8,"port":6443,"containerEngine":"moby","checkForExistingKimBuilder":false,"enabled":true,"WSLIntegrations":{},"options":{"traefik":false,"flannel":true},"suppressSudo":false,"hostResolver":true,"experimental":{"socketVMNet":false}},"portForwarding":{"includeKubernetesServices":true},"images":{"showAll":true,"namespace":"k8s.io"},"telemetry":false,"updater":true,"debug":true,"pathManagementStrategy":"rcfiles","containerEngine":{"imageAllowList":{"enabled":false,"locked":false,"patterns":[]}},"diagnostics":{"showMuted":false,"mutedChecks":{}}}
2023-01-18T00:59:41.210Z: openMain() webRoot: app://.
2023-01-18T00:59:41.289Z: createWindow() name: main  url: app://./index.html
2023-01-18T00:59:42.008Z: Checking if credential helper osxkeychain is working...
2023-01-18T00:59:42.016Z: Credential helper "docker-credential-osxkeychain" is not functional: Error: spawn docker-credential-osxkeychain EACCES
2023-01-18T00:59:43.601Z: ipcMain: "k8s-state" triggered with arguments: 
2023-01-18T00:59:44.218Z: ipcMain: "settings-read" triggered with arguments: 
2023-01-18T00:59:44.218Z: ipcMain: "settings-read" triggered with arguments: 
2023-01-18T00:59:44.218Z: event settings-read in main: {"version":4,"kubernetes":{"version":"1.24.4","memoryInGB":24,"numberCPUs":8,"port":6443,"containerEngine":"moby","checkForExistingKimBuilder":false,"enabled":true,"WSLIntegrations":{},"options":{"traefik":false,"flannel":true},"suppressSudo":false,"hostResolver":true,"experimental":{"socketVMNet":false}},"portForwarding":{"includeKubernetesServices":true},"images":{"showAll":true,"namespace":"k8s.io"},"telemetry":false,"updater":true,"debug":true,"pathManagementStrategy":"rcfiles","containerEngine":{"imageAllowList":{"enabled":false,"locked":false,"patterns":[]}},"diagnostics":{"showMuted":false,"mutedChecks":{}}}
2023-01-18T00:59:44.219Z: ipcMain: "get-app-version" triggered with arguments: 
2023-01-18T01:00:08.090Z: Kubernetes was unable to start: Error: spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl EACCES
    at Process.ChildProcess._handle.onexit (node:internal/child_process:282:19)
    at onErrorNT (node:internal/child_process:477:16)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  errno: -13,
  code: 'EACCES',
  syscall: 'spawn /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
  path: '/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
  spawnargs: [
    '--debug',
    'shell',
    '--workdir=.',
    '0',
    'sudo',
    'rm',
    '-f',
    '/tmp/rd-nginx.conf-Jo5AdT.nginx.conf'
  ]
}
2023-01-18T01:00:09.099Z: openDialog() id:  KubernetesError
2023-01-18T01:00:09.285Z: createWindow() name: KubernetesError  url: app://./index.html#KubernetesError
2023-01-18T01:00:10.216Z: ipcMain: "k8s-state" triggered with arguments: 
2023-01-18T01:00:10.410Z: ipcMain: "get-app-version" triggered with arguments: 
2023-01-18T01:36:30.804Z: openMain() webRoot: app://.
2023-01-18T01:36:32.615Z: ipcMain: "show-logs" triggered with arguments:
