rancher-sandbox / rancher-desktop

Container Management and Kubernetes on the Desktop
https://rancherdesktop.io
Apache License 2.0
6k stars 282 forks source link

Kubernetes was unable to start: did not receive an event with the \\"running\\" status #2872

Open nagyv opened 2 years ago

nagyv commented 2 years ago

Actual Behavior

The Kubernetes instance does not start

Steps to Reproduce

Just start Rancher Desktop.

I tried to factory reset and changing container runtimes. Neither helped.

I recently migrated from an Intel to M1 chip mac. On the first run, Rancher Desktop kindly notified me that I need the aarch64 version. I uninstalled the old version and installed the aarch64 version. I did the migration using the built-in migration wizard.

Result

backend log tails:

2022-09-07T18:18:16.624Z: mainEvents settings-update: {"version":4,"kubernetes":{"version":"1.24.4","memoryInGB":6,"numberCPUs":2,"port":6443,"containerEngine":"moby","checkForExistingKimBuilder":false,"enabled":true,"WSLIntegrations":{},"options":{"traefik":false,"flannel":true},"suppressSudo":false,"experimentalHostResolver":false,"hostResolver":true},"portForwarding":{"includeKubernetesServices":false},"images":{"showAll":true,"namespace":"k8s.io"},"telemetry":true,"updater":true,"debug":false,"pathManagementStrategy":"rcfiles"}
2022-09-07T18:28:21.935Z: Kubernetes was unable to start: Error: /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl exited with code 1
    at ChildProcess.<anonymous> (/Applications/Rancher Desktop.app/Contents/Resources/app.asar/dist/app/background.js:17:142946)
    at ChildProcess.emit (node:events:390:28)
    at Process.ChildProcess._handle.onexit (node:internal/child_process:290:12) {
  stdout: '',
  stderr: 'time="2022-09-07T20:18:21+02:00" level=info msg="Using the existing instance \\"0\\""\n' +
    'time="2022-09-07T20:18:21+02:00" level=info msg="Starting switch daemon for \\"rancher-desktop-bridged_en0\\" network"\n' +
    'time="2022-09-07T20:18:21+02:00" level=info msg="Starting vmnet daemon for \\"rancher-desktop-bridged_en0\\" network"\n' +
    'time="2022-09-07T20:18:21+02:00" level=info msg="Starting vmnet daemon for \\"rancher-desktop-shared\\" network"\n' +
    'time="2022-09-07T20:18:22+02:00" level=warning msg="[hostagent] field `firmware.legacyBIOS` is not supported for architecture \\"aarch64\\", ignoring"\n' +
    'time="2022-09-07T20:18:22+02:00" level=info msg="[hostagent] Starting QEMU (hint: to watch the boot progress, see \\"/Users/nagyviktor/Library/Application Support/rancher-desktop/lima/0/serial.log\\")"\n' +
    'time="2022-09-07T20:18:22+02:00" level=info msg="SSH Local Port: 65521"\n' +
    'time="2022-09-07T20:18:22+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:19:58+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:21:30+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:23:03+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:24:35+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:26:07+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:27:39+02:00" level=info msg="[hostagent] Waiting for the essential requirement 1 of 5: \\"ssh\\""\n' +
    'time="2022-09-07T20:28:21+02:00" level=fatal msg="did not receive an event with the \\"running\\" status"\n',
  code: 1,
  command: [
    '/Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/lima/bin/limactl',
    'start',
    '--tty=false',
    '0'
  ]
}

Expected Behavior

The Kubernetes instance starts

Additional Information

From the issue tracker, this seems to be similar to https://github.com/rancher-sandbox/rancher-desktop/issues/2535

My logs are as they are expected to be:

ls -l ~/Library/Logs/rancher-desktop
total 64
-rw-------  1 nagyviktor  staff  3092 Sze  7 20:28 background.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:18 images.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:18 integrations.log
-rw-------  1 nagyviktor  staff   422 Sze  7 20:18 k8s.log
lrwxr-xr-x  1 nagyviktor  staff    62 Feb 13  2022 lima.ha.stderr.log -> ../../Application Support/rancher-desktop/lima/0/ha.stderr.log
lrwxr-xr-x  1 nagyviktor  staff    62 Feb 13  2022 lima.ha.stdout.log -> ../../Application Support/rancher-desktop/lima/0/ha.stdout.log
-rw-------  1 nagyviktor  staff  6600 Sze  7 20:28 lima.log
lrwxr-xr-x  1 nagyviktor  staff    59 Feb 13  2022 lima.serial.log -> ../../Application Support/rancher-desktop/lima/0/serial.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:18 mock.log
-rw-------  1 nagyviktor  staff  6625 Sze  7 20:18 networking.log
-rw-------  1 nagyviktor  staff   110 Sze  7 20:18 server.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:28 settings.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:18 steve.log
-rw-------  1 nagyviktor  staff   165 Sze  7 20:18 update.log
-rw-------  1 nagyviktor  staff     0 Sze  7 20:18 wsl.log

💾 Logs zipped and attached

Rancher Desktop Version

1.5.1

Rancher Desktop K8s Version

1.24.3

Which container engine are you using?

containerd (nerdctl)

What operating system are you using?

macOS

Operating System / Build Version

12.5.1

What CPU architecture are you using?

arm64 (Apple Silicon)

Linux only: what package format did you use to install Rancher Desktop?

No response

Windows User Only

No response

adamkpickering commented 2 years ago

Thanks for filing an issue!

Did you do a factory reset when you switched from the x86_64 version to the aarch64 version? I wonder if that may solve the problem. RD depends on a lot of things behind the scenes, and one of those may still be the x86_64 version.

Also, it looks like the following files in the logs you uploaded are symlinks:

lima.ha.stderr.log lima.ha.stdout.log lima.serial.log

Any chance you could reproduce the problem, and then upload the logs and include the non-symlink versions of those files?

nagyv commented 2 years ago

@adamkpickering Thanks for looking into the issue.

Did you do a factory reset when you switched from the x86_64 version to the aarch64 version?

Yes, I did.

Here are the symlinked logs.

more-logs.zip

adamkpickering commented 2 years ago

In ha.stderr.log, I see the same thing over and over:

{"level":"info","msg":"Waiting for the essential requirement 1 of 5: \"ssh\"","time":"2022-09-07T20:41:18+02:00"}
{"level":"debug","msg":"executing script \"ssh\"","time":"2022-09-07T20:41:19+02:00"}
{"level":"debug","msg":"executing ssh for script \"ssh\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/nagyviktor/Library/Application Support/rancher-desktop/lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers=\"^aes128-gcm@openssh.com,aes256-gcm@openssh.com\" -o User=nagyviktor -o ControlMaster=auto -o ControlPath=\"/Users/nagyviktor/Library/Application Support/rancher-desktop/lima/0/ssh.sock\" -o ControlPersist=5m -p 65521 127.0.0.1 -- /bin/bash]","time":"2022-09-07T20:41:19+02:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"kex_exchange_identification: read: Connection reset by peer\\r\\nConnection reset by 127.0.0.1 port 65521\\r\\n\", err=failed to execute script \"ssh\": stdout=\"\", stderr=\"kex_exchange_identification: read: Connection reset by peer\\r\\nConnection reset by 127.0.0.1 port 65521\\r\\n\": exit status 255","time":"2022-09-07T20:42:39+02:00"}

I'd have to look into it deeper to know for sure, but it seems like every time the host agent attempts to establish an SSH connection to the VM, the connection gets reset by something along the path to the VM. Do you have any non-vanilla network configs? Firewalls, VPNs, company laptop, that sort of thing?

I wonder if this rings a bell for @jandubois...

nagyv commented 2 years ago

I'm not aware of any special network setups.

Adam Pickering @.***> ezt írta (időpont: 2022. szept. 12., Hét 19:30):

In ha.stderr.log, I see the same thing over and over:

{"level":"info","msg":"Waiting for the essential requirement 1 of 5: \"ssh\"","time":"2022-09-07T20:41:18+02:00"} {"level":"debug","msg":"executing script \"ssh\"","time":"2022-09-07T20:41:19+02:00"} {"level":"debug","msg":"executing ssh for script \"ssh\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/nagyviktor/Library/Application Support/rancher-desktop/lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o @.**@.\" -o User=nagyviktor -o ControlMaster=auto -o ControlPath=\"/Users/nagyviktor/Library/Application Support/rancher-desktop/lima/0/ssh.sock\" -o ControlPersist=5m -p 65521 127.0.0.1 -- /bin/bash]","time":"2022-09-07T20:41:19+02:00"} {"level":"debug","msg":"stdout=\"\", stderr=\"kex_exchange_identification: read: Connection reset by peer\r\nConnection reset by 127.0.0.1 port 65521\r\n\", err=failed to execute script \"ssh\": stdout=\"\", stderr=\"kex_exchange_identification: read: Connection reset by peer\r\nConnection reset by 127.0.0.1 port 65521\r\n\": exit status 255","time":"2022-09-07T20:42:39+02:00"}

I'd have to look into it deeper to know for sure, but it seems like every time the host agent attempts to establish an SSH connection to the VM, the connection gets reset by something along the path to the VM. Do you have any non-vanilla network configs? Firewalls, VPNs, company laptop, that sort of thing?

I wonder if this rings a bell for @jandubois https://github.com/jandubois...

— Reply to this email directly, view it on GitHub https://github.com/rancher-sandbox/rancher-desktop/issues/2872#issuecomment-1244068645, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA65T5GYWKXIR5XXF4GM63V55SDJANCNFSM6AAAAAAQHAXV7U . You are receiving this because you authored the thread.Message ID: @.***>

adamkpickering commented 2 years ago

It's possible that it's a problem with lima or QEMU, and not with the network... are you able to use lima successfully?

nagyv commented 2 years ago

It's possible that it's a problem with lima or QEMU, and not with the network... are you able to use lima successfully?

I don't know. :)

Is lima expected to be in the $PATH without any additional setup outside of installing Rancher Desktop?

$ lima uname -a
zsh: command not found: lima
jandubois commented 2 years ago

Is lima expected to be in the $PATH without any additional setup outside of installing Rancher Desktop?

No, it isn't, and you would have to set LIMA_HOME correctly before it would work. Ideally you should never run it manually.

From lima.serial.log I can see that qemu can't boot the VM at all, which indicates that the basedisk is not working at all. It should be a copy of the ISO image:

jan@zilicon rancher-desktop % ls -l /Applications/Rancher\ Desktop.app/Contents/Resources/resources/darwin/alp*
-rw-r--r--  1 jan  admin  201990144 16 Sep 13:44 /Applications/Rancher Desktop.app/Contents/Resources/resources/darwin/alpine-lima-v0.2.21-rd-3.16.0.iso
jan@zilicon rancher-desktop % ls -l ~/Library/Application\ Support/rancher-desktop/lima/0/basedisk
-rw-r--r--  1 jan  staff  201990144 16 Sep 17:21 /Users/jan/Library/Application Support/rancher-desktop/lima/0/basedisk

I suspect that something got broken during your migration and would recommend to do another Factory Reset, then remove Rancher Desktop and install it again.

If it still fails, attach the new logs (including symlinked ones) and show the content of the 2 directories above.

codeninja55 commented 1 year ago

I'm basically having the same problem as the original post with similar hardware but on Ventura 13.2.1 and Rancher Desktop 1.6.2. Has anybody found a solution for this? Logs attached.

No special configurations for SSH as this is a personal laptop.

rancher-desktop-1.6.2-logs.zip

wibed commented 11 months ago

me too, on rancher desktop 1.9.1 up to 1.11, migrated from a timemachine backup. multiple factory resets, downgrades and reboots later... the firewall seems to be turned off.

i cant pinpoint the source of the problem.

has anyone had success getting this to work?

i had to go back to version 1.8 for it to work it worked upgrading from 1.8 to 1.9. but i couldnt get 1.9 to work from the installer.

these might be related. https://github.com/docker/compose/issues/8191

the error i get after installing brew install socket_vmnet

containerd kex_exchange_identification: read: Connection reset by peer

seems to be something wrong with the ssh settings. i gave rancher desktop administrative privileges and full disk access. still had to regress like 5 versions

it seems a similar issue to what occurs here is the case: https://github.com/rancher-sandbox/rancher-desktop/issues/2072

the ssh.sock file is not created within ~/Library/Application\ Support/rancher-desktop/lima/0/ i did try export lima home

export LIMA_HOME=`~/Library/Application\ Support/rancher-desktop/lima`
gauravingalkarext commented 9 months ago

facing the same issue. firewall is not an issue. seems like something wrong with the ssh configurations.

Trying to replicate the ssh command using config from error logs, I get the same connection reset error. is there a way to login into target vm where ssh server is running to validate ssh configurations ?

➜ ~ ssh -vvv -p 51074 -i xxx/rancher-desktop/lima/_config/user xxx@127.0.0.1 OpenSSH_9.4p1, LibreSSL 3.3.6 debug1: Reading configuration data /etc/ssh/ssh_config debug1: /etc/ssh/ssh_config line 21: include /etc/ssh/ssh_config.d/ matched no files debug1: /etc/ssh/ssh_config line 54: Applying options for debug2: resolve_canonicalize: hostname 127.0.0.1 is address debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/Users/xxx/.ssh/known_hosts' debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/Users/xxx/.ssh/known_hosts2' debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling debug3: ssh_connect_direct: entering debug1: Connecting to 127.0.0.1 [127.0.0.1] port 51074. debug3: set_sock_tos: set socket 3 IP_TOS 0x48 debug1: Connection established. debug1: identity file /Users/xxx/Library/Application Support/rancher-desktop/lima/_config/user type 3 debug1: identity file /Users/xxx/Library/Application Support/rancher-desktop/lima/_config/user-cert type -1 debug1: Local version string SSH-2.0-OpenSSH_9.4 kex_exchange_identification: read: Connection reset by peer Connection reset by 127.0.0.1 port 51074

spaceartichoke commented 9 months ago

Same issue with a fresh installation of 1.12.2 on a brand new M3 Mac. Lima stuck "Waiting for the essential requirement 1 of 5" until timeout.

Had to install 1.8 for successful initialization.

LB50000 commented 8 months ago

Same issue with a fresh installation of 1.12.2 on a brand new M3 Mac. Lima stuck "Waiting for the essential requirement 1 of 5" until timeout.

Had to install 1.8 for successful initialization.

I had the same issue. Installing 1.8 results in a successful initialisation. Upgrading to 1.12.3 then results in a broken Kubernetes again.

spaceartichoke commented 4 months ago

Hi @jandubois is there a plan to address this issue please? Although working on 1.8 is a valid workaround for now, I would like to move to newer versions. As stated above, I had this issue from the very beginning on a new M3 with a fresh installation of 1.12.2. Only factory reset and moving back to 1.8 resulted in successful initialization. This seems like a wider issue on the new Macs rather than an isolated incident.