rancher-sandbox / rancher-desktop

Container Management and Kubernetes on the Desktop
https://rancherdesktop.io
Apache License 2.0
5.94k stars 281 forks

MacOS DNS regression in 1.5.0 and 1.5.1 #2811

Open ryfow opened 2 years ago

ryfow commented 2 years ago

Actual Behavior

With Rancher Desktop 1.5.{0,1} on aarch64 MacOS, I'm seeing qemu-system-aarch64 hang for several minutes at a time in my development environment. The problem appears to be triggered by a process making bursty DNS requests for host.docker.internal. The same development environment works fine on Rancher Desktop 1.4.1.

Steps to Reproduce

This isn't how I found the problem, but I think it reproduces the same underlying issue.

  1. Install Rancher Desktop 1.5.1 and configure in Docker/moby mode.
  2. Run docker run --rm --name crashy-crashy -ti ubuntu:20.04 bash -c 'apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y dnsutils psmisc && while true ; do dig host.docker.internal ; done'
  3. Wait for the crashy-crashy container to start logging dig output
  4. Run the following command in another host terminal: docker exec -ti crashy-crashy bash -c "while true ; do killall dig ; sleep .1 ; done"
  5. Wait for a bit. The dig output should eventually stop, qemu-system-aarch64 will be running at 100% CPU on your MacOS host, and docker commands will no longer work.
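
Step 5 depends on noticing that docker commands have stopped responding. A minimal sketch of a probe for that symptom, assuming a POSIX-style shell on the host: `run_with_timeout` is a hypothetical helper name (not part of Rancher Desktop or Docker), written out by hand because a `timeout` command may not be available on the host.

```shell
# Hypothetical helper: run a command, but give up after SECS seconds.
# Returns the command's own exit status, or 124 if it had to be killed.
run_with_timeout() {
  local secs="$1"
  shift
  # Run the probe in the background so this shell can keep polling it.
  "$@" >/dev/null 2>&1 &
  local pid=$!
  local waited=0
  # kill -0 only checks whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$secs" ]; then
      kill "$pid" 2>/dev/null || true
      wait "$pid" 2>/dev/null || true
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # The command finished on its own; report its exit status.
  wait "$pid"
}
```

For example, `run_with_timeout 10 docker info` returning 124 would suggest the VM has stopped answering. The 10-second threshold is an arbitrary choice for illustration.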

Result

I'm seeing the Rancher Desktop qemu VM become unresponsive until I kill the qemu-system-aarch64 process and restart Rancher Desktop.

Expected Behavior

The Rancher Desktop VM should not hang.

Additional Information

No response

Rancher Desktop Version

1.5.1

Rancher Desktop K8s Version

N/A

Which container engine are you using?

moby (docker cli)

What operating system are you using?

macOS

Operating System / Build Version

MacOS Monterey 12.5.1

What CPU architecture are you using?

arm64 (Apple Silicon)

Linux only: what package format did you use to install Rancher Desktop?

N/A

Windows User Only

No response

ryfow commented 2 years ago

FWIW, I ran my reproduction steps on a second Macbook, and I see the same behavior.

Nino-K commented 1 year ago

> FWIW, I ran my reproduction steps on a second Macbook, and I see the same behavior.

@ryfow is the second Macbook also an M1, or is it x86?

@jandubois do you think you can reproduce this on your M1 machine?

matsukaz commented 1 year ago

Hi, I'm facing a similar issue in 1.7.0. It seems like Lima gets stuck at the file descriptor limit, but I haven't found a way to solve it yet. This issue also occurs on x86 macOS.

Steps to Reproduce

  1. Log in to Lima and keep running nslookup.
    $ rdctl shell
    lima-rancher-desktop:/Users/xxx$ while true; do nslookup www.google.co.jp; done
  2. On the host OS, list the open UDP files that qemu-system-aarch64 holds.
    $ lsof -p $(pgrep qemu-system-aarch64) | grep "UDP"
    qemu-syst 6788 xxxx  119u  IPv4 0x2c6ecf140850ff5f         0t0                 UDP *:63544
    qemu-syst 6788 xxxx  120u  IPv4 0x2c6ecf140851762f         0t0                 UDP *:63398
  3. The number of open UDP files keeps increasing, and after it reaches FD=1024u, Lima gets stuck.
    $ lsof -p $(pgrep qemu-system-aarch64) | grep "UDP"
    ...
    qemu-syst 6788 xxxx  1023u  IPv4 0x2c6ecf14085191bf         0t0                 UDP *:54486
    qemu-syst 6788 xxxx  1024u  IPv4 0x2c6ecf140852088f         0t0                 UDP *:62934
  4. If you wait exactly 4 minutes, all open UDP files are released and Lima starts running again.
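
The check in steps 2 and 3 can be scripted so the growth is easier to watch. The sketch below is an illustration only: `count_udp_fds`, `max_udp_fd`, and `udp_fd_status` are hypothetical helper names, the column positions assume the lsof layout shown above, and the default limit of 1024 is simply the FD number at which the hang was observed here.

```shell
# Count UDP entries in `lsof -p PID` output read from stdin.
# Assumes the layout shown above, where the 8th column is the protocol.
count_udp_fds() {
  awk '$8 == "UDP" { n++ } END { print n + 0 }'
}

# Print the highest FD number among UDP entries (4th column, e.g. "1024u").
max_udp_fd() {
  awk '$8 == "UDP" {
         fd = $4
         sub(/[^0-9]+$/, "", fd)      # strip the access-mode suffix such as "u"
         if (fd + 0 > max) max = fd + 0
       }
       END { print max + 0 }'
}

# Compare an FD number against a limit (assumed default: 1024).
udp_fd_status() {
  local fd="$1"
  local limit="${2:-1024}"
  if [ "$fd" -ge "$limit" ]; then
    echo "at-limit"
  else
    echo "ok"
  fi
}
```

On the host, something like `lsof -p $(pgrep qemu-system-aarch64) | max_udp_fd` could be run repeatedly while the nslookup loop is active, passing the result to `udp_fd_status` to flag when the reported threshold is reached.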

Rancher Desktop Version

1.4.1, 1.6.2, 1.7.0

Rancher Desktop K8s Version

N/A

Which container engine are you using?

moby (docker cli)

Operating System / Build Version / CPU

MacOS Monterey 12.6 (M1 2020)
MacOS Ventura 13.0.1 (Intel Core i5, 2019)

matsukaz commented 1 year ago

This issue may be a problem with Alpine Linux. I tried it with Lima and got the same problem. It also occurs with Debian, but not with Ubuntu.

I used the following images.

Nino-K commented 1 year ago

@ryfow the issue has been addressed here: https://github.com/lima-vm/lima/issues/1285, therefore, it should be included in our upcoming release. Thank you again for reporting this.

ryfow commented 1 year ago

Awesome! Looking forward to upgrading from 1.4.1 :)

Nino-K commented 1 year ago

I'm going to close this since all the changes are in place now. @ryfow and @matsukaz, please keep an eye on our next release and give it a try. Feel free to re-open if you encounter anything additional. Thanks

ryfow commented 1 year ago

@Nino-K This appears to still be a problem with Rancher Desktop 1.8. I don't know for sure if the same thing is making my dev environment hang, but I think it's the most likely suspect.

Edit: I can't figure out how to reopen.

matsukaz commented 1 year ago

@Nino-K At least in my environment, this issue was resolved with Rancher Desktop 1.8! I have not seen this issue since I upgraded to 1.8, even with the reproduction procedure I posted earlier.

@ryfow I'm not sure, but it may be another problem.

ryfow commented 1 year ago

I tried my original reproduction steps with 1.8.1 on a work M1 Macbook and a personal M1 Macbook. It hangs on both and puts qemu at 100% CPU usage.

Nino-K commented 1 year ago

@ryfow could your issue possibly be related to this one? https://github.com/lima-vm/lima/issues/1333

ryfow commented 1 year ago

@Nino-K I don't think it's https://github.com/lima-vm/lima/issues/1333. That bug appears to be talking about Virtualization.Framework. Looks like Rancher Desktop uses qemu.

I tried to follow my reproduction steps on lima 0.15, qemu 7.2.1 and limactl start --name docker template:///docker. I couldn't reproduce it; the hang did not happen.

The qemu version is different, so I tried copying my system version of qemu-system-aarch64 (7.2.1) into the "Rancher Desktop.app" but that did not help. I still see the hang on Rancher Desktop with the new qemu.

ryfow commented 1 year ago

It's got to be a problem with https://github.com/lima-vm/alpine-lima. When I start a VM with limactl start --name alpine template://alpine the problem reproduces.

ryfow commented 9 months ago

As an FYI to anyone else running into this, I've had good results with using the VZ Virtual Machine Type. Things seem way more stable.