lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0
14.92k stars 579 forks source link

Stuck waiting for ssh -- tcpproxy no route #1293

Open larsyencken opened 1 year ago

larsyencken commented 1 year ago

Description

Hello,

I'd been working with a VM and then lost access to it. Doing limactl stop <instance> && limactl start <instance> didn't seem to help, and now it just gets stuck indefinitely Waiting for the essential requirement 1 of 3: "ssh".

Interestingly, this line repeats:

INFO[0004] [hostagent] 2023/01/11 21:51:42 tcpproxy: for incoming conn 127.0.0.1:52962, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route

Which seems to me like maybe there's a problem with the bridge interface.

The VM template I'm using looks like this:

# experimental settings
vmType: vz
mountType: virtiofs
networks:
- vzNAT: true

# make it a native M1 VM
arch: aarch64

# appropriate levels for the host machine
cpus: 8
memory: 16GiB
disk: 300GiB

...

# share a temp space to make sharing files with the host easier
mounts:
- location: "~"
- location: "/tmp/lima"
  writable: true

ssh:
  localPort: 60023
  forwardAgent: true

Logs look like this:

lars@juniper lima/_output (master) » limactl start alt
INFO[0000] Using the existing instance "alt"
INFO[0000] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/lars/.lima/alt/serial.log")
INFO[0001] SSH Local Port: 60023
INFO[0001] [hostagent] new connection from  to
INFO[0001] [hostagent] [VZ] - vm state change: running
INFO[0001] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0004] [hostagent] 2023/01/11 21:51:42 tcpproxy: for incoming conn 127.0.0.1:52962, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0014] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0017] [hostagent] 2023/01/11 21:51:56 tcpproxy: for incoming conn 127.0.0.1:52971, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0027] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0030] [hostagent] 2023/01/11 21:52:09 tcpproxy: for incoming conn 127.0.0.1:52974, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0040] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0043] [hostagent] 2023/01/11 21:52:22 tcpproxy: for incoming conn 127.0.0.1:52975, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0053] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0056] [hostagent] 2023/01/11 21:52:35 tcpproxy: for incoming conn 127.0.0.1:52976, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0066] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0069] [hostagent] 2023/01/11 21:52:48 tcpproxy: for incoming conn 127.0.0.1:52977, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
INFO[0079] [hostagent] Waiting for the essential requirement 1 of 3: "ssh"
INFO[0082] [hostagent] 2023/01/11 21:53:01 tcpproxy: for incoming conn 127.0.0.1:52979, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route
kurtraschke commented 1 year ago

Did you experience 100% CPU usage from the hypervisor process while this was happening?

I suspect this may be a duplicate of #1288, and possibly #1136 as well - there seems to be a failure mode (specifically with Virtualization.framework where the VM crashes, the hypervisor process uses 100% host CPU (times the number of cores allocated to the VM, of course), and attempts to restart the VM are not successful. In my experience, this happens with vz VMs left running for a period of time, and seems to be accelerated (or aggravated) by host sleep-wake cycles.

larsyencken commented 1 year ago

No, the hypervisor process seems to be quiet:

Screenshot 2023-01-12 at 16 39 25

I completely agree that host sleep/wake cycles are involved. Lima very reliably runs, but any issues I've experienced seem to happen after the laptop has been put to sleep then woken up again. I wonder if it's possible to use a hook of some sort to shut down the VM before sleep.

I agree, I think this is another report of #1288.

I also checked now whether I can spin up new machines. Both default and experimental/vz templates come up clean, for some reason it's just my existing VM that can no longer start.

gilliardmacedo commented 1 year ago

Got the same when creating a new VM:

limactl start --name=default lima_arm.yaml
? Creating an instance "default" Proceed with the current configuration
WARN[0009] Ignoring: vmType vz: [CPUType]               
INFO[0009] Attempting to download the image from "https://cloud-images.ubuntu.com/releases/22.10/release-20221201/ubuntu-22.10-server-cloudimg-arm64.img"  digest="sha256:9575dfe9f925ec251a933b88a38c5582a18e9d19495025ac01cb2e217e5f14ca"
INFO[0009] Using cache "/Users/gilliardmvc/Library/Caches/lima/download/by-url-sha256/6bbad93b29a1faf5328aac270b23b85f75b3fce9d579c3e5bc91b7abfb870647/data" 
INFO[0011] Attempting to download the nerdctl archive from "https://github.com/containerd/nerdctl/releases/download/v1.1.0/nerdctl-full-1.1.0-linux-arm64.tar.gz"  digest="sha256:3b613a1be5a24460c44bb93a3609b790ada94e06efd1a86467d45bec7da8b449"
INFO[0011] Using cache "/Users/gilliardmvc/Library/Caches/lima/download/by-url-sha256/c69be86b3e48430e3d687d54361cf15b90e2c067fae6a294ca6292d41f42bf0e/data" 
INFO[0012] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/gilliardmvc/.lima/default/serial.log") 
INFO[0012] [hostagent] Setting up Rosetta share         
INFO[0013] SSH Local Port: 60022                        
INFO[0013] [hostagent] new connection from  to   
INFO[0013] [hostagent] Waiting for the essential requirement 1 of 5: "ssh" 
INFO[0013] [hostagent] [VZ] - vm state change: running  
INFO[0016] [hostagent] 2023/01/17 21:04:57 tcpproxy: for incoming conn 127.0.0.1:52287, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route 
INFO[0023] [hostagent] Waiting for the essential requirement 1 of 5: "ssh" 
INFO[0033] [hostagent] Waiting for the essential requirement 1 of 5: "ssh" 
INFO[0043] [hostagent] Waiting for the essential requirement 1 of 5: "ssh" 
INFO[0053] [hostagent] Waiting for the essential requirement 1 of 5: "ssh" 

Lima 0.14.2 Using "vz"

Only ocurrs when I use a template. If I create a fresh new VM, I got the same error once, but quickly requirement is satisfied

ejames17 commented 1 year ago

I am having the same issue as well. Usually happens after my mac has been idle for some time.

t0ch1k commented 1 year ago

I'm facing with the same issue and also noticed it happens after mac being idle for quite long time (10+ hours). Unfortunately only factory-reset of VM helps :(

ejames17 commented 11 months ago

I was able to resolve it by preventing the guest machine from sleeping. You can do a factory reset on the guest machine then ssh into it and from the terminal run

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

This will disable the guest machine and ssh services from sleeping.

loheagn commented 7 months ago

I face the same issue on m3 pro mbp.

Here is the output of the limactl start command:

$ limactl start ccr-dev

INFO[0000] Using the existing instance "ccr-dev"
INFO[0000] Starting the instance "ccr-dev" with VM driver "vz"
INFO[0001] [hostagent] hostagent socket created at /Users/bytedance/.lima/ccr-dev/ha.sock
INFO[0001] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/bytedance/.lima/ccr-dev/serial*.log")
INFO[0001] [hostagent] new connection from  to
INFO[0001] SSH Local Port: 52832
INFO[0001] [hostagent] [VZ] - vm state change: running
INFO[0001] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0011] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0014] [hostagent] 2024/01/12 15:57:26 tcpproxy: for incoming conn 127.0.0.1:52841, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0024] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0027] [hostagent] 2024/01/12 15:57:39 tcpproxy: for incoming conn 127.0.0.1:52850, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0037] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0040] [hostagent] 2024/01/12 15:57:52 tcpproxy: for incoming conn 127.0.0.1:52860, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0050] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0053] [hostagent] 2024/01/12 15:58:05 tcpproxy: for incoming conn 127.0.0.1:52871, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0063] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0066] [hostagent] 2024/01/12 15:58:18 tcpproxy: for incoming conn 127.0.0.1:52881, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0076] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0079] [hostagent] 2024/01/12 15:58:31 tcpproxy: for incoming conn 127.0.0.1:52890, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0089] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0092] [hostagent] 2024/01/12 15:58:44 tcpproxy: for incoming conn 127.0.0.1:52897, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0102] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0105] [hostagent] 2024/01/12 15:58:57 tcpproxy: for incoming conn 127.0.0.1:52910, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0115] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0118] [hostagent] 2024/01/12 15:59:10 tcpproxy: for incoming conn 127.0.0.1:52927, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
INFO[0128] [hostagent] Waiting for the essential requirement 1 of 2: "ssh"
INFO[0131] [hostagent] 2024/01/12 15:59:23 tcpproxy: for incoming conn 127.0.0.1:52938, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host
backroot commented 6 months ago

I am having the same issue as well too. (lima: stable 0.19.1)

ikarlashov commented 3 months ago

Ok, I found the issue. Spent a few hours, ofc :( What can I say: handling errors during VM startup could be done better on lima side (sorry, I can't help with that by making PR, just sharing my thoughts).

Apparently, whoever gets tcpproxy: for incoming conn 127.0.0.1:52938, error dialing "192.168.5.15:22": connect tcp 192.168.5.15:22: no route to host has an issue with VM's guest system startup process. In my case there were multiple issues with VM's disk filesystem. By making changes to my VM lima.yaml:

video:
  display: "vz"

I could actually see the startup process of my VM. So, the OS didn't start properly. I was dropped to command-line of busybox shell. From where I could run fschk and fix the filesystem. And then "exit" and ubuntu started normally.

backroot commented 3 months ago

After upgrading to lima 0.21.0, this problem no longer occurs in my environment.