lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0

Guest agent sometimes doesn't start (or is unreachable from host agent) #2064

Open jandubois opened 6 months ago

jandubois commented 6 months ago

Well, at least not initially.

[Update: it seems to start, but hostagent cannot connect to it]

We've observed this multiple times with Rancher Desktop after switching to Lima 0.19.0 (https://github.com/rancher-sandbox/rancher-desktop/issues/6151).

AFAIK, this was always with QEMU and not VZ, and always when restarting an existing instance, and not when creating a new one.

Rancher Desktop is using Alpine; I have not seen it with other distros, but that doesn't mean much.

I don't think guest-agent logs are written to file on Alpine; I will need to look into fixing that.
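As a rough sketch of what such a fix could look like, assuming the guest agent runs under OpenRC's supervise-daemon (the actual service script, binary path, and arguments may differ):

```sh
#!/sbin/openrc-run
# Hypothetical /etc/init.d/lima-guestagent fragment: supervise-daemon can
# redirect the daemon's stdout/stderr to log files via output_log/error_log.
supervisor=supervise-daemon
name="lima-guestagent"
command="/usr/local/bin/lima-guestagent"
command_args="daemon"
output_log="/var/log/lima-guestagent.log"
error_log="/var/log/lima-guestagent.log"
```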

I've run into it twice today on different machines, but there don't seem to be any steps to reliably reproduce the problem.

    'time="2023-12-13T10:04:13-08:00" level=info msg="[hostagent] Waiting for the guest agent to be running"\n' +
    'time="2023-12-13T10:04:13-08:00" level=info msg="[hostagent] Forwarding \\"/var/run/docker.sock\\" (guest) to \\"/Users/jan/.rd/docker.sock\\" (host)"\n' +
    'time="2023-12-13T10:05:13-08:00" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \\"boot scripts must have finished\\""\n' +
    'time="2023-12-13T10:05:13-08:00" level=info msg="[hostagent] The final requirement 1 of 1 is satisfied"\n' +
    'time="2023-12-13T10:05:13-08:00" level=error msg="[guest agent does not seem to be running; port forwards will not work]"\n' +
    'time="2023-12-13T10:05:13-08:00" level=warning msg="DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see \\"/Users/jan/Library/Application Support/rancher-desktop/lima/0/ha.stderr.log\\")"\n' +
    'time="2023-12-13T10:05:13-08:00" level=fatal msg="degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:63025}"\n',
    'time="2023-12-13T10:35:24-08:00" level=info msg="[hostagent] Waiting for the guest agent to be running"\n' +
    'time="2023-12-13T10:35:24-08:00" level=info msg="[hostagent] Forwarding \\"/var/run/docker.sock\\" (guest) to \\"/Users/jan/.rd/docker.sock\\" (host)"\n' +
    'time="2023-12-13T10:36:24-08:00" level=info msg="[hostagent] Waiting for the final requirement 1 of 1: \\"boot scripts must have finished\\""\n' +
    'time="2023-12-13T10:36:24-08:00" level=info msg="[hostagent] The final requirement 1 of 1 is satisfied"\n' +
    'time="2023-12-13T10:36:24-08:00" level=error msg="[guest agent does not seem to be running; port forwards will not work]"\n' +
    'time="2023-12-13T10:36:24-08:00" level=warning msg="DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see \\"/Users/jan/Library/Application Support/rancher-desktop/lima/0/ha.stderr.log\\")"\n' +
    'time="2023-12-13T10:36:24-08:00" level=fatal msg="degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:62183}"\n',
mook-as commented 6 months ago

FWIW, I've been able to (sometimes) reproduce this with the default (Ubuntu-based) image, just by using a script to start and stop the instance in a loop:

#!/usr/bin/env bash
set -o errexit
machine="$(limactl list --format '{{.Name}}')"
while true; do
    limactl list
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped) limactl --debug start "${machine}";;
    Running) limactl stop "${machine}";;
    Broken) exit;;
    esac
    sleep 1
done

Eventually limactl start returns an error (I don't believe I ever reached the Broken state).

This was at 04e3886be95ef9ec02e362a1c990078abe735e94. The VM was created with limactl create, accepting all defaults. QEMU (from Homebrew) 8.1.2.

stderr:

```
INFO[0022] [hostagent] Waiting for the guest agent to be running
INFO[0082] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
DEBU[0082] [hostagent] executing script "boot scripts must have finished"
DEBU[0082] [hostagent] executing ssh for script "boot scripts must have finished": /usr/bin/ssh [ssh -F /dev/null -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers="^aes128-gcm@openssh.com,aes256-gcm@openssh.com" -o User=markyen -o ControlMaster=auto -o ControlPath=".../.lima/default/ssh.sock" -o ControlPersist=yes -p 60022 127.0.0.1 -- /bin/bash]
DEBU[0082] [hostagent] stdout="", stderr="+ timeout 30s bash -c 'until sudo diff -q /run/lima-boot-done /mnt/lima-cidata/meta-data 2>/dev/null; do sleep 3; done'\n", err=
INFO[0082] [hostagent] The final requirement 1 of 1 is satisfied
DEBU[0082] received an event event="{2023-12-20 14:02:45.836987 -0800 PST {true true false [guest agent does not seem to be running; port forwards will not work] 60022}}"
ERRO[0082] [guest agent does not seem to be running; port forwards will not work]
WARN[0082] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see ".../.lima/default/ha.stderr.log")
FATA[0082] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022}
```

sudo journalctl -b -u lima-guestagent:

```
Dec 20 22:01:33 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 20 22:01:34 lima-default lima-guestagent[1340]: time="2023-12-20T22:01:34Z" level=info msg="event tick: 3s"
Dec 20 22:01:34 lima-default lima-guestagent[1340]: time="2023-12-20T22:01:34Z" level=info msg="serving the guest agent on qemu serial file: /dev/virtio-ports/io.lima-vm.guest_agent.0"
Dec 20 22:01:39 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Dec 20 22:01:39 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Dec 20 22:01:39 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Dec 20 22:01:39 lima-default systemd[1]: lima-guestagent.service: Consumed 6.750s CPU time.
Dec 20 22:01:39 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 20 22:01:39 lima-default lima-guestagent[2137]: time="2023-12-20T22:01:39Z" level=info msg="event tick: 3s"
Dec 20 22:01:39 lima-default lima-guestagent[2137]: time="2023-12-20T22:01:39Z" level=info msg="serving the guest agent on qemu serial file: /dev/virtio-ports/io.lima-vm.guest_agent.0"
```
jandubois commented 6 months ago

Both @mook-as and I have run git bisect with this updated script:

#!/usr/bin/env bash

mkdir -p bisect
make binaries

machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0

while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
        else
            if grep --quiet 'guest agent does not seem to be running' "bisect/${commit}.log"; then
                exit 1
            fi
            errors=$((errors+1))
            if [[ "$errors" -gt 10 ]]; then
                exit 125
            fi
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done

It needs $PWD/_output/bin in the PATH and you must have created a single instance for testing.
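The exit codes follow the git bisect run convention (non-zero = bad, 125 = skip), so a full run looks roughly like this (a sketch, assuming the script above is saved as bisect.sh at the top of the Lima checkout):

```sh
# Tell git the known-bad and known-good tags, then let it drive the script.
git bisect start v0.19.0 v0.18.0
git bisect run ./bisect.sh
git bisect reset
```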

Bisecting between v0.18.0 and v0.19.0: for both of us (on M1 machines) the bisect identified #2009 ("hostagent: avoid requiring fuser and ss in the guest; CI: update colima (v0.5.5→v0.6.5)" by AkihiroSuda) as the "first bad commit". I still have another bisect running on an Intel machine, but it is a very slow box; I may not have results until tomorrow.

I tried adding sleep 60 after the limactl stop, in case qemu was not finished by the time the hostagent exited, but it made no difference. Extending the hostagent timeout from 1min to 5min also did not make a difference (beyond it now taking 5min to report that the guestagent is unreachable).

@AkihiroSuda Any ideas?

PS: Yes, the bisect script is explicitly looking for the error message that was introduced in the PR that was eventually found. But the old code had never indicated that there was a problem communicating with the guestagent.

Note that the guestagent is running; it is the code trying to communicate with it that is failing.

balajiv113 commented 6 months ago

@jandubois Are you able to view the guestagent logs? Are any connection-closed-like errors thrown?

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

jandubois commented 6 months ago

@jandubois Are you able to view the guestagent logs? Are any connection-closed-like errors thrown?

Yes, I can see the logs. Now it happened to me for the first time on a fresh start:

$ l start
? Creating an instance "default" Proceed with the current configuration
INFO[0000] Starting the instance "default" with VM driver "qemu"
INFO[0000] QEMU binary "/opt/homebrew/bin/qemu-system-aarch64" seems properly signed with the "com.apple.security.hypervisor" entitlement
[...]
INFO[0099] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0099] [hostagent] The final requirement 1 of 1 is satisfied
ERRO[0099] [guest agent does not seem to be running; port forwards will not work]
WARN[0099] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see "/Users/jan/.lima/default/ha.stderr.log")
FATA[0099] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022}

The guestagent log indicates that the agent was stopped and started again, which feels odd:

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:21:14 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:21:14 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="event tick: 3s"
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I don't think this is always the case, but I will verify this.

There is no error in ha.stderr.log and the instance no longer looks degraded, so maybe this is different from the other failures.

ha.stderr.log

jandubois commented 6 months ago

I don't think this is always the case, but I will verify this.

Have reproduced with the bisect script (just running against HEAD of master):

$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:39:44 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
Jan 02 11:39:48 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:39:48 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Consumed 5.085s CPU time.
Jan 02 11:39:48 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

This time the first run of the guest agent gets to emit at least the two startup log lines, but otherwise it looks the same.

jandubois commented 6 months ago

But running things on Alpine (again with HEAD of master) I don't see the guest agent being restarted (but we still go into the degraded state because communication fails):

$ l shell alpine sudo rc-status | grep guestagent
 lima-guestagent                                       [  started 00:02:14 (0) ]

$ l shell alpine sudo cat /var/log/lima-guestagent.log
time="2024-01-02T19:49:02Z" level=info msg="event tick: 3s"
time="2024-01-02T19:49:02Z" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"

I guess it is possible that the first run of guest agent didn't output anything, but I think the (0) part of the rc-status output means that supervise-daemon did not restart the service.

@balajiv113 It is pretty straightforward to reproduce the issue: you must have just a single instance, then run the script from https://github.com/lima-vm/lima/issues/2064#issuecomment-1865215469, which will stop and restart the instance until it breaks. Or edit the script to hard-code the instance name (as sketched below) if you have other instances that you need to keep.
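For example (a hypothetical edit, using whatever instance name you test against):

```sh
# Instead of auto-detecting the only instance:
# machine="$(limactl list --format '{{.Name}}')"
machine="alpine"   # hard-code the instance to stop/start in the loop
```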

Failure typically happens in the first 5-10 iterations, but sometimes takes longer.

jandubois commented 6 months ago

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

Possible, but in my testing simply reverting 482678b3ed408772429f50785908f8f233d1471e (and installing ss in the instance when using VZ) was enough to "fix" this problem. So the problem seems to be with the new "alive" checking code and not the serial port channel by itself.

jandubois commented 6 months ago

This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.

Possible, but in my testing simply reverting 482678b (and installing ss in the instance when using VZ) was enough to "fix" this problem. So the problem seems to be with the new "alive" checking code and not the serial port channel by itself.

@balajiv113 I was wrong; the problem does indeed seem to be due to the use of the serial port as a channel, and is not related to the updated readiness test. The bisect script was not testing correctly for the error condition. I've since re-run the bisect between 0.18.0 and 0.19.0 with QEMU and the default Ubuntu image with this script:

#!/usr/bin/env bash

mkdir -p bisect
make binaries

machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0

while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
            sleep 7
            if [ "$machine" = "alpine" ]; then
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l -p 4444 -s 0.0.0.0 2>/dev/null
            else
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l 4444
            fi
            sleep 7
            if ! grep --quiet '0.0.0.0:4444' "$HOME/.lima/${machine}/ha.stderr.log"; then
                exit 1
            fi
        else
            exit 125
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done

Instead of relying on indirect evidence of the guest agent running, it opens port 4444 inside the instance and then checks ha.stderr.log to confirm that a forwarding request has been received.
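The check in the script amounts to roughly this manual test (a sketch; the instance name and log path assume the default instance):

```sh
# Start a short-lived listener inside the guest; if the guest agent is healthy,
# the hostagent should notice the new listening port and set up a forward.
limactl shell --workdir / default sudo timeout 10 nc -l 4444

# Then confirm the hostagent logged the forwarding request for 0.0.0.0:4444.
grep '0.0.0.0:4444' "$HOME/.lima/default/ha.stderr.log"
```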

This bisect identifies f947de0f837f419b40f9d0dea5b6d4e1a82a18bb as the "first bad commit" causing the script to fail.

So far I've still not seen any tests fail with VZ, just with QEMU, so only the serial port code seems to be affected.

iilyak commented 5 months ago

It seems like the fix has been merged. Is there a timeline for a new release, so it can be pulled via brew upgrade lima without having to build from master?

AkihiroSuda commented 5 months ago

Released v0.20.

Will be available on Homebrew in a couple of hours

DrScho commented 2 months ago

I use lima 0.21.0 but still have the problem.

dubcl commented 2 months ago

same here, with lima 0.21.0 also

jandubois commented 2 months ago

@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?

dubcl commented 2 months ago

Hi @jandubois, I'm on macOS Sonoma 14.4.1, arm64.

$ uname -a
Darwin MACHINENAME 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64
jandubois commented 2 months ago

@dubcl Please confirm if you are using QEMU or VZ

dubcl commented 2 months ago

sorry, QEMU

$ qemu-system-x86_64 -version
QEMU emulator version 8.2.1
jandubois commented 2 months ago

sorry, QEMU

Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).

Just for completeness, are you running the default template, or something else?

And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while? Does this happen with an empty instance, or only after you deploy some workload on it?

Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone to investigate.

DrScho commented 2 months ago

@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?

I am using Linux (on macOS) 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64, QEMU emulator version 9.0.0, and Lima 0.21.0 with the default instance. I just ran 'limactl start' and there is no problem with the guest system. But when I try to start a docker container with 'docker run --name myName -p 127.0.0.1:1521:1521/tcp myImage' I still receive 'docker: 'run --name' is not a docker command.' although the command appears to be correct and docker in the VM has the --name option.

dubcl commented 2 months ago

sorry, QEMU

Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).

I already use 0.21; I think this happens when updating from 0.20 to 0.21.

Just for completeness, are you running the default template, or something else?

Yes, I guess; this is my config:

# This template requires Lima v0.7.0 or later.
arch: "x86_64"
cpus: 4
memory: "6GiB"
vmType: "qemu"
firmware:
  legacyBIOS: true
images:
  # Try to use release-yyyyMMdd image if available. Note that release-yyyyMMdd will be removed after several months.
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-amd64.img"
    arch: "x86_64"
    digest: "sha256:5ad255d32a30a2cda9f0df19f0a6ce8d6f3c81b63845086a4cb5c43cf97fcb92"
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-arm64.img"
    arch: "aarch64"
    digest: "sha256:af62ca6ba307388f7e0a8ad1c46103e6aea0130a09122e818df8d711637bf998"
  # Fallback to the latest release image.
  # Hint: run `limactl prune` to invalidate the cache
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-amd64.img"
    arch: "x86_64"
  - location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-arm64.img"
    arch: "aarch64"
mounts:
  - location: "~"
  - location: "/tmp/lima"
    writable: true

And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while?

I'm not sure; I never put it to sleep. It is always on, because I mostly use it as a desktop.

Does this happen with an empty instance, or only after you deploy some workload on it?

I noticed it when I tried to set up a port forward for an nginx image I was testing.

Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone to investigate.

np, glad to help you.


Additionally, to test, I removed the default folder and started a "new default" with the same default.yaml, and everything works OK. If I get the error again, I will come back and comment.