jandubois opened 11 months ago
FWIW, I've been able to (sometimes) reproduce this with the default (Ubuntu-based) image, just using a script that starts and stops the instance in a loop:
#!/usr/bin/env bash
set -o errexit
machine="$(limactl list --format '{{.Name}}')"
while true; do
    limactl list
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped) limactl --debug start "${machine}";;
    Running) limactl stop "${machine}";;
    Broken)  exit;;
    esac
    sleep 1
done
Eventually limactl start returns an error (I don't believe I ever reached the Broken state).
This was at 04e3886be95ef9ec02e362a1c990078abe735e94. The VM was created with limactl create, accepting all defaults. QEMU (from Homebrew) 8.1.2.
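For reference, the repro setup was roughly the following (a sketch; the Homebrew install step is an assumption about how QEMU got onto the machine, and the script file name is made up):
brew install qemu            # QEMU 8.1.2 from Homebrew at the time
limactl create               # default Ubuntu template, all defaults accepted
./start-stop-loop.sh         # the start/stop loop script above, run until "limactl start" fails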
Both @mook-as and I have run git bisect with this updated script:
#!/usr/bin/env bash
mkdir -p bisect
make binaries
machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0
while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
        else
            if grep --quiet 'guest agent does not seem to be running' "bisect/${commit}.log"; then
                exit 1
            fi
            errors=$((errors+1))
            if [[ "$errors" -gt 10 ]]; then
                exit 125
            fi
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done
It needs $PWD/_output/bin in the PATH, and you must have created a single instance for testing.
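For reference, a sketch of how the bisect can be driven with it (the script file name bisect.sh is made up; the exit codes follow the git bisect run convention):
export PATH="$PWD/_output/bin:$PATH"   # use the freshly built limactl from "make binaries"
limactl create                         # single test instance, all defaults
git bisect start v0.19.0 v0.18.0       # bad good
git bisect run ./bisect.sh             # exit 1 = bad commit, exit 125 = skip (too many unrelated errors)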
Bisecting between v0.18.0 and v0.19.0. For both of us (on M1 machines) the bisect identified "hostagent: avoid requiring fuser and ss in the guest; CI: update colima (v0.5.5→v0.6.5)" by AkihiroSuda (#2009) as the "first bad commit". I still have another bisect running on an Intel machine, but it is a very slow box; I may not have results until tomorrow.
I tried adding sleep 60 after the limactl stop, in case qemu was not finished by the time the hostagent exited, but it made no difference. Extending the hostagent timeout from 1min to 5min also did not make a difference (beyond it now taking 5min to report that the guestagent is unreachable).
@AkihiroSuda Any ideas?
PS: Yes, the bisect script is explicitly looking for the error message that was introduced in the PR that was eventually found. But the old code had never indicated that there was a problem communicating with the guestagent.
Note that the guestagent is running; it is the code trying to communicate with it that is failing.
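For anyone double-checking on their end, a couple of hedged commands to confirm the guest side (the /dev/virtio-ports path is the usual Linux naming for a virtio-serial channel and is an assumption here; the channel name is taken from the guest agent log):
lima sudo systemctl status lima-guestagent                 # service state in the Ubuntu guest
lima ls -l /dev/virtio-ports/io.lima-vm.guest_agent.0      # the serial channel the agent serves on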
@jandubois Are you able to view the guestagent logs? Are any connection-closed-like errors thrown?
This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.
@jandubois Are you able to view the guestagent logs? Are any connection-closed-like errors thrown?
Yes, I can see the logs. Now it happened to me for the first time on a fresh start:
$ l start
? Creating an instance "default" Proceed with the current configuration
INFO[0000] Starting the instance "default" with VM driver "qemu"
INFO[0000] QEMU binary "/opt/homebrew/bin/qemu-system-aarch64" seems properly signed with the "com.apple.security.hypervisor" entitlement
[...]
INFO[0099] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished"
INFO[0099] [hostagent] The final requirement 1 of 1 is satisfied
ERRO[0099] [guest agent does not seem to be running; port forwards will not work]
WARN[0099] DEGRADED. The VM seems running, but file sharing and port forwarding may not work. (hint: see "/Users/jan/.lima/default/ha.stderr.log")
FATA[0099] degraded, status={Running:true Degraded:true Exiting:false Errors:[guest agent does not seem to be running; port forwards will not work] SSHLocalPort:60022}
The guestagent log indicates that the agent was stopped and started again, which feels odd:
$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:21:14 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:21:14 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="event tick: 3s"
Jan 02 11:21:14 lima-default lima-guestagent[2622]: time="2024-01-02T11:21:14-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
I don't think this is always the case, but I will verify this.
There is no error in ha.stderr.log, and the instance no longer looks degraded, so maybe this is different from the other failures.
I don't think this is always the case, but I will verify this.
Have reproduced with the bisect script (just running against HEAD of master):
$ lima sudo journalctl -b -u lima-guestagent
Jan 02 11:39:44 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:44 lima-default lima-guestagent[1346]: time="2024-01-02T11:39:44-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
Jan 02 11:39:48 lima-default systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Deactivated successfully.
Jan 02 11:39:48 lima-default systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default systemd[1]: lima-guestagent.service: Consumed 5.085s CPU time.
Jan 02 11:39:48 lima-default systemd[1]: Started lima-guestagent.service - lima-guestagent.
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="event tick: 3s"
Jan 02 11:39:48 lima-default lima-guestagent[2118]: time="2024-01-02T11:39:48-08:00" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
This time the first run of the guest agent gets to emit at least the 2 startup log lines, but otherwise looks the same.
But running things on Alpine (again with HEAD of master) I don't see the guest agent being restarted (but we still go into the degraded state because communication fails):
$ l shell alpine sudo rc-status | grep guestagent
lima-guestagent [ started 00:02:14 (0) ]
$ l shell alpine sudo cat /var/log/lima-guestagent.log
time="2024-01-02T19:49:02Z" level=info msg="event tick: 3s"
time="2024-01-02T19:49:02Z" level=info msg="serving the guest agent on qemu serial file: io.lima-vm.guest_agent.0"
I guess it is possible that the first run of the guest agent didn't output anything, but I think the (0) part of the rc-status output means that supervise-daemon did not restart the service.
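A hedged cross-check of that reading (it assumes the BusyBox ps in the Alpine guest supports the etime column): compare the agent's elapsed run time against the system uptime; if they roughly match, the process was never restarted.
l shell alpine sudo ps -o pid,etime,args | grep '[l]ima-guestagent'   # elapsed time of the agent process
l shell alpine uptime                                                 # time since boot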
@balajiv113 It is pretty straightforward to reproduce the issue: you must have just a single instance, and then run the script from https://github.com/lima-vm/lima/issues/2064#issuecomment-1865215469, which will stop and restart the instance until it breaks. Or edit the script to hard-code the instance name if you have other instances that you need to keep.
Failure typically happens in the first 5-10 iterations, but sometimes takes longer.
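If you do have other instances, the hard-coding mentioned above is a one-line change near the top of the script (the instance name repro-test is made up):
machine="repro-test"    # instead of: machine="$(limactl list --format '{{.Name}}')"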
This might also be due to the use of the serial port as the channel. Maybe some error cases are not handled properly.
Possible, but in my testing simply reverting 482678b3ed408772429f50785908f8f233d1471e (and installing ss in the instance when using VZ) was enough to "fix" this problem. So the problem seems to be with the new "alive" checking code and not the serial port channel by itself.
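In case someone wants to repeat that experiment: ss ships with iproute2, so installing it in the instance is a single package install (the package names below are the usual ones and the instance names are assumed; adjust for your setup):
limactl shell default sudo apt-get install -y iproute2   # Ubuntu image
limactl shell alpine sudo apk add iproute2               # Alpine image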
@balajiv113 I was wrong; the problem does indeed seem to be due to the use of the serial port as a channel, and not related to the updated readiness test. The bisect script was not testing correctly for the error condition. I've since re-run the bisect between 0.18.0 and 0.19.0 with QEMU and the default Ubuntu image with this script:
#!/usr/bin/env bash
mkdir -p bisect
make binaries
machine="$(limactl list --format '{{.Name}}')"
commit="$(git log -1 --format=%H)"
count=0
errors=0
while [[ "$count" -lt 100 ]]; do
    limactl list
    printf "\e[0;1;32mcount=%d errors=%d\e[0m\n" "${count}" "${errors}"
    status="$(limactl list "${machine}" --format '{{ .Status }}')"
    case "$status" in
    Stopped)
        if limactl --debug start "${machine}" >"bisect/${commit}.log" 2>&1; then
            count=$((count+1))
            sleep 7
            if [ "$machine" = "alpine" ]; then
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l -p 4444 -s 0.0.0.0 2>/dev/null
            else
                limactl shell --workdir / "${machine}" sudo timeout 10 nc -l 4444
            fi
            sleep 7
            if ! grep --quiet '0.0.0.0:4444' "$HOME/.lima/${machine}/ha.stderr.log"; then
                exit 1
            fi
        else
            exit 125
        fi
        ;;
    Running) limactl stop "${machine}" ;;
    esac
done
Instead of relying on indirect evidence of the guest agent running, it opens port 4444 inside the instance and then checks ha.stderr.log to verify that a forwarding request has been received.
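The same check can be done by hand, roughly like this (instance name default assumed; it simply mirrors what the script does):
limactl shell --workdir / default sudo timeout 10 nc -l 4444 &   # open a listener inside the guest
sleep 5                                                          # give the guest agent time to report the new port
grep '0.0.0.0:4444' "$HOME/.lima/default/ha.stderr.log"          # the hostagent should have logged the forward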
This bisect identifies f947de0f837f419b40f9d0dea5b6d4e1a82a18bb as the "first bad commit" causing the script to fail.
So far I've still not seen any tests fail with VZ, just with QEMU, so only the serial port code seems to be affected.
It seems like the fix has been merged. Is there a timeline for a new release, so it can be pulled with brew upgrade lima without messing with building from master?
Released v0.20.
Will be available on Homebrew in a couple of hours
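Once the bottle is published, upgrading should be enough to pick it up:
brew update && brew upgrade lima
limactl --version   # should report 0.20.0 or later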
I use lima 0.21.0 but still have the problem.
same here, with lima 0.21.0 also
@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?
Hi @jandubois I'm on macOS, Sonoma 14.4.1, arm64.
$ uname -a
Darwin MACHINENAME 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64
@dubcl Please confirm if you are using QEMU or VZ
sorry, QEMU
$ qemu-system-x86_64 -version
QEMU emulator version 8.2.1
sorry, QEMU
Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).
Just for completeness, are you running the default template, or something else?
And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while? Does this happen with an empty instance, or only after you deploy some workload on it?
Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone to investigate.
@DrScho @dubcl Are you on Linux or macOS? x86_64 or aarch64? Using QEMU or VZ?
I am using macOS (Darwin Kernel Version 23.4.0: Fri Mar 15 00:10:42 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6000 arm64), QEMU emulator version 9.0.0, and Lima 0.21.0 with the default instance. I just started 'limactl start' and there is no problem with the guest system. But when I try to start a docker container with 'docker run --name myName -p 127.0.0.1:1521:1521/tcp myImage' I still receive 'docker: 'run --name' is not a docker command.' although the command appears to be correct and docker in the VM has the --name option.
sorry, QEMU
Thanks! That configuration should have been "fixed" by #2112 (and I could no longer reproduce it with Lima 0.20).
I already use 0.21; I think this happened when updating from 0.20 to 0.21.
Just for completeness, are you running the default template, or something else?
Yes, I guess so; this is my config:
# This template requires Lima v0.7.0 or later.
arch: "x86_64"
cpus: 4
memory: "6GiB"
vmType: "qemu"
firmware:
  legacyBIOS: true
images:
# Try to use release-yyyyMMdd image if available. Note that release-yyyyMMdd will be removed after several months.
- location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-amd64.img"
  arch: "x86_64"
  digest: "sha256:5ad255d32a30a2cda9f0df19f0a6ce8d6f3c81b63845086a4cb5c43cf97fcb92"
- location: "https://cloud-images.ubuntu.com/releases/23.04/release-20230810/ubuntu-23.04-server-cloudimg-arm64.img"
  arch: "aarch64"
  digest: "sha256:af62ca6ba307388f7e0a8ad1c46103e6aea0130a09122e818df8d711637bf998"
# Fallback to the latest release image.
# Hint: run `limactl prune` to invalidate the cache
- location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-amd64.img"
  arch: "x86_64"
- location: "https://cloud-images.ubuntu.com/releases/23.04/release/ubuntu-23.04-server-cloudimg-arm64.img"
  arch: "aarch64"
mounts:
- location: "~"
- location: "/tmp/lima"
  writable: true
And does the guestagent disconnect right away, or only after the machine goes to sleep, or just after it has been running for a while?
I'm not sure; I never put it to sleep, it is always on because I mostly use it as a desktop.
Does this happen with an empty instance, or only after you deploy some workload on it?
I noticed it when I tried to set up a port forward for an nginx image that I was testing.
Sorry for all the questions, but this bug has been really time-consuming to debug before, so any information that helps to reproduce the issue more reliably would help someone to investigate.
np, glad to help you.
Additionally, to test, I removed the default folder and started a "new default" with the same default.yaml, and everything works OK. If I get the error again, I will come back and comment.
Well, at least not initially.
[Update: it seems to start, but hostagent cannot connect to it]
We've observed this multiple times with Rancher Desktop after switching to Lima 0.19.0 (https://github.com/rancher-sandbox/rancher-desktop/issues/6151).
AFAIK, this was always with QEMU and not VZ, and always when restarting an existing instance, and not when creating a new one.
Rancher Desktop is using Alpine; I have not seen it with other distros, but that doesn't mean much.
I don't think guest-agent logs are written to file on Alpine; I will need to look into fixing that.
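A rough idea of how that could be done with OpenRC, assuming the service runs under supervise-daemon (output_log and error_log are standard OpenRC conf.d variables, but whether Lima's Alpine setup honors them here is an assumption):
# /etc/conf.d/lima-guestagent
output_log="/var/log/lima-guestagent.log"
error_log="/var/log/lima-guestagent.log"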
I've run into it twice today on different machines, but there don't seem to be any steps to reliably reproduce the problem.