lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0
15.5k stars 608 forks source link

Guest agent sometimes becomes unavailable from the host #2227

Open roman-kiselenko opened 9 months ago

roman-kiselenko commented 9 months ago

Description

I've found this abiosoft/colima/issues/994 and it looks exactly like mine.

After sometime the created VM started to loose connection to guestagent due to missing ga.sock.

journalctl -e -u lima-guestagent.service -f ```log -- Boot 3c14c00ca1b045588ca9d3a840437668 -- Mar 02 15:34:31 lima-cri-tools systemd[1]: Started lima-guestagent. Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="event tick: 3s" Mar 02 15:34:33 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:33+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\"" Mar 02 15:34:36 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:36+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:36Z systime=2024-03-02T15:34:36+03:00 delta=-150.01032ms" Mar 02 15:34:46 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:46+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:46Z systime=2024-03-02T15:34:46+03:00 delta=-152.18532ms" Mar 02 15:34:56 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:34:56+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:34:56Z systime=2024-03-02T15:34:56+03:00 delta=-155.54332ms" Mar 02 15:35:06 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:06+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:06Z systime=2024-03-02T15:35:06+03:00 delta=-151.786742ms" Mar 02 15:35:16 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:16+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:16Z systime=2024-03-02T15:35:16+03:00 delta=-146.093647ms" Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true" Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true" Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true" Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true" Mar 02 15:35:21 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:21+03:00" level=debug msg="setWorthCheckingIPTablesRoutine(): setting to true" Mar 02 15:35:26 lima-cri-tools lima-guestagent[1418]: time="2024-03-02T15:35:26+03:00" level=debug msg="fixSystemTimeSkew: rtc=2024-03-02T12:35:26Z systime=2024-03-02T15:35:26+03:00 delta=-149.117589ms" Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopping lima-guestagent... Mar 02 15:35:27 lima-cri-tools systemd[1]: lima-guestagent.service: Deactivated successfully. Mar 02 15:35:27 lima-cri-tools systemd[1]: Stopped lima-guestagent. Mar 02 15:35:27 lima-cri-tools systemd[1]: Started lima-guestagent. Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="event tick: 3s" Mar 02 15:35:28 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:28+03:00" level=info msg="serving the guest agent on \"/run/lima-guestagent.sock\"" Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="LocalPorts(): worthCheckingIPTables=true" Mar 02 15:35:34 lima-cri-tools lima-guestagent[2425]: time="2024-03-02T15:35:34+03:00" level=debug msg="Get info %+v{[{0.0.0.0 22} {127.0.0.1 35967} {127.0.0.53 53} {:: 22}]}" ```

The logs above shows that the lima-guestagent.service has been restarted (🤔 not sure the reason). This leads to the missing connection from the host:

WARN[0110] [hostagent] connection to the guest agent was closed unexpectedly

Missing connection leads to ga.sock deleting.

I tried to comment ga.sock deleting, and it appears to be a fix.

diff ```diff diff --git a/pkg/hostagent/hostagent.go b/pkg/hostagent/hostagent.go index 92c5073..ee4c9a3 100644 --- a/pkg/hostagent/hostagent.go +++ b/pkg/hostagent/hostagent.go @@ -605,7 +605,7 @@ func (a *HostAgent) watchGuestAgentEvents(ctx context.Context) { } } } else { - logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly", err) + logrus.WithError(err).Warn("connection to the guest agent was closed unexpectedly\n", err) } select { case <-ctx.Done(): @@ -728,13 +728,14 @@ func forwardSSH(ctx context.Context, sshConfig *ssh.SSHConfig, port int, local, } } else { logrus.Infof("Forwarding %q (guest) to %q (host)", remote, local) - if err := os.RemoveAll(local); err != nil { - logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local) - } - } - if err := os.MkdirAll(filepath.Dir(local), 0o750); err != nil { - return fmt.Errorf("can't create directory for local socket %q: %w", local, err) + logrus.Infof("Delete files %q (local)", local) + // if err := os.RemoveAll(local); err != nil { + // logrus.WithError(err).Warnf("Failed to clean up %q (host) before setting up forwarding", local) + // } } ```

hostagent reconnect and everything is working again.

limactl start crio-tools ```sh ..... INFO[0106] [hostagent] Waiting for the guest agent to be running INFO[0106] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host) INFO[0106] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local) WARN[0106] [hostagent] Failed to set up forward from "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host) WARN[0106] [hostagent] connection to the guest agent was closed unexpectedly dial unix /Users/user/.lima/cri-tools/ga.sock: connect: no such file or directory INFO[0116] [hostagent] Forwarding "/run/lima-guestagent.sock" (guest) to "/Users/user/.lima/cri-tools/ga.sock" (host) INFO[0116] [hostagent] Delete files "/Users/user/.lima/cri-tools/ga.sock" (local) INFO[0116] [hostagent] Guest agent is running INFO[0116] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished" INFO[0116] [hostagent] Not forwarding TCP [::]:22 INFO[0116] [hostagent] Not forwarding TCP 127.0.0.53:53 INFO[0116] [hostagent] Forwarding TCP from 127.0.0.1:44569 to 127.0.0.1:44569 INFO[0116] [hostagent] Not forwarding TCP 0.0.0.0:22 INFO[0116] [hostagent] The final requirement 1 of 1 is satisfied INFO[0116] READY. Run `limactl shell cri-tools` to open the shell. ```

I see two options here (the second one is better, I guess, but I can't find the reason):

  1. In case of missing connection to the guestagent trying to reconnect and don't delete ga.sock
  2. Find a reason why lima-guestagent.service is failing.
lima.yml ```yaml vmType: null os: null arch: x86_64 images: - location: "https://cloud-images.ubuntu.com/releases/22.04/release-20240125/ubuntu-22.04-server-cloudimg-amd64.img" arch: "x86_64" digest: "sha256:f23e5b91af7f52bdd4680d8e9d091093bfc6c4a46b0da7baa3cb450ee917b66b" cpus: 6 memory: 8GiB disk: null mounts: - location: "~/Documents/dev/opensource/cri-tools" mountPoint: null writable: true sshfs: cache: null followSymlinks: null sftpDriver: null 9p: securityModel: null protocolVersion: null msize: null cache: "mmap" - location: "/tmp/lima" writable: true mountType: 9p additionalDisks: ssh: localPort: 0 loadDotSSHPubKeys: null forwardAgent: null forwardX11: null forwardX11Trusted: null caCerts: removeDefaults: null files: certs: upgradePackages: true containerd: system: false user: false cpuType: aarch64: null armv7l: null x86_64: null rosetta: enabled: null binfmt: null timezone: null firmware: legacyBIOS: null audio: device: null video: display: null vnc: display: null networks: propagateProxyEnv: null hostResolver: enabled: null ipv6: null hosts: guestInstallPrefix: null plain: null ```
> limactl --version
limactl version 0.20.1
> qemu-system-x86_64 --version
QEMU emulator version 8.2.1
Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
AkihiroSuda commented 9 months ago

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

roman-kiselenko commented 9 months ago

Is this issue specific to colima?

Do you see the issue with a plain Lima instance?

This behaviour is specific to lima.

roman-kiselenko commented 8 months ago

Another related issue https://github.com/abiosoft/colima/issues/999#issuecomment-1987873799

roman-kiselenko commented 7 months ago

ref https://github.com/lima-vm/lima/issues/2064