lima-vm / lima

Linux virtual machines, with a focus on running containers
https://lima-vm.io/
Apache License 2.0
14.66k stars 572 forks source link

connection to the guest agent was closed unexpectedlyunexpected EOF #2045

Open terev opened 7 months ago

terev commented 7 months ago

Description

colima version && limactl --version && sw_vers
colima version 0.6.6
git commit: 9ed7f4337861931b4d0192ca5409683a4b7d1cdc

runtime: docker
arch: aarch64

limactl version HEAD-cac97ed

ProductName:        macOS
ProductVersion:     14.1.1

I've recently been getting an issue where docker commands start failing because the connection through the forwarded socket is getting closed. It appears that the ssh process dies for some reason. While the socket is working lsof shows the ssh process has it open. But when the socket becomes unresponsive, lsof no longer shows this and the process id no longer exists.

from ~/.colima/_lima/colima/ha.stderr.log :

{"error":"unexpected EOF","level":"warning","msg":"connection to the guest agent was closed unexpectedlyunexpected EOF","time":"2023-11-30T23:21:26-05:00"}

Maybe in this scenario the port forwards could at least be re initialized. But it'd be great if we could avoid the ssh process dying altogether.

AkihiroSuda commented 7 months ago

With qemu or vz ?

terev commented 7 months ago

Oh sorry forgot to add that. This is VZ.

balajiv113 commented 7 months ago

@terev Will you be able to attach the log for this ?

We recently updated guest agent communication. Technically the connection should be re-established and this should not have any issues with ssh / port forwarding

terev commented 7 months ago

@balajiv113 sure. just the host agent log?

terev commented 7 months ago

~/.colima/_lima/colima/ha.stderr.log

{"level":"debug","msg":"Creating iso file /Users/trevorfoster/.colima/_lima/colima/cidata.iso","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"Using /var/folders/qm/13cn0c7x62n287t46wlmsx5r0000gp/T/diskfs_iso1103814643 as workspace","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"Failed to detect CPU features. Assuming that AES acceleration is available on this Apple silicon.","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"OpenSSH version 9.4.1 detected","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"AES accelerator seems available, prioritizing aes128-gcm@openssh.com and aes256-gcm@openssh.com","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"hostagent socket created at /Users/trevorfoster/.colima/_lima/colima/ha.sock","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"Starting VZ (hint: to watch the boot progress, see \"/Users/trevorfoster/.colima/_lima/colima/serial*.log\")","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"Setting up Rosetta share","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"[VZ] - vm state change: \"VirtualMachineStateStarting\"","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"Waiting for the essential requirement 1 of 2: \"ssh\"","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"executing script \"ssh\"","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"[VZ] - vm state change: running","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"executing ssh for script \"ssh\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/trevorfoster/.colima/_lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers=\"^aes128-gcm@openssh.com,aes256-gcm@openssh.com\" -o User=trevorfoster -o ControlMaster=auto -o ControlPath=\"/Users/trevorfoster/.colima/_lima/colima/ssh.sock\" -o ControlPersist=yes -p 60306 127.0.0.1 -- /bin/bash]","time":"2023-11-30T15:30:11-05:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"ssh: connect to host 127.0.0.1 port 60306: Connection refused\\r\\n\", err=failed to execute script \"ssh\": stdout=\"\", stderr=\"ssh: connect to host 127.0.0.1 port 60306: Connection refused\\r\\n\": exit status 255","time":"2023-11-30T15:30:11-05:00"}
{"level":"info","msg":"Waiting for the essential requirement 1 of 2: \"ssh\"","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"executing script \"ssh\"","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"executing ssh for script \"ssh\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/trevorfoster/.colima/_lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers=\"^aes128-gcm@openssh.com,aes256-gcm@openssh.com\" -o User=trevorfoster -o ControlMaster=auto -o ControlPath=\"/Users/trevorfoster/.colima/_lima/colima/ssh.sock\" -o ControlPersist=yes -p 60306 127.0.0.1 -- /bin/bash]","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"\", err=\u003cnil\u003e","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"The essential requirement 1 of 2 is satisfied","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"Waiting for the essential requirement 2 of 2: \"user session is ready for ssh\"","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"executing script \"user session is ready for ssh\"","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"executing ssh for script \"user session is ready for ssh\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/trevorfoster/.colima/_lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers=\"^aes128-gcm@openssh.com,aes256-gcm@openssh.com\" -o User=trevorfoster -o ControlMaster=auto -o ControlPath=\"/Users/trevorfoster/.colima/_lima/colima/ssh.sock\" -o ControlPersist=yes -p 60306 127.0.0.1 -- /bin/bash]","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"+ timeout 30s bash -c 'until sudo diff -q /run/lima-ssh-ready /mnt/lima-cidata/meta-data 2\u003e/dev/null; do sleep 3; done'\\n\", err=\u003cnil\u003e","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"The essential requirement 2 of 2 is satisfied","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"Waiting for the guest agent to be running","time":"2023-11-30T15:30:21-05:00"}
{"level":"debug","msg":"Forwarding unix sockets","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"Forwarding \"/var/run/docker.sock\" (guest) to \"/Users/trevorfoster/.colima/default/docker.sock\" (host)","time":"2023-11-30T15:30:21-05:00"}
{"level":"info","msg":"Forwarding \"/var/run/docker.sock\" (guest) to \"/Users/trevorfoster/.colima/docker.sock\" (host)","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"Guest agent is running","time":"2023-11-30T15:30:22-05:00"}
{"level":"debug","msg":"guest agent info: \u0026{LocalPorts:[{IP::: Port:22} {IP:127.0.0.54 Port:53} {IP:127.0.0.53 Port:53}]}","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"Waiting for the final requirement 1 of 1: \"boot scripts must have finished\"","time":"2023-11-30T15:30:22-05:00"}
{"level":"debug","msg":"guest agent event: {Time:2023-11-30 20:30:22.137865462 +0000 UTC LocalPortsAdded:[{IP:127.0.0.54 Port:53} {IP:127.0.0.53 Port:53} {IP::: Port:22}] LocalPortsRemoved:[] Errors:[]}","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"Not forwarding TCP 127.0.0.54:53","time":"2023-11-30T15:30:22-05:00"}
{"level":"debug","msg":"executing script \"boot scripts must have finished\"","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"Not forwarding TCP 127.0.0.53:53","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2023-11-30T15:30:22-05:00"}
{"level":"debug","msg":"executing ssh for script \"boot scripts must have finished\": /usr/bin/ssh [ssh -F /dev/null -o IdentityFile=\"/Users/trevorfoster/.colima/_lima/_config/user\" -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NoHostAuthenticationForLocalhost=yes -o GSSAPIAuthentication=no -o PreferredAuthentications=publickey -o Compression=no -o BatchMode=yes -o IdentitiesOnly=yes -o Ciphers=\"^aes128-gcm@openssh.com,aes256-gcm@openssh.com\" -o User=trevorfoster -o ControlMaster=auto -o ControlPath=\"/Users/trevorfoster/.colima/_lima/colima/ssh.sock\" -o ControlPersist=yes -p 60306 127.0.0.1 -- /bin/bash]","time":"2023-11-30T15:30:22-05:00"}
{"level":"debug","msg":"stdout=\"\", stderr=\"+ timeout 30s bash -c 'until sudo diff -q /run/lima-boot-done /mnt/lima-cidata/meta-data 2\u003e/dev/null; do sleep 3; done'\\n\", err=\u003cnil\u003e","time":"2023-11-30T15:30:22-05:00"}
{"level":"info","msg":"The final requirement 1 of 1 is satisfied","time":"2023-11-30T15:30:22-05:00"}
{"error":"unexpected EOF","level":"warning","msg":"connection to the guest agent was closed unexpectedlyunexpected EOF","time":"2023-11-30T23:21:26-05:00"}
{"level":"info","msg":"Guest agent is running","time":"2023-11-30T23:21:36-05:00"}
{"level":"debug","msg":"guest agent info: \u0026{LocalPorts:[{IP:127.0.0.54 Port:53} {IP:127.0.0.53 Port:53} {IP::: Port:22}]}","time":"2023-11-30T23:21:36-05:00"}
{"level":"debug","msg":"guest agent event: {Time:2023-12-01 04:21:36.130323144 +0000 UTC LocalPortsAdded:[{IP:127.0.0.54 Port:53} {IP:127.0.0.53 Port:53} {IP::: Port:22}] LocalPortsRemoved:[] Errors:[]}","time":"2023-11-30T23:21:36-05:00"}
{"level":"info","msg":"Not forwarding TCP 127.0.0.54:53","time":"2023-11-30T23:21:36-05:00"}
{"level":"info","msg":"Not forwarding TCP 127.0.0.53:53","time":"2023-11-30T23:21:36-05:00"}
{"level":"info","msg":"Not forwarding TCP [::]:22","time":"2023-11-30T23:21:36-05:00"}
terev commented 7 months ago

Something interesting I noticed, but probably doesn't matter is all of the processes are detached: ps -fx | grep -i lima

ps -fx | grep -i lima
31235     1   0 12:38AM ??         0:00.66 /opt/homebrew/bin/colima daemon start default --inotify --inotify-runtime docker --inotify-dir /Users/trevorfoster/ --inotify-dir /tmp/colima/
31270     1   0 12:38AM ??         0:00.09 ssh: /Users/trevorfoster/.colima/_lima/colima/ssh.sock [mux]
85750     1   0  3:30PM ttys000    0:40.61 /opt/homebrew/bin/limactl usernet -p /Users/trevorfoster/.colima/_lima/_networks/user-v2/usernet_user-v2.pid -e /Users/trevorfoster/.colima/_lima/_networks/user-v2/user-v2_ep.sock --listen-qemu /Users/trevorfoster/.colima/_lima/_networks/user-v2/user-v2_qemu.sock --listen /Users/trevorfoster/.colima/_lima/_networks/user-v2/user-v2_fd.sock --subnet 192.168.5.0/24 --leases 192.168.5.1=52:55:55:60:93:05,192.168.5.2=5a:94:ef:e4:0c:dd
31248     1   0 12:38AM ttys001    0:00.48 /opt/homebrew/bin/limactl hostagent --pidfile /Users/trevorfoster/.colima/_lima/colima/ha.pid --socket /Users/trevorfoster/.colima/_lima/colima/ha.sock colima

Which I guess makes sense because i assume the cli starts them all.

My thought around this is, could the nofile limit somehow not be taking effect if they're detached?

terev commented 7 months ago

Just reproduced again. But this time the guest agent isn't responding at all:

{"error":"unable to connect to guest agent via vsock port 2222","level":"warning","msg":"connection to the guest agent was closed unexpectedlyunable to connect to guest agent via vsock port 2222","time":"2023-12-01T01:51:03-05:00"}
balajiv113 commented 7 months ago

@terev Are you able to ssh into guest ?? If so, try to collect logs from sudo journalctl -u lima-guestagent.service -f

terev commented 7 months ago

@balajiv113 Oh thanks! Yeah I was just able to get a log by tailing while it happens:

sudo journalctl -u lima-guestagent.service -f -n 100
Dec 01 07:22:57 colima systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 01 07:22:57 colima lima-guestagent[1126]: time="2023-12-01T07:22:57Z" level=info msg="event tick: 3s"
Dec 01 07:22:58 colima lima-guestagent[1126]: time="2023-12-01T07:22:58Z" level=info msg="serving the guest agent on vsock port: 2222"
Dec 01 07:23:01 colima systemd[1]: Stopping lima-guestagent.service - lima-guestagent...
Dec 01 07:23:01 colima systemd[1]: lima-guestagent.service: Deactivated successfully.
Dec 01 07:23:01 colima systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Dec 01 07:23:01 colima systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 01 07:23:01 colima lima-guestagent[1820]: time="2023-12-01T07:23:01Z" level=info msg="event tick: 3s"
Dec 01 07:23:01 colima lima-guestagent[1820]: time="2023-12-01T07:23:01Z" level=info msg="serving the guest agent on vsock port: 2222"
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: A process of this unit has been killed by the OOM killer.
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: Main process exited, code=killed, status=9/KILL
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: Failed with result 'oom-kill'.
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: Consumed 47.606s CPU time.
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: Scheduled restart job, restart counter is at 1.
Dec 01 07:30:14 colima systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Dec 01 07:30:14 colima systemd[1]: lima-guestagent.service: Consumed 47.606s CPU time.
Dec 01 07:30:14 colima systemd[1]: Started lima-guestagent.service - lima-guestagent.
Dec 01 07:30:44 colima systemd[1]: lima-guestagent.service: Main process exited, code=killed, status=9/KILL
Dec 01 07:30:44 colima systemd[1]: lima-guestagent.service: Failed with result 'signal'.
Dec 01 07:30:44 colima systemd[1]: lima-guestagent.service: Consumed 16.029s CPU time.
Dec 01 07:30:44 colima systemd[1]: lima-guestagent.service: Scheduled restart job, restart counter is at 2.
Dec 01 07:30:45 colima systemd[1]: Stopped lima-guestagent.service - lima-guestagent.
Dec 01 07:30:45 colima systemd[1]: lima-guestagent.service: Consumed 16.029s CPU time.
Dec 01 07:30:49 colima systemd[1]: Started lima-guestagent.service - lima-guestagent.
FATA[0531] exit status 255
balajiv113 commented 7 months ago

@terev Looks like this process was killed due to out of memory. Since guestagent is not running host was not able to connect to it

terev commented 7 months ago

@balajiv113 Should we maybe adjust the OOM score for the guest agent? It seems like a very important process and in this scenario it would have made more sense for the docker build processes to be killed. I'd be interested to see the memory consumption of this process though because its not expected to consume much memory right?

balajiv113 commented 7 months ago

I'd be interested to see the memory consumption of this process

@terev I doubt if guest agent was consuming more memory i think its the other process due to which guest agent was closed

terev commented 7 months ago

@balajiv113 yeah I watched top while running the build to reproduce, and the guest agent didn't take up much memory from what I saw.

maybe we should set the oom score adjustment to a negative number like docker does on the daemon https://docs.docker.com/config/containers/resource_constraints/#understand-the-risks-of-running-out-of-memory . this would make it less likely it's the process killed. though I still didn't figure out why the container processes using the majority of memory weren't killed.

louhisuo commented 4 months ago

I am experiencing same / similar issue. It is pretty frequent. I am using lima 0.20.1 and qemu 8.2.1. My use case is multi-node Kubernetes cluster and did not have this issue previously with older lima + qemu combination.