lima-vm / socket_vmnet

vmnet.framework support for unmodified rootless QEMU (no dependency on VDE)
Apache License 2.0

socket_vmnet failing on M1 (`start(): vmnet_return_t VMNET_FAILURE`) #7

Open jandubois opened 2 years ago

jandubois commented 2 years ago

I've now observed the error from https://github.com/lima-vm/lima/pull/1049 two more times (qemu failing to start up because fd_connect throws an error). Both times have been on an M1 mini; I cannot remember if the bug report on the lima repo was also based on a failure on M1, or if it was Intel.

Unfortunately I've been running with lima 0.12.0, which doesn't have the error reporting fix. However, I can see errors in the daemon logs (after qemu failed):

jan@zilicon _networks % cat rancher-desktop-shared_socket_vmnet.stderr.log
start(): vmnet_return_t VMNET_FAILURE
start: Undefined error: 0
jan@zilicon _networks % cat rancher-desktop-shared_socket_vmnet.stdout.log
Initializing vmnet.framework (mode 1001)
jan@zilicon _networks % cat rancher-desktop-bridged_en0_socket_vmnet.stderr.log
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0

The bridged network was running, but the shared network was not.

The only way I found to get things working again was by rebooting the machine.

AkihiroSuda commented 2 years ago

Does this error happen with vde_vmnet too? The vmnet code is almost unchanged from vde_vmnet.

jandubois commented 2 years ago

Does this error happen with vde_vmnet too?

It is possible, but I haven't seen it. One difference is that with socket_vmnet the failure is catastrophic: qemu will not start the VM. With vde_vmnet you would just not get an IP address on the interface, so you might not notice unless you use the external IP address for ingress.

We have seen on Rancher Desktop that some users don't get an IP address in specific environments, but have never been able to determine the reason for it. Maybe it is related, but I don't know. We detect this and configure flannel with the SLIRP interface when that happens, so things are still working with reduced functionality in that case.
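
The fallback described above can be sketched roughly as follows. This is only an illustration, not the actual Rancher Desktop code: the function name `pick_flannel_iface`, the interface names, and the addresses are all hypothetical, and how the address list is collected from the guest is left out.

```python
import ipaddress


def pick_flannel_iface(addrs_by_iface, preferred, fallback):
    """Return `preferred` if it has a usable IPv4 address, else `fallback`.

    `addrs_by_iface` maps an interface name to the list of address strings
    currently assigned to it (hypothetical input; gathering it is out of scope).
    """
    for raw in addrs_by_iface.get(preferred, []):
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            continue  # ignore anything that is not a valid IP address
        # A link-local or IPv6-only interface is treated as "no DHCP lease".
        if addr.version == 4 and not addr.is_link_local:
            return preferred
    return fallback


if __name__ == "__main__":
    # Hypothetical names: "lima0" for the vmnet interface, "eth0" for SLIRP.
    observed = {"lima0": [], "eth0": ["192.168.5.15"]}
    print(pick_flannel_iface(observed, preferred="lima0", fallback="eth0"))
```

The point of the sketch is only that a missing address on the vmnet interface degrades functionality instead of breaking the cluster, which is why the vde_vmnet failure mode was easy to miss.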

jandubois commented 2 years ago

It is possible, but I haven't seen it.

All the failures I've seen last week were on a remote M1 mini that is running inside the Vancouver office, so it is a different environment from what I regularly use. However, the failures were neither immediate nor frequent; they happened just once a day, after restarting VMs (and daemons) multiple times. The machine was running Big Sur, whereas my regular Intel machine is running Catalina.

medyagh commented 1 year ago

I have also noticed this. Changing my location (and using a different Wi-Fi network) has caused problems that I was able to fix only by uninstalling, rebooting, and reinstalling.

mprimeaux commented 1 year ago

A bit more information: I can confirm that my DHCP 'server' is allocating an address to socket_vmnet, as I receive a 'new device detected' alert from my firewall.

Tailing the stderr shows the same errors as reported by @jandubois.

on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Undefined error: 0

I'm running macOS Ventura 13.1 (22C65).
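
When the stderr log repeats like this, a per-message count is easier to compare across reports than the raw tail. A minimal sketch, assuming the log is plain text with one message per line; `summarize_log` is a hypothetical helper, not part of socket_vmnet or lima:

```python
from collections import Counter


def summarize_log(text: str) -> Counter:
    """Count how often each distinct non-empty line occurs in `text`."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


if __name__ == "__main__":
    # Sample taken from the log excerpt in this thread.
    sample = (
        "on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT\n"
        "vmnet_write: Undefined error: 0\n"
        "on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT\n"
        "vmnet_write: Undefined error: 0\n"
    )
    for line, count in summarize_log(sample).most_common():
        print(f"{count:3d}  {line}")
```

To run it against a real log, read the file contents and pass them to `summarize_log`; the path depends on how socket_vmnet was started.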

ProjectJYL commented 1 year ago

I ran into a similar issue, but in socket mode instead of shared mode, because socket_vmnet is "unmanaged", meaning it's started and stopped by brew services. The first time I started the VMs for the day, everything worked fine. After a couple of minutes, the VM network became unreachable. I was not able to start the VM again after it was stopped.

ha.stderr.log

{"level":"debug","msg":"QEMU version 8.0.2 detected","time":"2023-07-18T13:28:12-04:00"}
{"level":"debug","msg":"firmware candidates = [/Users/jylee/.local/share/qemu/edk2-aarch64-code.fd /opt/homebrew/share/qemu/edk2-aarch64-code.fd /usr/share/AAVMF/AAVMF_CODE.fd /usr/share/qemu-efi-aarch64/QEMU_EFI.fd]","time":"2023-07-18T13:28:12-04:00"}
{"level":"fatal","msg":"template: :1:21: executing \"\" at \u003cfd_connect \"/opt/homebrew/var/run/socket_vmnet\"\u003e: error calling fd_connect: fd_connect: dial unix /opt/homebrew/var/run/socket_vmnet: connect: connection refused","time":"2023-07-18T13:28:12-04:00"}

The socket_vmnet service itself shows

% sudo brew services list
Name         Status     User File
socket_vmnet error  256 root /Library/LaunchDaemons/homebrew.mxcl.socket_vmnet.plist
unbound      none

and /opt/homebrew/var/log/socket_vmnet/stderr shows some iterations of these logs

vmnet_write: Bad file descriptor
writev: Bad file descriptor
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
writev: Broken pipe
on_accept(): vmnet_return_t VMNET_INVALID_ARGUMENT
vmnet_write: Broken pipe

To restore the network, I had to restart the socket_vmnet service and all the VMs. After a while, the problem repeats. Is there any other workaround for this?

By the way, this doesn't just happen in "socket" mode, in case you're wondering. It also happened in "shared" mode, where socket_vmnet is managed by lima.

I have an M1 MacBook Pro on macOS Ventura 13.4.1, with socket_vmnet 1.1.2.
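
The `connection refused` from fd_connect in the log above means that the socket file exists but nothing is accepting connections on it. A quick way to tell that case apart from a missing socket, before starting a VM, is to try connecting to the path directly. A minimal sketch, assuming the Homebrew socket path from the log; `probe_unix_socket` is a hypothetical helper. Note that a successful probe opens and immediately closes a real connection, so the daemon will see a short-lived client.

```python
import errno
import socket


def probe_unix_socket(path: str, timeout: float = 1.0) -> str:
    """Return "ok", "missing", "refused", or "error: <errno name>" for `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
    except FileNotFoundError:
        return "missing"   # no socket file at this path
    except ConnectionRefusedError:
        return "refused"   # socket file exists, but no listener behind it
    except OSError as exc:
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    else:
        return "ok"
    finally:
        sock.close()


if __name__ == "__main__":
    # Path taken from the fd_connect error above; adjust for other setups.
    print(probe_unix_socket("/opt/homebrew/var/run/socket_vmnet"))
```

A "refused" result matches the state where the service shows an error in `brew services list`, so it only confirms the symptom; it does not explain why the daemon exited.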

saintdle commented 1 year ago

I'm seeing the same behaviour on macOS 13.4.1 (c) on an M2, with socket_vmnet 1.1.2 and lima 0.16.0.

I can build and start a new VM, but as soon as I stop it, I see the same behaviour as described here, with the same stderr output. The difference is that I don't see an error on the service; as it's not running, I can't restart it.

sudo brew services
Name         Status User File
socket_vmnet none   

The only fix I've found so far is to uninstall socket_vmnet and reinstall it.