Closed: markrattray closed this issue 3 weeks ago.
Iterating over all the host interfaces to try to clean up potential conflicts shouldn't be needed, and it may actually be dangerous: in some environments it's perfectly valid to have the same MAC on multiple interfaces, so arbitrarily deleting them could cause a whole range of issues.
I spent around 30 minutes trying to reproduce the issue you're describing, both by killing QEMU to simulate a hard crash and by triggering reboots from within a VM (which seems to be the trigger for you), but I never managed to make it happen here, so we're going to need some kind of somewhat reliable reproducer.
Looking at the macvlan NIC cleanup logic, I'm not seeing anything wrong in there. As soon as the VM comes down, it triggers the `onStop` action, which then iterates over all the devices on the instance and calls their `Stop` function. In the macvlan case, this returns a function that deletes the host device. I also did a test build here to make sure that code path is properly hit during an instance-initiated reboot, and it was.
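To make the flow above concrete, here is a minimal Python sketch of that cleanup pattern. All names (`MacvlanNIC`, `stop`, `on_stop`) are illustrative, not Incus's actual Go API: each device's stop call returns a post-stop hook, and the instance's stop handler runs every hook so the host-side link gets deleted once the VM exits.

```python
# Hypothetical model of the described cleanup flow (Incus itself is
# written in Go; this is only an illustration of the pattern).

class MacvlanNIC:
    def __init__(self, host_name, deleted):
        self.host_name = host_name  # e.g. the host-side "maceXXXXXXX" link
        self.deleted = deleted      # records simulated "ip link delete" calls

    def stop(self):
        # Return a hook to run once the instance has fully stopped.
        def post_stop():
            # In the real code path this would run: ip link delete <host_name>
            self.deleted.append(self.host_name)
        return post_stop


def on_stop(devices):
    # Collect every device's post-stop hook, then run them all.
    hooks = [d.stop() for d in devices]
    for hook in hooks:
        hook()


deleted = []
on_stop([MacvlanNIC("mace74a984e", deleted)])
print(deleted)  # ['mace74a984e']
```

If any step in this chain is skipped (for example, the hooks never run after a hard crash), the host link survives and still owns the VM's MAC, which is what the later start error suggests.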
If you can reproduce this somewhat reliably with a VM, it'd be good to run `incus monitor --pretty` on the system it's running on, then reboot the VM and watch it hit the issue. That should show us a better trace of all the calls being made.
Having the full `incus config show --expanded` output for an affected VM would also help, as it's certainly possible that other devices or configuration options are impacting this.
Good morning. Sorry, I had a few emergencies so I've been away. Thank you for your efforts in checking all this out.
It's a bit random, unfortunately, and I've been rebooting VMs based on the same image regularly. The problematic ones did carry a lot more workload than the ones I was rebooting. I'm working this Sunday, so I'll see if I can reproduce the scenario again.
It might have something to do with the network setup on these hosts. OVN wanted a dedicated NIC or a bridge, so to test OVN I deployed a bridge and then OVN on a single NIC, but I'm still using macvlan NICs for instances due to a routing issue to/from external networks and routed OVN networks.
@markrattray did you have any luck on reproducing this somewhat reliably?
Good morning.
I'll close this now because it hasn't happened in a while. It might have had something to do with the post-cluster-upgrade issue that you fixed for us, where we had FQDN-to-localhost entries in the hosts file, which caused issues on this cluster.
The issue has not reoccurred in a while.
Required information
```
incus start {instance-name}
Error: Failed to start device "eth0": Failed adding link: Failed to run: ip link add name mace74a984e link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)
```
This is the only entry around that time:

```
time="2024-07-11T00:44:15Z" level=error msg="Failed to cleanly stop instance" err="Failed to start device \"eth0\": Failed adding link: Failed to run: ip link add name mac450200c0 link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)" instance=someinstance instanceType=virtual-machine project=someproject
```
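The "Address already in use" error comes from the kernel refusing to create a macvtap link whose MAC is already claimed by another link on the same parent, so a stale `mac...` link left behind from a previous boot would trigger exactly this. A minimal Python sketch of that check follows; it is purely illustrative (the real check happens in the kernel's RTNETLINK handler, and `ip_link_add` is a made-up helper name):

```python
# Illustrative model of the kernel-side check behind
# "RTNETLINK answers: Address already in use": creating a macvtap
# link roughly fails if another link on the same parent device
# already owns the requested MAC address.

links = {}  # (parent, mac) -> link name

def ip_link_add(name, parent, mac):
    key = (parent, mac.lower())
    if key in links:
        raise RuntimeError("RTNETLINK answers: Address already in use")
    links[key] = name

# First boot creates the host-side link...
ip_link_add("mace74a984e", "br0", "00:16:3e:24:a3:7a")

# ...and if it is never cleaned up, the next start attempt fails:
try:
    ip_link_add("mac450200c0", "br0", "00:16:3e:24:a3:7a")
except RuntimeError as exc:
    print(exc)  # RTNETLINK answers: Address already in use
```

Note that the two log entries show different generated link names (`mace74a984e` vs `mac450200c0`) but the same MAC, which is consistent with a leftover link still holding the address.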