lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Request to clean up an instance's previous virtual NIC from the host's network stack. #983

Closed: markrattray closed this 3 weeks ago

markrattray commented 3 months ago

# Required information

   * Storage backend in use: Incus-managed ZFS on local disks.

# Issue description

At the moment we are still using macvlan-type NICs with Incus instances, until we're able to move to OVN.

Sometimes VMs (mostly Windows) are unable to start because the virtual NIC from the previous boot (recorded in `volatile.eth0.host_name`) is still bound to the Incus node's parent NIC.
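For reference, the host-side interface name Incus recorded for the device can be read from the instance's volatile config; a quick check, using `someinstance` as a placeholder name:

```sh
# Prints the host-side interface created for eth0 on the last start,
# e.g. "mac450200c0".
incus config get someinstance volatile.eth0.host_name
```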

It's not limited to one Incus node, and the software installed on them is quite different:
  * Incus node 01: MS SQL Server with an additional block device on enterprise-grade NVMe for data; root disk on enterprise-grade storage
  * Incus node 03: MS IIS server with no additional devices; root disk on enterprise-grade RAID10 SSDs
  * Both are running Windows Server 2022.
  * I tested another instance created from the same image, with 4 additional block devices attached, and after an update and reboot it came up fine.

This happened before under LXD and was partially resolved:

```
incus start {instance-name}
Error: Failed to start device "eth0": Failed adding link: Failed to run: ip link add name mace74a984e link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)
```

The manual steps to get the VM back up and running are as follows (a scripted version is sketched after this list):
  1. find the Incus node that the instance is running on
  2. on that node, run: `ip link show | grep -B 1 '{instance-mac-address}'`
  3. delete the lingering virtual NIC, e.g.: `sudo ip link delete mac0f01152c`
  4. start the instance
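A minimal sketch of those steps as a script, assuming the lingering interface is the only link on the host carrying that MAC and the instance is currently stopped (`someinstance` is a placeholder name):

```sh
#!/bin/sh
# Sketch only: remove a lingering macvtap interface that still carries the
# instance's MAC, then start the instance. Run on the Incus node while the
# instance is stopped; assumes the MAC is unique on this host.
INSTANCE=someinstance   # placeholder name

# MAC that Incus assigned to the instance's eth0 device.
MAC=$(incus config get "$INSTANCE" volatile.eth0.hwaddr)

# Find the host-side link that still holds that MAC (one line per link
# thanks to -o); strip the "@parent" suffix and trailing colon.
LINK=$(ip -o link show | grep -i "$MAC" | head -n1 \
       | cut -d' ' -f2 | sed 's/@.*//; s/:$//')

if [ -n "$LINK" ]; then
    sudo ip link delete "$LINK"
fi

incus start "$INSTANCE"
```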

My request is for Incus to do the following during the VM startup process:
  1. get the instance's MAC address(es)
  2. check the Incus node's IP stack for lingering virtual NICs with those MACs and delete them
  3. continue with the VM startup

I think the previous attempts at fixing this may have been too granular, and there may be no need to handle each scenario where this can happen separately: since I have observed that these virtual NICs change at every startup, it may be enough to simply clean up on every start. That said, I might not be aware of other scenarios where my suggestion would cause problems.

Thanks

# Steps to reproduce

 1. A VM crashes or reboots (in this scenario, 2 affected VMs rebooted after Windows updates)
 2. The VM doesn't come back up
 3. Try starting it manually and observe the error: `Address already in use`

# Information to attach

 - [x] Any relevant kernel output (`dmesg`): none
 - [ ] Container log (`incus info NAME --show-log`)
 - [ ] Container configuration (`incus config show NAME --expanded`)
 - [x] Main daemon log (at `/var/log/incus/incusd.log`)

This is the only entry around that time:

```
time="2024-07-11T00:44:15Z" level=error msg="Failed to cleanly stop instance" err="Failed to start device \"eth0\": Failed adding link: Failed to run: ip link add name mac450200c0 link br0 address 00:16:3e:24:a3:7a allmulticast on up type macvtap mode bridge: exit status 2 (RTNETLINK answers: Address already in use)" instance=someinstance instanceType=virtual-machine project=someproject
```


 - [ ] Output of the client with --debug
 - [ ] Output of the daemon with --debug (alternatively output of `incus monitor --pretty` while reproducing the issue)

stgraber commented 3 months ago

Iterating over all the host interfaces to clean up potential conflicts shouldn't be needed, and it may actually be dangerous: in some environments it's perfectly valid to have the same MAC on multiple interfaces, and arbitrarily deleting them could cause a whole bunch of issues.

I spent around 30 minutes trying to reproduce the issue you're describing, both by killing QEMU to simulate a hard crash and by triggering reboots from within a VM (which seems to be the trigger for you), but I never managed to make the issue happen here, so we're going to need some kind of somewhat reliable reproducer.

Looking at the macvlan NIC cleanup logic, I'm not seeing anything wrong in there. As soon as the VM comes down, it triggers the onStop action, which then iterates over all the devices on the instance and calls their Stop command. In the macvlan case, this returns a function that deletes the host device. I also did a test build here to make sure that code path is properly hit during an instance-initiated reboot, and it was.

stgraber commented 3 months ago

If you can reproduce this somewhat reliably with a VM, it'd be good to run `incus monitor --pretty` on the system the VM is running on, then reboot the VM and watch it hit the issue. That should give us a better trace of all the calls being made.
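A minimal capture session along those lines, assuming the instance is named `someinstance` (placeholder) and triggering the reboot from the host for convenience:

```sh
# Terminal 1, on the Incus node hosting the VM: stream daemon events.
incus monitor --pretty

# Terminal 2: trigger the reboot that reproduces the failure
# (or reboot from inside the guest, as in the original reports).
incus restart someinstance
```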

Having the full `incus config show NAME --expanded` output for an affected VM would also help, as it's certainly possible that other devices or configuration are impacting this.

markrattray commented 3 months ago

Good morning. Sorry, I had a few emergencies so I've been away. Thank you for your efforts in checking all this out.

Unfortunately it's a bit random, and I've been regularly rebooting VMs based on the same image. The problematic ones did carry a lot more workload than the ones I was rebooting. I'm working this Sunday, so I'll see if I can reproduce the scenario again.

It might have something to do with the network setup on these hosts. OVN wanted a dedicated NIC or a bridge, so to test OVN I deployed a bridge and then OVN on a single NIC, but we're still using macvlan NICs for instances due to a routing issue to/from external networks and routed OVN networks.

stgraber commented 3 weeks ago

@markrattray did you have any luck reproducing this somewhat reliably?

markrattray commented 3 weeks ago

Good morning.

I'll close this now because it hasn't happened in a while. It might have had something to do with the post-cluster-upgrade issue that you fixed for us, where we had FQDN-to-localhost entries in the hosts file, which caused issues on this cluster.

markrattray commented 3 weeks ago

Issue has not reoccurred in a while.