lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Improve handling and documentation of auto-healing #1032

Closed by bughunter2 1 month ago

bughunter2 commented 1 month ago

Required information

incus info
config:
  cluster.healing_threshold: "30"
  cluster.https_address: incus-n1:8443
  core.https_address: incus-n1:8443
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses:
  - incus-n1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB+zCCAYGgAwIBAgIRAPa0HosBkerO9ODH9mebPQQwCgYIKoZIzj0EAwMwMDEZ
    MBcGA1UEChMQTGludXggQ29udGFpbmVyczETMBEGA1UEAwwKcm9vdEB2bWRlYjAe
    Fw0yNDA3MTUxODEyMDJaFw0zNDA3MTMxODEyMDJaMDAxGTAXBgNVBAoTEExpbnV4
    IENvbnRhaW5lcnMxEzARBgNVBAMMCnJvb3RAdm1kZWIwdjAQBgcqhkjOPQIBBgUr
    gQQAIgNiAARMTBgHeffaqSsjRtuM0UxfAUjBBIWSYG5MEf97/KXStl/pvQqM5QhD
    9+nKbcOPEfmkByMKs6TdjVu4WVFatOqitcX7tjSFoTVBidZd+zBX0OfMIdiR41Pu
    sFH3liD9vRejXzBdMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEFBQcD
    ATAMBgNVHRMBAf8EAjAAMCgGA1UdEQQhMB+CBXZtZGVihwR/AAABhxAAAAAAAAAA
    AAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQCWUy1S/HvkdA96CzHIMeKA7arl
    I4Ia3udfdBLGthkGNiFqJHfsV0iMik7mFWppMR8CMB1za36V5p7WgdgafYzXzc9T
    yGO7C6bIFm0Uzr9rhSNzg4gTVj4r74rH5vpxpBEAcQ==
    -----END CERTIFICATE-----
  certificate_fingerprint: b7e78011783c5586e22384d3b08e825d0b062015ea86c9b88400e2b80497e0b2
  driver: lxc
  driver_version: 5.0.2
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "false"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.1.0-20-amd64
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Debian GNU/Linux
  os_version: "12"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: incus-n1
  server_pid: 1762
  server_version: 6.0.1
  storage: dir
  storage_version: "1"
  storage_supported_drivers:
  - name: ceph
    version: 16.2.11
    remote: true
  - name: cephfs
    version: 16.2.11
    remote: true
  - name: cephobject
    version: 16.2.11
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.47.0
    remote: false
  - name: lvmcluster
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.47.0
    remote: true
  - name: btrfs
    version: "6.2"
    remote: false

Issue description

Filesystem corruption occurs in the following case: if an Incus cluster member becomes unreachable (e.g., due to a network partition), automatic evacuation may kick in (if enabled). However, once that member reconnects to the other members, filesystem corruption can occur on any container that was running on it. While those containers have indeed been migrated thanks to the automatic evacuation, they are also still running (invisibly) on the member that became unreachable, and that is one of the causes of the corruption. Some filesystem corruption may be unavoidable, simply because the containers were running when they lost their network connection; that's acceptable. However, further corruption occurs because the containers actually keep running (invisibly) on the disconnected member. Further details below.

Steps to reproduce

This is an example I ran in QEMU/KVM.

The context: an Incus cluster with 3 nodes and also a Ceph cluster on the same 3 nodes.

The cluster members are called incus-n1, incus-n2 and incus-n3.

Steps to trigger filesystem corruption:

Automatic evacuation is enabled:

incus config set cluster.healing_threshold 30
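
To confirm the value took effect cluster-wide, it can simply be read back (a quick sanity check, not part of the original steps):

incus config get cluster.healing_threshold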

Now disconnect incus-n1 from the network.

Wait until automatic evacuation happens.
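
The evacuation can be observed from one of the remaining members; using the member and container names from this report, something along these lines should show it (exact column output depends on the Incus version):

incus cluster list    # incus-n1 should eventually show as OFFLINE
incus list debian0    # the LOCATION column should move to a healthy member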

Make some changes in the automatically migrated container (debian0 in this example), such as writing something to /root/.bashrc. In my case, after doing that, I ran sync to flush the I/O buffers to disk.
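
In this run that amounted to something like the following (the exact file and content are incidental):

incus exec debian0 -- sh -c 'echo "# evacuation test" >> /root/.bashrc && sync'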

Connect incus-n1 again to the network.

Even though incus-n1 agrees with the cluster's state (that debian0 now really runs on incus-n2), there is still an invisible container running on incus-n1: the original debian0 from before the automatic migration took place.

In the Incus cluster, stop the real debian0 container, the one that runs on incus-n2. Issuing that stop command does stop the container on incus-n2, as expected, and Incus then reports the container's state as STOPPED. However, on incus-n1 the old copy is still running (visible via ps auxfww), even though Incus doesn't report it. I don't expect Incus to see that container, since it has already been migrated, but this situation can cause filesystem corruption when distributed storage such as Ceph is used.
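
Concretely, that step looked roughly like this (run from any reachable cluster member):

incus stop debian0    # stops the evacuated copy on incus-n2
incus list debian0    # now reports the instance as STOPPED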

(If you're lucky it might not cause corruption, but it seems more likely that it will.)

In the output of ps auxfww we can see that the container is indeed still running; this is what I referred to above as the invisible container.

root@incus-n1:~# ps auxfww | grep 'lxc monitor' -A40
root        3802  0.0  0.1   6332  2068 pts/3    S+   22:43   0:00                      \_ grep --color=auto lxc monitor -A40
root        1762  0.7  3.9 6178320 78780 ?       Ssl  21:47   0:24 /usr/libexec/incus/incusd --group incus-admin --logfile=/var/log/incus/incus.log
root        3078  0.0  0.6 5777728 12784 ?       Ss   22:10   0:00 [lxc monitor] /var/lib/incus/containers debian0
1000000     3088  0.0  0.4  99776  9720 ?        Ss   22:10   0:00  \_ /sbin/init
1000000     3216  0.0  0.6  32004 12760 ?        Ss   22:10   0:00      \_ /lib/systemd/systemd-journald
1000000     3228  0.0  0.1  20612  4004 ?        Ss   22:10   0:00      \_ /lib/systemd/systemd-udevd
1000103     3230  0.0  0.1   8272  3840 ?        Ss   22:10   0:00      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000     3233  0.0  0.2  13400  5572 ?        Ss   22:10   0:00      \_ /lib/systemd/systemd-logind
1000101     3234  0.0  0.3  16048  7092 ?        Ss   22:10   0:00      \_ /lib/systemd/systemd-networkd
1000102     3237  0.0  0.4  21204  8220 ?        Ss   22:10   0:00      \_ /lib/systemd/systemd-resolved
1000000     3239  0.0  0.0   5476  1596 pts/0    Ss+  22:10   0:00      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 linux

Unfortunately, this can cause filesystem corruption because we use Ceph: the Ceph RBD was still mapped and mounted on incus-n1, and the invisible container kept writing to it once incus-n1 was reconnected to the network.

Running e2fsck reveals the container filesystem corruption.

First, I made sure to reboot the Incus member on which the invisible container was running. I also made sure the container wasn't running anywhere.

Then I mapped the Ceph RBD and ran e2fsck, like so:

root@incus-n2:~# rbd ls incus-ceph-rbd
container_debian0
image_769ab88758a08d25b7f0657d399d962cc94033dee30e240241438013ca1b0ef4_ext4
incus_incus-ceph-rbd
root@incus-n2:~# rbd map incus-ceph-rbd/container_debian0
/dev/rbd0
root@incus-n2:~# e2fsck /dev/rbd0
e2fsck 1.47.0 (5-Feb-2023)
/dev/rbd0: clean, 12735/655360 files, 176285/2621440 blocks
root@incus-n2:~# e2fsck -f /dev/rbd0
e2fsck 1.47.0 (5-Feb-2023)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry '.bashrc' in /rootfs/root (131400) has deleted/unused inode 143797.  Clear<y>? no
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 131401 ref count is 1, should be 2.  Fix<y>? no
Unattached inode 143790
Connect to /lost+found<y>? no
Unattached inode 143791
Connect to /lost+found<y>? no
Pass 5: Checking group summary information
Block bitmap differences:  +533761 -533767
Fix<y>? no

/dev/rbd0: ********** WARNING: Filesystem still has errors **********

/dev/rbd0: 12735/655360 files (0.1% non-contiguous), 176285/2621440 blocks

Although I think this is worth reporting, I'm not sure what Incus can do about it. Note that the above is only a problem when remote storage is used: if the container had been using local storage, there would not have been any filesystem corruption. For Incus this means the case only needs to be handled when the server is part of a cluster and the container uses remote storage such as Ceph.

What could the incus-n1 node have done? That's the node we disconnected from the network to simulate a real-world problem. To complicate matters further, one or more Incus nodes may lose contact with each other while the Ceph nodes do not, for whatever reason (for example because the Ceph nodes are located elsewhere instead of running on the same machines as Incus). From the standpoint of container/VM availability it would be desirable for the instances to keep running, but that might mean Incus can never handle this situation in a way that prevents filesystem corruption when automatic evacuation is enabled.

So I guess this boils down to the CAP theorem and the CP vs. AP choice. I'd like to avoid filesystem corruption and hence favor consistency.

If Incus wants to handle this, it might mean that incus-n1 has to decide that it is the problematic node (even though it can't know for sure) and hope that the other nodes are still in quorum. incus-n1 could then decide to forcefully stop the container, preventing future writes in case it re-establishes contact with the other nodes. This would prevent the above-mentioned filesystem corruption.

To emphasize: some filesystem corruption is understandable, since running containers suddenly lose their connection to the remote storage provider (Ceph). However, further corruption occurs once the member is reconnected to the network, because the container keeps running (invisibly) on that member even though it has already been migrated by automatic evacuation.

Just thinking out loud here. I don't have a solution, per se.

One question is how to detect the situation. The other question is how to handle it.
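
As a very rough illustration of one possible answer to both questions, here is a minimal shell sketch that could run from a cron job or systemd timer on each member. The peer names, the grace period and the poweroff action are assumptions for illustration; nothing like this ships with Incus today, and a partitioned member's local Incus API may itself be unusable, which is why the sketch acts at the OS level rather than via incus commands.

#!/bin/sh
# Hypothetical self-fencing check for an Incus cluster member (sketch only).
PEERS="incus-n2 incus-n3"   # the other cluster members (assumed names)
GRACE=30                    # roughly cluster.healing_threshold, in seconds

any_peer_reachable() {
    for peer in $PEERS; do
        ping -c 1 -W 2 "$peer" >/dev/null 2>&1 && return 0
    done
    return 1
}

if ! any_peer_reachable; then
    sleep "$GRACE"
    if ! any_peer_reachable; then
        # Assume the rest of the cluster has (or soon will have) evacuated our
        # instances; fence ourselves so the local copies cannot keep writing
        # to the shared Ceph storage once the network comes back.
        systemctl poweroff --force
    fi
fi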


stgraber commented 1 month ago

It's something I've been thinking about a bit these past few days actually. Sadly, it's also not something we can do too much about.

I see a few things that we should do to improve things a bit though:

The ICMP check should help with false positives, and the last item will make it easier for someone to implement a STONITH-type mechanism around it. Basically, you'd have a small daemon running on a management system, monitoring lifecycle events coming from your cluster. When a server is marked as defective and auto-healing is triggered, that daemon connects to the BMC or PDU and cuts power to the dead server.

It's essentially the only way to handle this, as an even partly disconnected server will not be able to kill off the running containers/VMs without either causing immediate writes (if storage is still somehow available) or hanging until storage becomes available again and then causing writes at that point.

The only way to prevent any concurrent writes, and to limit the damage to whatever data hadn't been written yet, is to cut power to the machine.
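
For reference, a minimal sketch of what such a watcher could look like, assuming a single server with a known BMC, assuming incus monitor's JSON output streams one event per line, and assuming auto-healing emits a lifecycle event (the event name "cluster-member-healed" below is hypothetical; check what your Incus version actually emits):

#!/bin/sh
# Hypothetical STONITH watcher (sketch only): reacts to cluster lifecycle
# events by cutting power to the server that was just auto-healed.
# The event name and the BMC details below are assumptions, not Incus facts.
BMC_HOST=bmc.example.net
BMC_USER=admin
BMC_PASS=secret

incus monitor --type=lifecycle --format=json | while read -r event; do
    action=$(echo "$event" | jq -r '.metadata.action')
    if [ "$action" = "cluster-member-healed" ]; then
        member=$(echo "$event" | jq -r '.metadata.source')
        echo "Auto-healing triggered for $member, powering the server off"
        ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -P "$BMC_PASS" \
            chassis power off
    fi
done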

bughunter2 commented 1 month ago

Another possible problem is IP conflicts when containers/VMs use static IP addressing and they keep running (invisibly) on the original Incus member after automatic evacuation has taken place.

As for filesystem corruption: it may be useful for users to know that it can be avoided by using local storage and instance backups instead (backups should always be made regardless), although this of course rules out automatic evacuation entirely. It's a trade-off, and the choice will have to be evaluated per cluster environment.
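
For example, a simple scheduled export of each instance to a tarball could look roughly like this (the path and scheduling are up to the operator; the instance name is taken from this report):

incus export debian0 /var/backups/debian0-$(date +%F).tar.gz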