lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

If image auto-update fails due to no space, it stays and corrupts new containers #404

Closed andrey-utkin closed 8 months ago

andrey-utkin commented 8 months ago

Required information

incus info ``` config: core.https_address: '[::]:8443' api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - network_sriov - console - restrict_dev_incus - migration_pre_copy - infiniband - dev_incus_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - dev_incus_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - 
network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - backup_compression - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - 
container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - 
custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - 
ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total - auth_user - security_csm - instances_rebuild - numa_cpu_placement - custom_volume_iso - network_allocations - zfs_delegate - storage_api_remote_volume_snapshot_copy - operations_get_query_all_projects - metadata_configuration - syslog_socket - event_lifecycle_name_and_project - instances_nic_limits_priority - disk_initial_volume_configuration - operation_wait - image_restriction_privileged - cluster_internal_custom_volume_copy - disk_io_bus - storage_cephfs_create_missing - instance_move_config - ovn_ssl_config - certificate_description api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls auth_user_name: bluecherryteam auth_user_method: unix environment: addresses: - 192.168.86.151:8443 - 10.224.252.1:8443 - '[fd42:a96a:f32e:f14a::1]:8443' - 10.181.144.1:8443 - '[fd42:a046:4d35:f6dd::1]:8443' architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIICDzCCAZWgAwIBAgIQXVoaCx77/vGcN5L2q6gljzAKBggqhkjOPQQDAzA3MRkw FwYDVQQKExBMaW51eCBDb250YWluZXJzMRowGAYDVQQDDBFyb290QGZvY2FsLTEt Ni0yNDAeFw0yNDAxMDYyMDUzNTBaFw0zNDAxMDMyMDUzNTBaMDcxGTAXBgNVBAoT EExpbnV4IENvbnRhaW5lcnMxGjAYBgNVBAMMEXJvb3RAZm9jYWwtMS02LTI0MHYw EAYHKoZIzj0CAQYFK4EEACIDYgAEI0LvCfJq47k1Jov/I7n+yXF9UqUtEFn2YNmA 0vpKE6Kgeon4zhQ1WLm1x2iz6yaWitnVdj/hTwK+FzQKZVNFDiW6ectxZxlMbyT+ 7+BePUedxm3XT+/2VsJeivWyU3wao2YwZDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0l BAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAvBgNVHREEKDAmggxmb2NhbC0x LTYtMjSHBH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIw JFO7HPjo/RojG0vpv7C7UQGjw7X1m6vHpQa+aw+kR5zSDgv0qGxf09HBhmW7SDfk AjEArfaeKLzkqgwMQluRvLGeeQewxpBR7tuM/EC1WquYozt6jf/s1hRYN3Dja/+w Uyfs -----END CERTIFICATE----- certificate_fingerprint: e9c32ebfd473892cb5728b977929f8b45ddcd0adac7457685ee6098baa5af826 driver: lxc | qemu driver_version: 5.0.3 | 8.1.3 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" 
seccomp_listener: "true" seccomp_listener_continue: "true" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.15.0-91-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: incus server_clustered: false server_event_mode: full-mesh server_name: incus server_pid: 866 server_version: "0.4" storage: btrfs storage_version: 5.16.2 storage_supported_drivers: - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0 remote: false - name: btrfs version: 5.16.2 remote: false ```

Issue description

When an image auto-update operation fills the storage pool, the resulting inconsistent (partially unpacked) image is kept and is then used for new containers. In my case this made containers fail to start, but the bug can also show up in less visible ways.
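For illustration only, here is a minimal sketch of a generic "unpack into a staging directory, then rename" pattern that prevents a half-written image directory from ever becoming visible. It is not how Incus implements image unpacking; the function name `unpack_atomically` and the staging-directory prefix are hypothetical, and the archive is assumed to be trusted.

```python
import os
import shutil
import tarfile
import tempfile


def unpack_atomically(archive_path, final_dir):
    """Unpack a trusted tar archive so final_dir only appears if unpacking succeeds.

    Hypothetical helper: final_dir must not exist yet, and its parent must be on
    the same filesystem as the staging directory so the rename stays atomic.
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Stage next to the destination; mkdtemp creates the directory with mode 0700.
    staging = tempfile.mkdtemp(prefix=".unpack-", dir=parent)
    try:
        with tarfile.open(archive_path) as tar:
            tar.extractall(staging)
        # Publish the result only after every member was written successfully.
        os.rename(staging, final_dir)
    except BaseException:
        # Any failure (for example "No space left on device") removes the partial tree.
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

With this approach a failed unpack raises an error and leaves nothing behind under the final path, so a later launch cannot pick up a truncated file.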

incusd.log:

time="2024-01-13T10:27:37Z" level=warning msg="Unpack failed" allowedCmds="[xz]" err="Failed to run: tar --wildcards --exclude=dev/* --exclude=./dev/* --exclude=rootfs/dev/* --exclude=rootfs/./dev/* --restrict --force-local -C /var/lib/incus/storage-pools/default/images/54df95801a0bdbdd981401884bbdec09f6b959170877df2e71f0677d3f220319 --numeric-owner --xattrs-include=* -Jxf -: exit status 2 (tar: metadata.yaml: Cannot write: No space left on device\ntar: templates/hostname.tpl: Cannot write: No space left on device\ntar: templates/hosts.tpl: Cannot write: No space left on device\ntar: Exiting with failure status due to previous errors)" extension=.tar.xz file=/var/lib/incus/images/54df95801a0bdbdd981401884bbdec09f6b959170877df2e71f0677d3f220319 path=/var/lib/incus/storage-pools/default/images/54df95801a0bdbdd981401884bbdec09f6b959170877df2e71f0677d3f220319
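To check a log for this failure, a small parser along the following lines may help. It is a sketch that assumes the `key=value` / `key="quoted value"` layout seen in the line above; the function names are hypothetical, and escape sequences inside quoted values are left as-is.

```python
import posixpath
import re

# A field is key=value, where the value is either a double-quoted string
# (backslash escapes allowed inside) or a run of non-space characters.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S*)')


def parse_logfmt(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in _FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes, keep escapes untouched
        fields[key] = value
    return fields


def failed_image_fingerprint(line):
    """Return the image fingerprint if the line is an out-of-space unpack failure."""
    fields = parse_logfmt(line)
    if fields.get("msg") != "Unpack failed":
        return None
    if "No space left on device" not in fields.get("err", ""):
        return None
    # In the log above, the last component of path is the image fingerprint.
    return posixpath.basename(fields.get("path", "")) or None
```

Feeding each line of incusd.log to `failed_image_fingerprint` yields the fingerprints of images whose unpack ran out of space.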

container-name/console.log:

/sbin/init: error while loading shared libraries: /lib/x86_64-linux-gnu/libseccomp.so.2: file too short

journalctl -u incus:

level=error msg="Failed to retrieve PID of executing child process" instance=huge-cluster-server-1 instanceType=container project=default

Steps to reproduce

  1. Pull an image from a public repository and make sure it is configured to be auto-updated
  2. Fill the storage pool until it is almost full
  3. Let the image auto-update happen (e.g. wait; I don't know how to trigger it)
  4. Try to incus launch a container from the previously downloaded image that has auto-update on
  5. Observe the container failing to start

Workaround

Manually delete all images from the storage pool: incus image list, then incus image delete ...
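The workaround could be scripted roughly as follows. This is a sketch under assumptions: it assumes that `incus image list` accepts `--format csv -c f` to print one short fingerprint per line, and that `incus image delete` accepts such a fingerprint. The helper names are hypothetical, and the default is a dry run that only prints the commands.

```python
import csv
import io
import subprocess


def fingerprints_from_csv(output):
    """Extract the first column (the fingerprint) from CSV image-list output."""
    return [row[0] for row in csv.reader(io.StringIO(output)) if row and row[0]]


def delete_commands(fingerprints):
    """Build one `incus image delete` command per fingerprint."""
    return [["incus", "image", "delete", fp] for fp in fingerprints]


def delete_all_images(dry_run=True):
    """List every cached image and delete it (or just print the commands)."""
    # Assumed flags: --format csv and -c f (fingerprint column only).
    listing = subprocess.run(
        ["incus", "image", "list", "--format", "csv", "-c", "f"],
        check=True, capture_output=True, text=True,
    ).stdout
    commands = delete_commands(fingerprints_from_csv(listing))
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `delete_all_images()` first and reviewing the printed commands before rerunning with `dry_run=False` avoids deleting images unintentionally.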


stgraber commented 8 months ago

What storage pool driver are you using?

andrey-utkin commented 8 months ago

btrfs