lxc/incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Add support for DHCP renewals for OCI containers #1106

Closed · moritzruth closed this 3 days ago

moritzruth commented 2 months ago

Required information

Output of `incus info`:

```yaml
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
- disk_io_bus_cache_filesystem
- instance_oci
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: admin
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: [omitted]
  certificate_fingerprint: [omitted]
  driver: lxc
  driver_version: 5.0.1
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.6.44_1
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Void
  os_version: ""
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: [omitted]
  server_pid: 20979
  server_version: "6.3"
  storage: lvm
  storage_version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0
  storage_supported_drivers:
  - name: btrfs
    version: 6.9.2
    remote: false
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0
    remote: true
```

Issue description

OCI containers in bridge networks don't seem to renew their IPv4 DHCP leases. I noticed this because their records also vanish from the DNS server.

After (re)starting the containers, they are assigned an IPv4 and an IPv6 address and show up in `incus network list-leases <network_name>`. But after some time (probably the DHCP lease expiry, so 1h by default), their hostnames fail to resolve and their IPv4 leases vanish from the command output.

Steps to reproduce

  1. Create an OCI container (for example caddy, but the issue seems to occur with any image) and assign it to a bridge network.
  2. Try to resolve its hostname from a different instance in the network. It works.
  3. Wait for some time (probably the DHCP lease expiry, so 1h by default).
  4. Try to resolve its hostname again from a different instance in the network. It does not work (a resolution check is sketched after this list).
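
For illustration, the checks in steps 2 and 4 might look like the following (a hedged sketch: the bridge name `incusbr0`, the instance names `caddy` and `client`, and the default `.incus` DNS domain are all assumptions based on common defaults):

```sh
# Assumed names: bridge "incusbr0", OCI container "caddy", second
# instance "client"; the bridge's dnsmasq DNS domain is assumed to
# be the default ".incus".

# Before the lease expires, the IPv4 lease is listed:
incus network list-leases incusbr0

# And the hostname resolves from another instance on the same bridge
# (getent is assumed to be available in that instance's image):
incus exec client -- getent hosts caddy.incus
```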

Possible Workaround

Set the `ipv4.dhcp.expiry` config option of the network to a high value like `8765h`.
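
As a sketch (assuming the bridge network is named `incusbr0`; adjust to your network name):

```sh
# Raise the DHCPv4 lease time on the bridge so leases effectively
# never expire in normal operation (roughly one year here).
incus network set incusbr0 ipv4.dhcp.expiry=8765h
```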

stgraber commented 2 months ago

Yeah, that's a known limitation of the current approach to DHCP. Unlike regular containers, application containers don't perform their own network configuration and so can't run a regular DHCP client.

Incus performs an initial DHCP handshake on startup through a pre-start hook, but that's just a one-time action.

It shouldn't be too difficult to have that process stay around in the background, but there are security concerns that would need to be considered at that stage.
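
Conceptually, the long-running variant described here amounts to keeping a DHCP client alive in the container's network namespace from the host side. A rough illustration of the idea only, not Incus's actual implementation, assuming BusyBox `udhcpc` is available and `$PID` holds the container's init PID:

```sh
# Rough illustration, not how Incus implements its pre-start hook:
# run a DHCP client inside the container's network namespace from the
# host, so the lease keeps being renewed for as long as it runs.
nsenter --net=/proc/$PID/ns/net udhcpc -i eth0 -f
```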

maveonair commented 2 months ago

Another problem is that even if you set a static IP address for an OCI container, the DNS entry "expires" after a certain time.

The workaround for this is to simply use the configured static IP address instead of the DNS name when communicating from another system or application container to an application container.
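
For example (hedged; `web`, `eth0`, and the address are placeholder names, and `incus config device override` is used in case the NIC is inherited from a profile):

```sh
# Pin a static address for the container's bridged NIC; "web", "eth0"
# and 10.158.97.10 are hypothetical placeholders.
incus config device override web eth0 ipv4.address=10.158.97.10
```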

stgraber commented 2 months ago

Yeah, the DNS records are likely to be very useful for OCI containers, so definitely something we want to fix :)

dwlfrth commented 2 months ago

I do not want to hijack this thread, but it would be awesome to loop in the following too :) https://discuss.linuxcontainers.org/t/running-oci-in-incus-system-container-network-configuration/21351/1

defect-track commented 4 days ago

Would really appreciate it if this could be added in one of the next minor releases.

Just had a lot of fun reorganizing IPs after a major system reboot. Having the IP stick with the container would be a start, but having DNS working would be even better.

Interestingly enough, it seems IPv6 is not really affected? According to `incus network list-allocations`, the IPv6 addresses are still listed whereas the IPv4 ones are gone.

cyphar commented 4 days ago

@defect-track

By default, Incus containers use SLAAC to configure their own IPv6 addresses, so they don't need DHCPv6 and leases aren't an issue AFAICS (dnsmasq sends router advertisements, so the kernel automatically configures IPv6 addresses even without a DHCP client IIUC; `net.ipv6.conf.*.accept_ra` is enabled by default).
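
(As a quick check of that last point, running something like the following inside the container should report `1` under common defaults; `eth0` is an assumed interface name:)

```sh
# Check whether router advertisements are accepted on the container's
# interface; 1 means the kernel will configure addresses via SLAAC.
sysctl net.ipv6.conf.eth0.accept_ra
```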

Every time I've had DHCP issues with my containers (even proper system containers), they still had IPv6 addresses (though this has led to me not noticing a network configuration issue, because my home connection has IPv6 so I can access the services, while folks on non-IPv6 networks can't).

defect-track commented 3 days ago

Thanks for considering adding this feature @stgraber!

Really appreciate it.