lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

hugetlb cgroup controller does not set `rsvd`, leading to segfaults in postgres + initdb #769

Closed bazaah closed 5 months ago

bazaah commented 5 months ago

Required information

Incus Info ``` config: core.https_address: api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - network_sriov - console - restrict_dev_incus - migration_pre_copy - infiniband - dev_incus_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - dev_incus_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - backup_compression - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - 
vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - images_all_projects - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - 
instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total - auth_user - security_csm - instances_rebuild - numa_cpu_placement - custom_volume_iso - network_allocations - zfs_delegate - storage_api_remote_volume_snapshot_copy - operations_get_query_all_projects - metadata_configuration - syslog_socket - event_lifecycle_name_and_project - instances_nic_limits_priority - disk_initial_volume_configuration - operation_wait - image_restriction_privileged - cluster_internal_custom_volume_copy - disk_io_bus - storage_cephfs_create_missing - instance_move_config - ovn_ssl_config - certificate_description - disk_io_bus_virtio_blk - loki_config_instance - instance_create_start - clustering_evacuation_stop_options - boot_host_shutdown_action - agent_config_drive - network_state_ovn_lr - image_template_permissions - storage_bucket_backup - storage_lvm_cluster - shared_custom_block_volumes api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls auth_user_name: root auth_user_method: unix environment: addresses: - architectures: - x86_64 - i686 certificate: certificate_fingerprint: driver: lxc | qemu driver_version: 5.0.3 | 8.2.2 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" uevent_injection: "true" unpriv_binfmt: "true" unpriv_fscaps: "true" kernel_version: 6.8.1-arch1-1 lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Arch Linux os_version: "" project: default server: incus server_clustered: false server_event_mode: full-mesh server_name: server_pid: server_version: "0.6" storage: ceph storage_version: 18.2.2 storage_supported_drivers: - name: btrfs version: 6.7.1 remote: false - name: ceph version: 18.2.2 remote: true - name: cephfs version: 18.2.2 remote: true - name: cephobject version: 18.2.2 remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.23(2) (2023-11-21) / 1.02.197 (2023-11-21) / 4.48.0 remote: false ```

Issue description

While attempting to get hugepages working for a postgres database in an unprivileged container, I encountered repeated segfaults during the initdb sequence.

This was somewhat confusing to me, because by default postgres/initdb will attempt to use hugepages but gracefully fall back to normal memory if they are unavailable. So clearly postgres had been sufficiently induced to believe that hugepages did exist, but when it went to use them the host kernel killed the process.
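For context, the fallback behaviour comes from postgres's `huge_pages` setting, which defaults to `try` (use hugepages if the allocation succeeds, otherwise silently fall back to regular shared memory). The default is visible in the sample config shipped with the Debian package (same path as in the log further down):

```
# huge_pages defaults to 'try': attempt huge pages, fall back silently on failure
grep huge_pages /usr/share/postgresql/15/postgresql.conf.sample
# -> #huge_pages = try    (on, off, or try)
```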

Sometime later this evening I think I have it figured out.

At the end of the repro, you'll be greeted with an error like:

running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG: invoking IpcMemoryCreate(size=3891200)
Bus error (core dumped)
child process exited with exit code 135

Googling around this error brings up lots of related issues, particularly around Kubernetes deployments.

However, eventually you'll find https://github.com/opencontainers/runtime-spec/issues/1050 which explains the problem:

The previous non-rsvd max/limit_in_bytes does not account for reserved huge page memory, making it possible for a process to reserve all the huge page memory without being able to allocate it (due to cgroup restrictions).

In practice this makes it possible to successfully mmap more huge page memory than allowed via the cgroup settings, but when using the memory the process will get a SIGBUS and crash. This is bad for applications that mmap huge page memory at startup: the mmap succeeds, but the program crashes later when it starts to use the memory, e.g. postgres does this by default.

This was fixed in runc in https://github.com/opencontainers/runc/pull/4073.

I'm not sure how exactly this translates to incus's codebase, but from what little digging I've done around the hugetlb controller, I can find no mention of the `hugetlb.<pagesize>.rsvd.*` cgroup limits being set, only the older `hugetlb.<pagesize>.limit_in_bytes`.
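For reference (going by the kernel cgroup documentation rather than anything incus-specific), each hugepage size exposes a reservation limit alongside the allocation limit, and both would need to be set for mmap-time reservations to be accounted. A rough sketch of what that would look like on this host; the cgroup path is a guess, and the byte value assumes the intent of the config is 512 x 2MiB pages:

```
# cgroup v2 file names (Documentation/admin-guide/cgroup-v2.rst):
#   hugetlb.<size>.max        - allocation limit
#   hugetlb.<size>.rsvd.max   - reservation limit (covers mmap/MAP_HUGETLB reservations)
# cgroup v1 equivalents:
#   hugetlb.<size>.limit_in_bytes
#   hugetlb.<size>.rsvd.limit_in_bytes

CG=/sys/fs/cgroup/lxc.payload.hugepages-demo   # path is a guess, adjust to your layout
LIMIT=$((512 * 2 * 1024 * 1024))               # assumption: 512 x 2MiB pages
echo "$LIMIT" > "$CG/hugetlb.2MB.max"
echo "$LIMIT" > "$CG/hugetlb.2MB.rsvd.max"
```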

Steps to reproduce

# == On the host ==

# Ensure hugepages support is enabled & we have some allocated on the system
ls -l /dev/hugepages
sysctl -w vm.nr_hugepages=1024
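# (Optional check, not part of the original repro: confirm the pages were actually reserved)
grep HugePages_ /proc/meminfo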

# Make a debian container for the demo, with settings that _should_ allow hugepage support
incus init images:debian/bookworm hugepages-demo
incus config set hugepages-demo limits.hugepages.2MB=512 security.syscalls.intercept.mount=true security.syscalls.intercept.mount.allowed=hugetlbfs
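# (Optional: confirm the hugepage limit and mount interception keys were applied)
incus config show hugepages-demo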
incus start hugepages-demo
incus exec -t hugepages-demo -- su -

# == In the container ==

# Install postgres, ignore the default db that debian happily creates for you, though do note the initdb core dumps...
apt update && apt install -y eatmydata && eatmydata -- apt install -y postgresql postgresql-contrib

# Get /dev/hugepages mounted inside the container, i.e. verify that the mount interception configured above works
sed '/ConditionVirtualization/d' /usr/lib/systemd/system/dev-hugepages.mount > /etc/systemd/system/dev-hugepages.mount
systemctl daemon-reload && systemctl start dev-hugepages.mount && ls -lash /dev/hugepages

# Now for the failure.
#
# We rerun the initdb that debian tried previously, but with debugging turned on
pg_createcluster 15 main -- --debug

# Will print something like:
#
# running bootstrap script ... 2024-04-18 00:00:00.000 UTC [1111] DEBUG:  invoking IpcMemoryCreate(size=3891200)
# Bus error (core dumped)
# child process exited with exit code 135
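For anyone else poking at this, the limits the container actually ended up with can be read back from the host; if the analysis above is right, only the non-rsvd limit will be populated while the rsvd one stays unlimited. The cgroup path here is a guess based on the usual LXC payload layout:

```
CG=/sys/fs/cgroup/lxc.payload.hugepages-demo   # path is a guess, adjust to your layout
cat "$CG/hugetlb.2MB.max"        # the allocation limit incus configured
cat "$CG/hugetlb.2MB.rsvd.max"   # if the analysis above is right, still "max" (unlimited)
```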

Information to attach

pg_createcluster log ``` root@hugepages-demo:~# pg_createcluster 15 main -- --debug Creating new PostgreSQL cluster 15/main ... /usr/lib/postgresql/15/bin/initdb -D /var/lib/postgresql/15/main --auth-local peer --auth-host scram-sha-256 --no-instructions --debug Running in debug mode. The files belonging to this database system will be owned by user "postgres". This user must also own the server process. VERSION=15.6 (Debian 15.6-0+deb12u1) PGDATA=/var/lib/postgresql/15/main share_path=/usr/share/postgresql/15 PGPATH=/usr/lib/postgresql/15/bin POSTGRES_SUPERUSERNAME=postgres POSTGRES_BKI=/usr/share/postgresql/15/postgres.bki POSTGRESQL_CONF_SAMPLE=/usr/share/postgresql/15/postgresql.conf.sample PG_HBA_SAMPLE=/usr/share/postgresql/15/pg_hba.conf.sample PG_IDENT_SAMPLE=/usr/share/postgresql/15/pg_ident.conf.sample The database cluster will be initialized with locale "en_US.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. fixing permissions on existing directory /var/lib/postgresql/15/main ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... 20 selecting default shared_buffers ... 400kB selecting default time zone ... Etc/UTC creating configuration files ... ok running bootstrap script ... 2024-04-18 00:41:23.828 UTC [4400] DEBUG: invoking IpcMemoryCreate(size=3891200) Bus error (core dumped) child process exited with exit code 135 initdb: removing contents of data directory "/var/lib/postgresql/15/main" Error: initdb failed ```

Side note, congrats on the first stable release of incus. I was very happy to see the project back in the hands of linuxcontainers after the Canonical announcement.

stgraber commented 5 months ago

Is your system running cgroup1? (you can show the output of `ls -lh /sys/fs/cgroup` if unsure)

bazaah commented 5 months ago

cgroup2: /sys/fs/cgroup/cgroup.controllers exists

Edit: from mount:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

stgraber commented 5 months ago

Cool, thanks. I'll try to look into this one tomorrow or Friday, looks pretty easy to sort out based on the runc change.

stgraber commented 5 months ago

Got the issue reproduced, I'll try a quick fix now but this may get postponed for a week or so as I'm about to leave on a trip :)

bazaah commented 5 months ago

Thanks for the fast turnaround, I appreciate it.