lxc / lxcfs

FUSE filesystem for LXC
https://linuxcontainers.org/lxcfs

arm64: /proc/cpuinfo doesn't honour personality inside LXD container #553

Closed: cjwatson closed this issue 1 year ago

cjwatson commented 2 years ago

Required information

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - aarch64
  - armv7l
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICHjCCAaSgAwIBAgIQP1pkHkT3fQQjhBX9XIa6rzAKBggqhkjOPQQDAzA9MRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMR0wGwYDVQQDDBRyb290QGFybTY0
    LWFybWhmLWx4YzAeFw0yMjA4MTExMDMyMDRaFw0zMjA4MDgxMDMyMDRaMD0xHDAa
    BgNVBAoTE2xpbnV4Y29udGFpbmVycy5vcmcxHTAbBgNVBAMMFHJvb3RAYXJtNjQt
    YXJtaGYtbHhjMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAESG9PtAYjDY/wMPc6bOdv
    9ZEMkiJLwPqmm7kmhDnXYzYChK5BIX98HjQgVc70NCxlcg6HkNK86naWuPAW4WTq
    NuZJOu4XEmt1+OF53GfeUVw61K5KWwjG/m2EWq5zXTIMo2kwZzAOBgNVHQ8BAf8E
    BAMCBaAwEwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAyBgNVHREE
    KzApgg9hcm02NC1hcm1oZi1seGOHBH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYI
    KoZIzj0EAwMDaAAwZQIxAIh5o3xZ+OO/uNfAuhQZQSsd40PWrLmr33XGo1q0l/1q
    Y3LvlqbCBWm0+dwevhQc6AIwZ/BpvLKHGKEAL3Wr0DwljDbt+DrP9xtS/HjI2fhv
    iqW/P9/C2w374/Y60VkFJAWE
    -----END CERTIFICATE-----
  certificate_fingerprint: 3b96071484ef9ffabacee84629347107fe4aec5753d1f1e0ebf31d02343a55b6
  driver: lxc
  driver_version: 4.0.12
  firewall: nftables
  kernel: Linux
  kernel_architecture: aarch64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-46-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "22.04"
  project: default
  server: lxd
  server_clustered: false
  server_event_mode: full-mesh
  server_name: arm64-armhf-lxc
  server_pid: 1487
  server_version: 5.0.0
  storage: dir
  storage_version: "1"
  storage_supported_drivers:
  - name: ceph
    version: 15.2.14
    remote: true
  - name: btrfs
    version: 5.4.1
    remote: false
  - name: cephfs
    version: 15.2.14
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.07(2) (2019-11-30) / 1.02.167 (2019-11-30) / 4.45.0
    remote: false
  - name: zfs
    version: 2.1.4-0ubuntu0.1
    remote: false

Issue description

We've been trying to work out why Rust-based snap builds for armhf hang on Launchpad's build farm, where they are executed in armhf containers via LXD on arm64 machines, with linux32 used to set the personality (although this may not be necessary when running in a 32-bit LXD container; I think LXD already handles that).

We seem to be running into something like https://github.com/rust-lang/rust/issues/60605, but it's a little weirder than that. rustup is only picking arm (i.e. ARMv6) because it gets confused about the processor's capabilities. rustup-init.sh has this code:

    # Detect armv7 but without the CPU features Rust needs in that build,
    # and fall back to arm.
    # See https://github.com/rust-lang/rustup.rs/issues/587.
    if [ "$_ostype" = "unknown-linux-gnueabihf" ] && [ "$_cputype" = armv7 ]; then
        if ensure grep '^Features' /proc/cpuinfo | grep -q -v neon; then
            # At least one processor does not have NEON.
            _cputype=arm
        fi
    fi

And we're seeing:

+ [ unknown-linux-gnueabihf = unknown-linux-gnueabihf ]
+ [ armv7 = armv7 ]
+ ensure grep ^Features /proc/cpuinfo
+ grep ^Features /proc/cpuinfo
+ grep -q -v neon
+ _cputype=arm

I tried to track this down in a less weird environment than a builder, launching an Ubuntu 22.04 arm64 machine as described in the lxc info output above. I got as far as this:

$ grep -m1 ^Features /proc/cpuinfo
Features        : fp asimd evtstrm cpuid
$ linux32 grep -m1 ^Features /proc/cpuinfo
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm
$ lxc launch ubuntu:bionic/armhf
Creating the instance
Instance name is: positive-hyena
Starting positive-hyena
$ lxc exec positive-hyena -- linux32 grep -m1 ^Features /proc/cpuinfo
Features        : fp asimd evtstrm cpuid

This seems pretty odd, but at this point I don't know where to look next. Is this an LXD bug for somehow failing to set up the environment correctly, or a kernel bug for getting confused by containerization and not noticing the personality change?

Steps to reproduce

lxc launch an armhf container on arm64, and run linux32 grep -m1 ^Features /proc/cpuinfo inside it.

xnox commented 2 years ago

Does your host have compat_uts_machine=armv7l set on the kernel command line? We do this in our LXD armhf instances on arm64 hosts on focal, because otherwise containers end up declaring the armv8-32 machine type, which nobody uses.

It would be interesting to see the output of uname -a from inside your container.

stgraber commented 2 years ago

@cjwatson can you try umount /proc/cpuinfo in the container?

stgraber commented 2 years ago

My current guess is that the kernel has made cpuinfo affinity-aware (urgh), but in our case lxcfs provides /proc/cpuinfo as a FUSE overlay inside the container (to filter the CPUs based on cgroups). LXCFS itself is an arm64 binary running on the host, so regardless of the personality of the caller, /proc/cpuinfo from the kernel will be accessed by an arm64 process.

If that's indeed the issue, we can move the bug over to lxcfs and see if there's some kind of way to:

  1. Determine the personality of the caller process (whatever opens /proc/cpuinfo in the container)
  2. Somehow trick the kernel into providing us the cpuinfo content for that personality rather than our own

I suspect that 1) should be easy enough to figure out through some proc file; 2) may be a bit more challenging though.
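
For step 1, a minimal C sketch of reading a caller's personality straight out of procfs might look like the following. This is an illustration only, not lxcfs code: the helper name proc_personality is made up here, and reading /proc/<pid>/personality needs ptrace-level access to the target, which lxcfs (running as root on the host) would have.

    /* Sketch: read the personality value the kernel reports for a given pid. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/personality.h>   /* PER_LINUX32, PER_MASK */

    static int proc_personality(pid_t pid, unsigned long *persona)
    {
        char path[64];
        FILE *f;
        int ret;

        snprintf(path, sizeof(path), "/proc/%d/personality", (int)pid);
        f = fopen(path, "re");
        if (!f)
            return -1;

        ret = fscanf(f, "%lx", persona);
        fclose(f);

        return ret == 1 ? 0 : -1;
    }

    int main(int argc, char **argv)
    {
        pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();
        unsigned long persona;

        if (proc_personality(pid, &persona) < 0) {
            perror("proc_personality");
            return 1;
        }

        printf("pid %d personality %08lx (%s)\n", (int)pid, persona,
               (persona & PER_MASK) == PER_LINUX32 ? "PER_LINUX32" : "not PER_LINUX32");
        return 0;
    }

Run under linux32 it should report PER_LINUX32 for its own pid; run natively it should not.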

cjwatson commented 2 years ago

@xnox My Canonistack test didn't have compat_uts_machine=armv7l on the command line, but Launchpad's arm64 builder VMs do. In a container on a builder, uname -a prints Linux flexible-bluejay 5.4.0-124-generic #140-Ubuntu SMP Thu Aug 4 02:27:01 UTC 2022 armv7l armv7l armv7l GNU/Linux.

@stgraber You're quite right: /proc/cpuinfo is mounted, and if I unmount it then I see the correct features.

xnox commented 2 years ago

  1. Determine the personality of the caller process (whatever opens /proc/cpuinfo in the container)

cat /proc/$PID/personality should give the value of the calling process.

  2. Somehow trick the kernel into providing us the cpuinfo content for that personality rather than our own

I think one can use the syscall previous = personality(PER_LINUX32); to switch to 32-bit, or use whatever value was read from the procfs personality file. Check that the return value is not negative, and restore the previous personality once done.
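
As a rough illustration of the above (not the lxcfs implementation), switching the calling thread to the 32-bit personality around the read and restoring it afterwards could look like this; dump_cpuinfo_as_32bit is a made-up name:

    /* Sketch: temporarily adopt PER_LINUX32, read /proc/cpuinfo (which on
     * arm64 should then show the compat/armhf feature list, as linux32 does
     * in the transcript above), then restore the previous personality. */
    #include <stdio.h>
    #include <sys/personality.h>

    static int dump_cpuinfo_as_32bit(void)
    {
        int previous, ret = -1;
        char line[512];
        FILE *f;

        previous = personality(PER_LINUX32);
        if (previous < 0)
            return -1;          /* could not switch personality */

        f = fopen("/proc/cpuinfo", "re");
        if (f) {
            while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
            fclose(f);
            ret = 0;
        }

        /* Put the previous personality back before returning. */
        personality((unsigned long)previous);

        return ret;
    }

    int main(void)
    {
        return dump_cpuinfo_as_32bit() == 0 ? 0 : 1;
    }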

stgraber commented 2 years ago

Moving over to LXCFS. It may take us a little while before we have manpower to put on this (we'll have a new hire on it, just not sure about start date yet).

Until then, I'd recommend unmounting /proc/cpuinfo in such environments. It will have the downside of possibly over-reporting the number of CPU cores available to some tools, but that's likely less problematic than the incorrect CPU flags.

cjwatson commented 2 years ago

@stgraber Thanks for the suggestion. I've proposed https://code.launchpad.net/~cjwatson/launchpad-buildd/+git/launchpad-buildd/+merge/428923 for that.

cjwatson commented 2 years ago

This is worked around on Launchpad production now.

mihalicyn commented 1 year ago

We can try to detect the caller process's pid on the FUSE daemon side, because we have the pid in the struct fuse_in_header, and then use it to obtain the personality of the caller.
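
For illustration only (this is not the lxcfs code, and caller_wants_32bit_cpuinfo is a made-up name): with the high-level libfuse API the same caller pid is available from fuse_get_context(), so the check could look roughly like this, reusing the hypothetical proc_personality helper sketched earlier.

    /* Sketch: fuse_get_context()->pid is the pid of the process that
     * triggered the current FUSE operation, i.e. the same value carried in
     * struct fuse_in_header at the low level.
     * Build as part of a FUSE filesystem with $(pkg-config --cflags --libs fuse). */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/types.h>
    #include <sys/personality.h>

    /* Hypothetical helper from the earlier sketch (reads /proc/<pid>/personality). */
    int proc_personality(pid_t pid, unsigned long *persona);

    /* Called from a read handler for /proc/cpuinfo: decide whether the
     * requesting process runs with the 32-bit personality. */
    static int caller_wants_32bit_cpuinfo(void)
    {
        struct fuse_context *ctx = fuse_get_context();
        unsigned long persona = 0;

        if (!ctx || proc_personality(ctx->pid, &persona) < 0)
            return 0;           /* fall back to the daemon's own view */

        return (persona & PER_MASK) == PER_LINUX32;
    }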

lanmarc77 commented 1 year ago

  @cjwatson can you try umount /proc/cpuinfo in the container?

I might have something related. On Raspbian I have a similar issue after switching to the 64-bit kernel; all containers are still 32-bit. After the change, multiple entries in /proc inside the containers were not updated (I also tested a 64-bit container, with the same result). What the affected entries had in common was a size of 4096 bytes instead of the usual 0 bytes. As a workaround, I added the following one-liner to the startup process of every container: /usr/bin/find /proc/ -maxdepth 1 -size 4096c -exec /bin/umount {} \; I think it is related to cpuinfo but not limited to only that entry. Then again, the Raspbian kernel is a bit special anyway. I hope someone finds this workaround useful.

mihalicyn commented 1 year ago

@lanmarc77 this issue was already fixed in https://github.com/lxc/lxcfs/pull/567. You just need to update lxcfs on your machines.