lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Bug on checking project limits on a cluster #463

Closed. victoitor closed this issue 9 months ago.

victoitor commented 9 months ago

Required information

pargo@dedicado01:~$ incus info
config:
  cluster.https_address: 10.11.16.31:8443
  core.https_address: 10.11.16.31:8443
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: pargo
auth_user_method: unix
environment:
  addresses:
  - 10.11.16.31:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICCjCCAY+gAwIBAgIQUa/Tx9UBZLhQOZ+DMF248DAKBggqhkjOPQQDAzA1MRkw
    FwYDVQQKExBMaW51eCBDb250YWluZXJzMRgwFgYDVQQDDA9yb290QGRlZGljYWRv
    MDEwHhcNMjQwMjAxMTcyMTM2WhcNMzQwMTI5MTcyMTM2WjA1MRkwFwYDVQQKExBM
    aW51eCBDb250YWluZXJzMRgwFgYDVQQDDA9yb290QGRlZGljYWRvMDEwdjAQBgcq
    hkjOPQIBBgUrgQQAIgNiAATYduTc2YyCihctepmfxgpMo1Lk4RMEEgVQS8N5Bi3k
    DWGxSYHrx+OgJDdgXxqBVVe7vPNPoXuqPx6NWG5rSZIsUITqbMiNfz79S8cC8MQL
    2uLPiC4FxG8P9yqZCwBYmd6jZDBiMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAK
    BggrBgEFBQcDATAMBgNVHRMBAf8EAjAAMC0GA1UdEQQmMCSCCmRlZGljYWRvMDGH
    BH8AAAGHEAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaQAwZgIxAMJ+luSH
    UbN5+mbEWc8yFNCDu79BzqGJsxE2QCuux2n1I2jcD7nAJWjVRb00OzAZBAIxAOkW
    2DmfT1zkptY1DTDqy2R0XpdDk1WzMOAIuTjbr3hq1atLXlsui+ojAoWQEuooGA==
    -----END CERTIFICATE-----
  certificate_fingerprint: 6894193547d530fae707158fc609ffd8b75a7d9074276bc00f2808ae90530f38
  driver: qemu | lxc
  driver_version: 8.2.1 | 5.0.3
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.1.0-17-amd64
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Debian GNU/Linux
  os_version: "12"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: dedicado01
  server_pid: 1041
  server_version: 0.5.1
  storage: btrfs
  storage_version: "6.2"
  storage_supported_drivers:
  - name: btrfs
    version: "6.2"
    remote: false
  - name: dir
    version: "1"
    remote: false

Issue description

Project limits.instances is not enforced correctly in a cluster when its value is a multiple of the number of cluster members. In particular, I have a cluster of 3 machines with no other instances running. When I create a project whose limits.instances is not a multiple of the number of machines, the limit works. When it is a multiple of the number of machines, the effective limit is one higher than configured (1 when set to 0, 4 when set to 3, and 7 when set to 6). I cannot reproduce this issue in a non-clustered environment.

My guess is that this is related to the cluster scheduler, since it places instances by cycling through the machines. Whenever it is about to start a new cycle, it seems to start the instance without checking project limits.

Steps to reproduce

  1. Create a cluster of 3 machines.
  2. Create a project and set limits.instances=3.
  3. Attempt to create 5 instances in this project. The first 4 are created but the fifth one fails.

The precise commands I ran can be found in the forum post on discuss.linuxcontainers.org, where I also show more evidence linking this bug to the scheduler.

victoitor commented 9 months ago

Here I'll post what I found by looking at the source code. I'm new to this codebase and have never programmed in Go, but I'll do my best to figure out this issue.

The function that creates an instance is instancesPost, with the following definition.

func instancesPost(d *Daemon, r *http.Request) response.Response {

Right at the beginning of this function, note the following line, which will become important later.

    clusterNotification := isClusterNotification(r)

In the line above, r is the HTTP request passed to instancesPost.

A few lines down there is some possibly problematic code, related to a bug I previously reported against LXD.

    // If we're getting binary content, process separately
    if r.Header.Get("Content-Type") == "application/octet-stream" {
        return createFromBackup(s, r, targetProjectName, r.Body, r.Header.Get("X-Incus-pool"), r.Header.Get("X-Incus-name"))
    }

The bug I reported previously was also about project limits not being checked, but only when importing an image. This part of the code is possibly problematic because it forks instancesPost into two paths that implement the same logic in different ways, which invites similar problems in the future. Should this be refactored so that the project limits check is done in only one place? There may be other logic that is likewise checked in two places instead of one.

For example, in createFromBackup, the project limits check is done in the following lines, inside the call to s.DB.Cluster.Transaction.

    err = s.DB.Cluster.Transaction(s.ShutdownCtx, func(ctx context.Context, tx *db.ClusterTx) error {
        req = api.InstancesPost{
            InstancePut: bInfo.Config.Container.InstancePut,
            Name:        bInfo.Name,
            Source:      api.InstanceSource{}, // Only relevant for "copy" or "migration", but may not be nil.
            Type:        api.InstanceType(bInfo.Config.Container.Type),
        }

        return project.AllowInstanceCreation(tx, projectName, req)
    })

Inside instancesPost, the project limits check is also done inside a call to s.DB.Cluster.Transaction, but on the following lines.

        if !clusterNotification {
            // Check that the project's limits are not violated. Note this check is performed after
            // automatically generated config values (such as ones from an InstanceType) have been set.
            err = project.AllowInstanceCreation(tx, targetProjectName, req)
            if err != nil {
                return err
            }

In this case, the limits are checked only if clusterNotification (the variable mentioned earlier) is false. It's important to note that no similar clusterNotification check exists inside createFromBackup, so the two paths behave differently.
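As a rough illustration of the single-place check suggested above, both paths could call one helper that owns the notification exemption and the limits check. This is only a sketch: the helper name and signature are mine, and only the calls it wraps come from the excerpts quoted above.

    // Hypothetical sketch, not actual Incus code: one place for the project
    // limits check, callable from instancesPost and createFromBackup alike.
    // The name and signature are invented; the wrapped calls mirror the
    // excerpts quoted above.
    func allowInstanceCreationIfNeeded(s *state.State, r *http.Request, projectName string, req api.InstancesPost) error {
        // A cluster notification means another member already handled the
        // user-initiated request (and, presumably, its limits check).
        if isClusterNotification(r) {
            return nil
        }

        return s.DB.Cluster.Transaction(s.ShutdownCtx, func(ctx context.Context, tx *db.ClusterTx) error {
            return project.AllowInstanceCreation(tx, projectName, req)
        })
    }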

isClusterNotification is in another file and is very short.

// Return true if this an API request coming from a cluster node that is
// notifying us of some user-initiated API request that needs some action to be
// taken on this node as well.
func isClusterNotification(r *http.Request) bool {
    return r.Header.Get("User-Agent") == clusterRequest.UserAgentNotifier
}
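
To make the classification concrete, here is a small standalone toy (not the Incus code, and the User-Agent value is made up) showing how such a header check treats a direct user request versus a request forwarded as a cluster notification.

    // Standalone toy, not Incus code: classify requests the same way
    // isClusterNotification does, using a made-up User-Agent value.
    package main

    import (
        "fmt"
        "net/http"
    )

    const userAgentNotifier = "cluster-notifier" // hypothetical value

    func isClusterNotification(r *http.Request) bool {
        return r.Header.Get("User-Agent") == userAgentNotifier
    }

    func main() {
        // A user-initiated request: no notifier User-Agent.
        direct, _ := http.NewRequest("POST", "http://example/1.0/instances", nil)

        // The same request forwarded between cluster members as a notification.
        forwarded, _ := http.NewRequest("POST", "http://example/1.0/instances", nil)
        forwarded.Header.Set("User-Agent", userAgentNotifier)

        fmt.Println(isClusterNotification(direct))    // false -> limits are checked
        fmt.Println(isClusterNotification(forwarded)) // true  -> limits check is skipped
    }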

I can see the comment on what it does, but I'm not sure I understand why project limits would not be checked when this returns true. Furthermore, why would this function return different values for the different calls I made?

victoitor commented 9 months ago

From the code, it doesn't look like this is related to instance scheduling after all. I've been thinking about it, and my (wild) guess is that requests cycle through the cluster members and isClusterNotification might return true when the call is made towards the same cluster node that originated it. In that case, it makes no sense for project limits not to be checked, since being a notification is unrelated to whether the limit applies.

It's still awkward that project limits are checked in a different manner when importing an image than when creating an instance from scratch.
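
To see how skipping the check on a single request lines up with the off-by-one I'm observing, here is a standalone toy model (not Incus code; the names and the counter are made up) of a limits.instances=3 project where one creation slips through as a notification.

    // Standalone toy model, not Incus code: shows how skipping the limits
    // check on one misclassified request yields exactly one instance over
    // the configured limit.
    package main

    import (
        "errors"
        "fmt"
    )

    const limitInstances = 3 // analogous to limits.instances=3

    type fakeProject struct{ instances int }

    // allowInstanceCreation stands in for the real project limits check.
    func (p *fakeProject) allowInstanceCreation() error {
        if p.instances >= limitInstances {
            return errors.New("instance limit reached")
        }
        return nil
    }

    // create mimics the guarded path in instancesPost: the check is skipped
    // when the request is treated as a cluster notification.
    func (p *fakeProject) create(clusterNotification bool) error {
        if !clusterNotification {
            if err := p.allowInstanceCreation(); err != nil {
                return err
            }
        }
        p.instances++
        return nil
    }

    func main() {
        p := &fakeProject{}

        for i := 0; i < 3; i++ {
            fmt.Println(p.create(false), p.instances) // <nil> 1, <nil> 2, <nil> 3
        }
        fmt.Println(p.create(true), p.instances)  // check skipped: <nil> 4
        fmt.Println(p.create(false), p.instances) // instance limit reached 4
    }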

stgraber commented 9 months ago

Reproduced and debugging this one now