VM does not start if you use more 128Gi ram and 1/2/3/N GPU

sergeimonakhov commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

VM does not start if you use more 128Gi RAM and 1/2/3/N GPU. Everything works correctly on 128Gi.

What you expected to happen:

VM works correctly with GPU and RAM larger than 128Gi*

How to reproduce it (as minimally and precisely as possible): Create VM with 1 GPU and 129Gi ram or 2/3/4 GPU and 128Gi ram.

Anything else we need to know?: logs:

{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:74","timestamp":"2022-02-25T12:50:14.627208Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:77","timestamp":"2022-02-25T12:50:14.627297Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu:///system","pos":"libvirt.go:492","timestamp":"2022-02-25T12:50:14.641562Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')","pos":"libvirt.go:500","timestamp":"2022-02-25T12:50:14.644998Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 7.6.0, package: 4.el8s (CBS \u003ccbs@centos.org\u003e, 2021-10-01-15:39:13, )","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: virtual-machine-nvme","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2749","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:508","timestamp":"2022-02-25T12:50:15.146955Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:507","timestamp":"2022-02-25T12:50:15.153956Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:80","timestamp":"2022-02-25T12:50:15.154237Z"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:15.800941Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:15.801082Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Executing PreStartHook on VMI pod environment","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:450","timestamp":"2022-02-25T12:50:15.801772Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Starting PreCloudInitIso hook","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:471","timestamp":"2022-02-25T12:50:15.801835Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Found nameservers in /etc/resolv.conf: \n\ufffd\u0000\n","pos":"network.go:274","timestamp":"2022-02-25T12:50:15.802764Z"}
{"component":"virt-launcher","level":"info","msg":"Found search domains in /etc/resolv.conf: test.svc.k8s.local svc.k8s.local k8s.local","pos":"network.go:275","timestamp":"2022-02-25T12:50:15.802800Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /dev/datavolume set to none","pos":"converter.go:413","timestamp":"2022-02-25T12:50:15.802907Z"}
{"component":"virt-launcher","level":"info","msg":"Driver IO mode for /dev/datavolume set to native","pos":"converter.go:454","timestamp":"2022-02-25T12:50:15.802940Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso set to none","pos":"converter.go:413","timestamp":"2022-02-25T12:50:15.803043Z"}
{"component":"virt-launcher","level":"info","msg":"Starting SingleClientDHCPServer","pos":"server.go:63","timestamp":"2022-02-25T12:50:15.802992Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 75 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:15.921378Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 77 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.059875Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 83 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.089786Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 85 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.128911Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain defined.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:722","timestamp":"2022-02-25T12:50:16.452443Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:16.452724Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Shutoff(5):Unknown(0)","pos":"client.go:283","timestamp":"2022-02-25T12:50:16.456999Z"}
{"component":"virt-launcher","level":"info","msg":"Successfully connected to domain notify socket at /var/run/kubevirt/domain-notify-pipe.sock","pos":"client.go:162","timestamp":"2022-02-25T12:50:16.460907Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:16.464192Z"}
{"component":"virt-launcher","level":"info","msg":"Monitoring loop: rate 1s start timeout 4m48s","pos":"monitor.go:177","timestamp":"2022-02-25T12:50:16.466945Z"}
{"component":"virt-launcher","level":"info","msg":"generated nocloud iso file /var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso","pos":"cloud-init.go:639","timestamp":"2022-02-25T12:50:16.564337Z"}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetectControllers:455","subcomponent":"libvirt","thread":"47","timestamp":"2022-02-25T12:50:16.605000Z"}
{"component":"virt-launcher","level":"info","msg":"2022-02-25 12:50:16.598+0000: starting up libvirt version: 7.6.0, package: 4.el8s (CBS \u003ccbs@centos.org\u003e, 2021-10-01-15:39:13, ), qemu version: 6.0.0qemu-kvm-6.0.0-33.el8s, kernel: 5.13.0-28-generic, hostname: virtual-machine-nvme","subcomponent":"qemu","timestamp":"2022-02-25T12:50:16.642827Z"}
{"component":"virt-launcher","level":"info","msg":"LC_ALL=C \\PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \\HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a \\XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.local/share \\XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.cache \\XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.config \\/usr/libexec/qemu-kvm \\-name guest=test_virtual-machine-nvme,debug-threads=on \\-S \\-object '{\"qom-type\":\"secret\",\"id\":\"masterKey0\",\"format\":\"raw\",\"file\":\"/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/master-key.aes\"}' \\-machine pc-q35-rhel8.5.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \\-cpu EPYC-Rome,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,cmp-legacy=on,ibrs=on,amd-ssbd=on,virt-ssbd=on,svme-addr-chk=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,kvm=off \\-m 262144 \\-object '{\"qom-type\":\"memory-backend-ram\",\"id\":\"pc.ram\",\"size\":274877906944}' \\-overcommit mem-lock=off \\-smp 64,sockets=64,dies=1,cores=1,threads=1 \\-object '{\"qom-type\":\"iothread\",\"id\":\"iothread1\"}' \\-uuid 85383141-65ab-51c9-a2d0-e9d6b0f9543d \\-smbios type=1,manufacturer=KubeVirt,product=None,uuid=85383141-65ab-51c9-a2d0-e9d6b0f9543d,family=KubeVirt \\-no-user-config \\-nodefaults \\-chardev socket,id=charmonitor,fd=19,server=on,wait=off \\-mon chardev=charmonitor,id=monitor,mode=control \\-rtc base=utc \\-no-shutdown \\-boot strict=on \\-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \\-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \\-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \\-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \\-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \\-device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \\-device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \\-device pcie-root-port,port=0x17,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \\-device virtio-scsi-pci-non-transitional,id=scsi0,bus=pci.2,addr=0x0 \\-device virtio-serial-pci-non-transitional,id=virtio-serial0,bus=pci.3,addr=0x0 \\-blockdev '{\"driver\":\"host_device\",\"filename\":\"/dev/datavolume\",\"aio\":\"native\",\"node-name\":\"libvirt-2-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"}' \\-blockdev '{\"node-name\":\"libvirt-2-format\",\"read-only\":false,\"discard\":\"unmap\",\"cache\":{\"direct\":true,\"no-flush\":false},\"driver\":\"raw\",\"file\":\"libvirt-2-storage\"}' \\-device virtio-blk-pci-non-transitional,bus=pci.4,addr=0x0,drive=libvirt-2-format,id=ua-datavolume,bootindex=1,write-cache=on,werror=stop,rerror=stop \\-blockdev '{\"driver\":\"file\",\"filename\":\"/var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso\",\"node-name\":\"libvirt-1-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"}' \\-blockdev '{\"node-name\":\"libvirt-1-format\",\"read-only\":false,\"discard\":\"unmap\",\"cache\":{\"direct\":true,\"no-flush\":false},\"driver\":\"raw\",\"file\":\"libvirt-1-storage\"}' \\-device virtio-blk-pci-non-transitional,bus=pci.5,addr=0x0,drive=libvirt-1-format,id=ua-cloudinitdisk,write-cache=on,werror=stop,rerror=stop \\-netdev tap,fd=21,id=hostua-default,vhost=on,vhostfd=22 \\-device virtio-net-pci-non-transitional,host_mtu=9000,netdev=hostua-default,id=ua-default,mac=52:54:00:0a:b3:7b,bus=pci.1,addr=0x0,romfile= \\-chardev socket,id=charserial0,fd=23,server=on,wait=off \\-device isa-serial,chardev=charserial0,id=serial0 \\-chardev socket,id=charchannel0,fd=24,server=on,wait=off \\-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \\-audiodev id=audio1,driver=none \\-vnc vnc=unix:/var/run/kubevirt-private/2c902644-a56c-48d4-91fa-a784927b0a90/virt-vnc,audiodev=audio1 \\-device VGA,id=video0,vgamem_mb=16,bus=pcie.0,addr=0x1 \\-device vfio-pci,host=0000:cb:00.0,id=ua-gpu-gpu1,bus=pci.6,addr=0x0 \\-device virtio-balloon-pci-non-transitional,id=balloon0,bus=pci.7,addr=0x0 \\-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \\-msg timestamp=on","subcomponent":"qemu","timestamp":"2022-02-25T12:50:16.643054Z"}
{"component":"virt-launcher","level":"info","msg":"Found PID for 85383141-65ab-51c9-a2d0-e9d6b0f9543d: 94","pos":"monitor.go:139","timestamp":"2022-02-25T12:50:17.468809Z"}
{"component":"virt-launcher","level":"info","msg":"GuestAgentLifecycle event state 2 with reason 1 received","pos":"client.go:490","timestamp":"2022-02-25T12:50:49.899438Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):StartingUp(11)","pos":"client.go:283","timestamp":"2022-02-25T12:50:49.902399Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:49.904253Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:50.100936Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 2 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:50.105806Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain started.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:750","timestamp":"2022-02-25T12:50:50.107522Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(1)","pos":"client.go:283","timestamp":"2022-02-25T12:50:50.108313Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:50.111450Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.111993Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.112203Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.112275Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(1)","pos":"client.go:283","timestamp":"2022-02-25T12:50:50.114825Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.115432Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:50.116613Z"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.158810Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.158923Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.162269Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.288742Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.288838Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.292652Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.314774Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.314866Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.318719Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Process 85383141-65ab-51c9-a2d0-e9d6b0f9543d and pid 94 is a zombie, sending SIGCHLD to pid 1 to reap process","pos":"monitor.go:155","timestamp":"2022-02-25T12:50:52.467304Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:277","timestamp":"2022-02-25T12:50:52.467404Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 0 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:52.467556Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: End of file from qemu monitor","pos":"qemuMonitorIO:582","subcomponent":"libvirt","thread":"95","timestamp":"2022-02-25T12:50:53.813000Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 94 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:56.147688Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 5 with reason 5 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:56.220935Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Shutoff(5):Crashed(3)","pos":"client.go:283","timestamp":"2022-02-25T12:50:56.225347Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:56.227309Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain undefined.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:1252","timestamp":"2022-02-25T12:50:56.256561Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi deletion","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:312","timestamp":"2022-02-25T12:50:56.256661Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 1 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:56.256663Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: ","pos":"client.go:408","timestamp":"2022-02-25T12:50:56.257771Z"}
{"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:292","timestamp":"2022-02-25T12:50:56.257828Z"}
{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:547","timestamp":"2022-02-25T12:50:56.257859Z"}
{"component":"virt-launcher","level":"info","msg":"Received signal terminated","pos":"virt-launcher.go:490","timestamp":"2022-02-25T12:50:56.334042Z"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:558","timestamp":"2022-02-25T12:50:57.258369Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:519","timestamp":"2022-02-25T12:50:57.258496Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 25 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:57.266174Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: no route to host","timestamp":"2022-02-25T12:51:00.092210Z"}

Environment:

KubeVirt version (use virtctl version): 0.47.1
Kubernetes version (use kubectl version): 1.19.7

VM or VMI specifications:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
name: virtual-machine-test-gpu
spec:
runStrategy: RerunOnFailure
template:
spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
    devices:
      disks:
      - disk:
          bus: virtio
        name: datavolume
      - disk:
          bus: virtio
        name: cloudinitdisk
      gpus:
      - deviceName: nvidia.com/gpu
        name: gpu1
      interfaces:
      - masquerade: {}
        name: default
    machine:
      type: q35
    resources:
      limits:
        cpu: "64"
        memory: 129Gi
      requests:
        cpu: "64"
        memory: 129Gi
  networks:
  - name: default
    pod: {}
  nodeSelector:
    gpu.nvidia.com/model: A100_PCIE_40GB
  terminationGracePeriodSeconds: 300
  volumes:
  - dataVolume:
      name: virtual-machine-test-gpu
    name: datavolume
  - cloudInitConfigDrive:
      userData: |
        #cloud-config
        password: ubuntu
        chpasswd: { expire: False }
        ssh_pwauth: True
    name: cloudinitdisk

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
Kernel (e.g. uname -a): Linux node5 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:

dhiller commented 2 years ago

Does this happen also when there are no GPUs involved? We would like to narrow down when this is happening exactly?

sergeimonakhov commented 2 years ago

Does this happen also when there are no GPUs involved? We would like to narrow down when this is happening exactly?

Without gpu no problem any memory size

dhiller commented 2 years ago

@vladikr do you have an idea how memory size could be related to gpu?

sergeimonakhov commented 2 years ago

Hi! do you have any ideas? I can run additional tests

sergeimonakhov commented 2 years ago

@dhiller @vladikr

xpivarc commented 2 years ago

Hi @D1abloRUS , I think logs with higher verbosity might be helpful. Also, can you reproduce this with plain qemu/outside of Kubevirt?

sergeimonakhov commented 2 years ago

Hi, @xpivarc @vladikr The problem is related to the oom on qemu-kvm, I had to allocate more memory for qemu-kvm. I have compiled a table. There are correlations between the number of GPUs and RAM and how much memory should be left for qemu-kvm so that it does not crash, do you have any ideas what this might be related to?

sergeimonakhov commented 2 years ago

There is also a problem, no matter how much memory I leave to the guest, the vm with vfio does not start if the memory is more than 480gb

xpivarc commented 2 years ago

Hi @D1abloRUS I will try to look into it closely. Let me recap to be sure I understand. The first problem was that our overhead calculation seems to be wrong when multiple gpus are requested? Why does it not start with 480gb, is it the same issue, and do you have a plain qemu reference?

vladikr commented 2 years ago

Hi, I somehow missed this issue. very sorry for my late reply. In general, qemu/libvirt and consequently KubeVirt adds a 1GB "fudge factor" to the memory overheads and lock it. Unfortunately, a manual adjustment to this overhead calculation is necessary when multiple vfio devices are present, This is also dependent on the assigned device itself; we've seen some devices consume more qemu memory than others.

Regarding, the 480gb limits - I didn't know that such limit exist. I'll look into that.

vladikr commented 2 years ago

btw @D1abloRUS, when you're allocating >480gb - are you running a single VMI per node? Are you allocating RAM or are these hugepages?

sergeimonakhov commented 2 years ago

@vladikr hi,

are you running a single VMI per node?

Yes

Are you allocating RAM or are these hugepages?

RAM