kubevirt / kubevirt

Kubernetes Virtualization API and runtime in order to define and manage virtual machines.
https://kubevirt.io
Apache License 2.0

VM does not start if you use more than 128Gi RAM and 1/2/3/N GPUs #7279

Closed sergeimonakhov closed 11 months ago

sergeimonakhov commented 2 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

The VM does not start if it uses more than 128Gi of RAM together with 1/2/3/N GPUs. Everything works correctly at 128Gi.

What you expected to happen:

The VM works correctly with GPUs and more than 128Gi of RAM.

How to reproduce it (as minimally and precisely as possible): Create a VM with 1 GPU and 129Gi of RAM, or with 2/3/4 GPUs and 128Gi of RAM.
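For illustration, a minimal VirtualMachineInstance manifest matching the failing configuration might look like the sketch below. The name, disk image, and CPU count are placeholders; only the combination of a passthrough GPU with more than 128Gi of memory is taken from the report (the `nvidia.com/gpu` resource name appears in the logs further down).

```yaml
# Hypothetical reproduction manifest: one passthrough GPU plus more than 128Gi of RAM.
# All names and the containerDisk image are placeholders.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-bigmem-repro
spec:
  domain:
    resources:
      requests:
        cpu: "8"
        memory: 129Gi          # starts fine at 128Gi, fails above it once a GPU is attached
    devices:
      gpus:
        - name: gpu1
          deviceName: nvidia.com/gpu   # resource name exposed by the GPU device plugin
      disks:
        - name: rootdisk
          disk:
            bus: virtio
  volumes:
    - name: rootdisk
      containerDisk:
        image: quay.io/containerdisks/fedora:latest   # placeholder image
```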

Anything else we need to know?: logs:

{"component":"virt-launcher","level":"info","msg":"Collected all requested hook sidecar sockets","pos":"manager.go:74","timestamp":"2022-02-25T12:50:14.627208Z"}
{"component":"virt-launcher","level":"info","msg":"Sorted all collected sidecar sockets per hook point based on their priority and name: map[]","pos":"manager.go:77","timestamp":"2022-02-25T12:50:14.627297Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon: qemu:///system","pos":"libvirt.go:492","timestamp":"2022-02-25T12:50:14.641562Z"}
{"component":"virt-launcher","level":"info","msg":"Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')","pos":"libvirt.go:500","timestamp":"2022-02-25T12:50:14.644998Z"}
{"component":"virt-launcher","level":"info","msg":"libvirt version: 7.6.0, package: 4.el8s (CBS \u003ccbs@centos.org\u003e, 2021-10-01-15:39:13, )","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"info","msg":"hostname: virtual-machine-nvme","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: Child process (dmidecode -q -t 0,1,2,3,4,11,17) unexpected exit status 1: /dev/mem: No such file or directory","pos":"virCommandWait:2749","subcomponent":"libvirt","thread":"60","timestamp":"2022-02-25T12:50:14.667000Z"}
{"component":"virt-launcher","level":"info","msg":"Connected to libvirt daemon","pos":"libvirt.go:508","timestamp":"2022-02-25T12:50:15.146955Z"}
{"component":"virt-launcher","level":"info","msg":"Registered libvirt event notify callback","pos":"client.go:507","timestamp":"2022-02-25T12:50:15.153956Z"}
{"component":"virt-launcher","level":"info","msg":"Marked as ready","pos":"virt-launcher.go:80","timestamp":"2022-02-25T12:50:15.154237Z"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:15.800941Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:15.801082Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Executing PreStartHook on VMI pod environment","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:450","timestamp":"2022-02-25T12:50:15.801772Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Starting PreCloudInitIso hook","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:471","timestamp":"2022-02-25T12:50:15.801835Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Found nameservers in /etc/resolv.conf: \n\ufffd\u0000\n","pos":"network.go:274","timestamp":"2022-02-25T12:50:15.802764Z"}
{"component":"virt-launcher","level":"info","msg":"Found search domains in /etc/resolv.conf: test.svc.k8s.local svc.k8s.local k8s.local","pos":"network.go:275","timestamp":"2022-02-25T12:50:15.802800Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /dev/datavolume set to none","pos":"converter.go:413","timestamp":"2022-02-25T12:50:15.802907Z"}
{"component":"virt-launcher","level":"info","msg":"Driver IO mode for /dev/datavolume set to native","pos":"converter.go:454","timestamp":"2022-02-25T12:50:15.802940Z"}
{"component":"virt-launcher","level":"info","msg":"Driver cache mode for /var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso set to none","pos":"converter.go:413","timestamp":"2022-02-25T12:50:15.803043Z"}
{"component":"virt-launcher","level":"info","msg":"Starting SingleClientDHCPServer","pos":"server.go:63","timestamp":"2022-02-25T12:50:15.802992Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 75 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:15.921378Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 77 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.059875Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 83 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.089786Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 85 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:16.128911Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain defined.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:722","timestamp":"2022-02-25T12:50:16.452443Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:16.452724Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Shutoff(5):Unknown(0)","pos":"client.go:283","timestamp":"2022-02-25T12:50:16.456999Z"}
{"component":"virt-launcher","level":"info","msg":"Successfully connected to domain notify socket at /var/run/kubevirt/domain-notify-pipe.sock","pos":"client.go:162","timestamp":"2022-02-25T12:50:16.460907Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:16.464192Z"}
{"component":"virt-launcher","level":"info","msg":"Monitoring loop: rate 1s start timeout 4m48s","pos":"monitor.go:177","timestamp":"2022-02-25T12:50:16.466945Z"}
{"component":"virt-launcher","level":"info","msg":"generated nocloud iso file /var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso","pos":"cloud-init.go:639","timestamp":"2022-02-25T12:50:16.564337Z"}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetectControllers:455","subcomponent":"libvirt","thread":"47","timestamp":"2022-02-25T12:50:16.605000Z"}
{"component":"virt-launcher","level":"info","msg":"2022-02-25 12:50:16.598+0000: starting up libvirt version: 7.6.0, package: 4.el8s (CBS \u003ccbs@centos.org\u003e, 2021-10-01-15:39:13, ), qemu version: 6.0.0qemu-kvm-6.0.0-33.el8s, kernel: 5.13.0-28-generic, hostname: virtual-machine-nvme","subcomponent":"qemu","timestamp":"2022-02-25T12:50:16.642827Z"}
{"component":"virt-launcher","level":"info","msg":"LC_ALL=C \\PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \\HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a \\XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.local/share \\XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.cache \\XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/.config \\/usr/libexec/qemu-kvm \\-name guest=test_virtual-machine-nvme,debug-threads=on \\-S \\-object '{\"qom-type\":\"secret\",\"id\":\"masterKey0\",\"format\":\"raw\",\"file\":\"/var/lib/libvirt/qemu/domain-1-7db81398-eacc-404c-a/master-key.aes\"}' \\-machine pc-q35-rhel8.5.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \\-cpu EPYC-Rome,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,cmp-legacy=on,ibrs=on,amd-ssbd=on,virt-ssbd=on,svme-addr-chk=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,kvm=off \\-m 262144 \\-object '{\"qom-type\":\"memory-backend-ram\",\"id\":\"pc.ram\",\"size\":274877906944}' \\-overcommit mem-lock=off \\-smp 64,sockets=64,dies=1,cores=1,threads=1 \\-object '{\"qom-type\":\"iothread\",\"id\":\"iothread1\"}' \\-uuid 85383141-65ab-51c9-a2d0-e9d6b0f9543d \\-smbios type=1,manufacturer=KubeVirt,product=None,uuid=85383141-65ab-51c9-a2d0-e9d6b0f9543d,family=KubeVirt \\-no-user-config \\-nodefaults \\-chardev socket,id=charmonitor,fd=19,server=on,wait=off \\-mon chardev=charmonitor,id=monitor,mode=control \\-rtc base=utc \\-no-shutdown \\-boot strict=on \\-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2 \\-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1 \\-device pcie-root-port,port=0x12,chassis=3,id=pci.3,bus=pcie.0,addr=0x2.0x2 \\-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \\-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \\-device pcie-root-port,port=0x15,chassis=6,id=pci.6,bus=pcie.0,addr=0x2.0x5 \\-device pcie-root-port,port=0x16,chassis=7,id=pci.7,bus=pcie.0,addr=0x2.0x6 \\-device pcie-root-port,port=0x17,chassis=8,id=pci.8,bus=pcie.0,addr=0x2.0x7 \\-device virtio-scsi-pci-non-transitional,id=scsi0,bus=pci.2,addr=0x0 \\-device virtio-serial-pci-non-transitional,id=virtio-serial0,bus=pci.3,addr=0x0 \\-blockdev '{\"driver\":\"host_device\",\"filename\":\"/dev/datavolume\",\"aio\":\"native\",\"node-name\":\"libvirt-2-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"}' \\-blockdev '{\"node-name\":\"libvirt-2-format\",\"read-only\":false,\"discard\":\"unmap\",\"cache\":{\"direct\":true,\"no-flush\":false},\"driver\":\"raw\",\"file\":\"libvirt-2-storage\"}' \\-device virtio-blk-pci-non-transitional,bus=pci.4,addr=0x0,drive=libvirt-2-format,id=ua-datavolume,bootindex=1,write-cache=on,werror=stop,rerror=stop \\-blockdev '{\"driver\":\"file\",\"filename\":\"/var/run/kubevirt-ephemeral-disks/cloud-init-data/test/virtual-machine-nvme/configdrive.iso\",\"node-name\":\"libvirt-1-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"}' \\-blockdev '{\"node-name\":\"libvirt-1-format\",\"read-only\":false,\"discard\":\"unmap\",\"cache\":{\"direct\":true,\"no-flush\":false},\"driver\":\"raw\",\"file\":\"libvirt-1-storage\"}' \\-device 
virtio-blk-pci-non-transitional,bus=pci.5,addr=0x0,drive=libvirt-1-format,id=ua-cloudinitdisk,write-cache=on,werror=stop,rerror=stop \\-netdev tap,fd=21,id=hostua-default,vhost=on,vhostfd=22 \\-device virtio-net-pci-non-transitional,host_mtu=9000,netdev=hostua-default,id=ua-default,mac=52:54:00:0a:b3:7b,bus=pci.1,addr=0x0,romfile= \\-chardev socket,id=charserial0,fd=23,server=on,wait=off \\-device isa-serial,chardev=charserial0,id=serial0 \\-chardev socket,id=charchannel0,fd=24,server=on,wait=off \\-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \\-audiodev id=audio1,driver=none \\-vnc vnc=unix:/var/run/kubevirt-private/2c902644-a56c-48d4-91fa-a784927b0a90/virt-vnc,audiodev=audio1 \\-device VGA,id=video0,vgamem_mb=16,bus=pcie.0,addr=0x1 \\-device vfio-pci,host=0000:cb:00.0,id=ua-gpu-gpu1,bus=pci.6,addr=0x0 \\-device virtio-balloon-pci-non-transitional,id=balloon0,bus=pci.7,addr=0x0 \\-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \\-msg timestamp=on","subcomponent":"qemu","timestamp":"2022-02-25T12:50:16.643054Z"}
{"component":"virt-launcher","level":"info","msg":"Found PID for 85383141-65ab-51c9-a2d0-e9d6b0f9543d: 94","pos":"monitor.go:139","timestamp":"2022-02-25T12:50:17.468809Z"}
{"component":"virt-launcher","level":"info","msg":"GuestAgentLifecycle event state 2 with reason 1 received","pos":"client.go:490","timestamp":"2022-02-25T12:50:49.899438Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Paused(3):StartingUp(11)","pos":"client.go:283","timestamp":"2022-02-25T12:50:49.902399Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:49.904253Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 4 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:50.100936Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 2 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:50.105806Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain started.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:750","timestamp":"2022-02-25T12:50:50.107522Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(1)","pos":"client.go:283","timestamp":"2022-02-25T12:50:50.108313Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:50.111450Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.111993Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.112203Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.112275Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Running(1):Unknown(1)","pos":"client.go:283","timestamp":"2022-02-25T12:50:50.114825Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.115432Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:50.116613Z"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.158810Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.158923Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.162269Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.288742Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.288838Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.292652Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"warning","msg":"MDEV_PCI_RESOURCE_NVIDIA_COM_GPU not set for resource nvidia.com/gpu","pos":"addresspool.go:50","timestamp":"2022-02-25T12:50:50.314774Z"}
{"component":"virt-launcher","level":"info","msg":"host-device created: 0000:cb:00.0","pos":"hostdev.go:79","timestamp":"2022-02-25T12:50:50.314866Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Synced vmi","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:190","timestamp":"2022-02-25T12:50:50.318719Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"Process 85383141-65ab-51c9-a2d0-e9d6b0f9543d and pid 94 is a zombie, sending SIGCHLD to pid 1 to reap process","pos":"monitor.go:155","timestamp":"2022-02-25T12:50:52.467304Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:277","timestamp":"2022-02-25T12:50:52.467404Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 0 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:52.467556Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: End of file from qemu monitor","pos":"qemuMonitorIO:582","subcomponent":"libvirt","thread":"95","timestamp":"2022-02-25T12:50:53.813000Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 94 with status 9","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:56.147688Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 5 with reason 5 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:56.220935Z"}
{"component":"virt-launcher","level":"info","msg":"kubevirt domain status: Shutoff(5):Crashed(3)","pos":"client.go:283","timestamp":"2022-02-25T12:50:56.225347Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: test_virtual-machine-nvme","pos":"client.go:408","timestamp":"2022-02-25T12:50:56.227309Z"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Domain undefined.","name":"virtual-machine-nvme","namespace":"test","pos":"manager.go:1252","timestamp":"2022-02-25T12:50:56.256561Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi deletion","name":"virtual-machine-nvme","namespace":"test","pos":"server.go:312","timestamp":"2022-02-25T12:50:56.256661Z","uid":"2c902644-a56c-48d4-91fa-a784927b0a90"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 1 with reason 0 received","pos":"client.go:433","timestamp":"2022-02-25T12:50:56.256663Z"}
{"component":"virt-launcher","level":"info","msg":"Domain name event: ","pos":"client.go:408","timestamp":"2022-02-25T12:50:56.257771Z"}
{"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:292","timestamp":"2022-02-25T12:50:56.257828Z"}
{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:547","timestamp":"2022-02-25T12:50:56.257859Z"}
{"component":"virt-launcher","level":"info","msg":"Received signal terminated","pos":"virt-launcher.go:490","timestamp":"2022-02-25T12:50:56.334042Z"}
{"component":"virt-launcher","level":"error","msg":"timeout on stopping the cmd server, continuing anyway.","pos":"server.go:558","timestamp":"2022-02-25T12:50:57.258369Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:519","timestamp":"2022-02-25T12:50:57.258496Z"}
{"component":"virt-launcher","level":"info","msg":"Reaped pid 25 with status 0","pos":"virt-launcher.go:549","timestamp":"2022-02-25T12:50:57.266174Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp 127.0.0.1:15021: connect: no route to host","timestamp":"2022-02-25T12:51:00.092210Z"}

Environment:

dhiller commented 2 years ago

Does this also happen when no GPUs are involved? We would like to narrow down exactly when this is happening.

sergeimonakhov commented 2 years ago

> Does this also happen when no GPUs are involved? We would like to narrow down exactly when this is happening.

Without a GPU there is no problem at any memory size.

dhiller commented 2 years ago

@vladikr do you have an idea how memory size could be related to GPUs?

sergeimonakhov commented 2 years ago

Hi! Do you have any ideas? I can run additional tests.

sergeimonakhov commented 2 years ago

@dhiller @vladikr

xpivarc commented 2 years ago

Hi @D1abloRUS, I think logs with higher verbosity might be helpful. Also, can you reproduce this with plain qemu, outside of KubeVirt?

sergeimonakhov commented 2 years ago

Hi @xpivarc @vladikr, the problem is related to qemu-kvm being OOM-killed; I had to allocate more memory for qemu-kvm. I have compiled a table: there is a correlation between the number of GPUs, the amount of RAM, and how much memory has to be left for qemu-kvm so that it does not crash. Do you have any ideas what this might be related to?
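For reference, one way to leave such headroom for the qemu-kvm process is to set the guest-visible memory lower than the virt-launcher pod's memory request; the difference remains available to qemu and its VFIO mappings. A minimal sketch, with an 8Gi gap chosen purely for illustration rather than as a tested recommendation:

```yaml
# Hypothetical snippet: the guest sees 128Gi while the pod requests 136Gi,
# leaving roughly 8Gi of slack for qemu-kvm and its VFIO mappings.
# The gap actually needed depends on the number and type of assigned devices.
spec:
  domain:
    memory:
      guest: 128Gi
    resources:
      requests:
        memory: 136Gi
```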

sergeimonakhov commented 2 years ago

There is also another problem: no matter how much memory I leave to the guest, a VM with VFIO does not start if the memory is more than 480GB.

xpivarc commented 2 years ago

Hi @D1abloRUS, I will try to look into it closely. Let me recap to be sure I understand: the first problem is that our overhead calculation seems to be wrong when multiple GPUs are requested? Why does it not start with 480GB? Is it the same issue, and do you have a plain qemu reference?

vladikr commented 2 years ago

Hi, I somehow missed this issue; very sorry for my late reply. In general, qemu/libvirt, and consequently KubeVirt, adds a 1GB "fudge factor" to the memory overhead and locks it. Unfortunately, a manual adjustment to this overhead calculation is necessary when multiple VFIO devices are present. This also depends on the assigned device itself; we've seen some devices consume more qemu memory than others.

Regarding the 480GB limit: I didn't know such a limit existed. I'll look into that.
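Later KubeVirt releases also expose a cluster-wide setting, `additionalGuestMemoryOverheadRatio`, that scales the calculated overhead for every VM. If it is available in the deployed version, setting it might look like the sketch below; the ratio of "2" is an arbitrary example, not a recommendation.

```yaml
# Hypothetical KubeVirt CR fragment: multiply the computed memory overhead
# (the "fudge factor" discussed above) cluster-wide instead of tuning each VM.
# Field availability and a sensible ratio depend on the KubeVirt version in use.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    additionalGuestMemoryOverheadRatio: "2"
```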

vladikr commented 2 years ago

By the way @D1abloRUS, when you're allocating >480GB, are you running a single VMI per node? Are you allocating regular RAM, or are these hugepages?

sergeimonakhov commented 2 years ago

@vladikr hi,

are you running a single VMI per node?

Yes

Are you allocating RAM or are these hugepages?

RAM

kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot commented 1 year ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot commented 1 year ago

@kubevirt-bot: Closing this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/7279#issuecomment-1356281227):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
xpivarc commented 1 year ago

/reopen
/remove-lifecycle rotten

kubevirt-bot commented 1 year ago

@xpivarc: Reopened this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/7279#issuecomment-1358954410):

> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot commented 1 year ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot commented 1 year ago

@kubevirt-bot: Closing this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/7279#issuecomment-1554242139):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
xpivarc commented 1 year ago

/reopen
/remove-lifecycle rotten

kubevirt-bot commented 1 year ago

@xpivarc: Reopened this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/7279#issuecomment-1554478935):

> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
tranrn commented 1 year ago

Same with CentOS 9 KVM (libvirtd (libvirt) 9.3.0). We have a 2-CPU, 4-GPU server with 512 GB RAM. When the VM used 4 VFIO GPUs and 200 GB of RAM, it worked fine. When we gave it 480 GB of RAM, the VM became unusable after the KVM backup script ran (just a snapshot and a qcow copy).

kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot commented 11 months ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot commented 11 months ago

@kubevirt-bot: Closing this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/7279#issuecomment-1805287811):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
niuplayer commented 4 months ago

Hi @sergeimonakhov, have you found a solution to the problem above (unable to start when there are more than 3 GPUs)?

dhruvik7 commented 2 months ago

I'm having a similar issue! When dedicatedCpuPlacement is enabled, I can't seem to get a VM with more than 1 GPU to start.

xpivarc commented 4 weeks ago

https://github.com/kubevirt/kubevirt/issues/12565#issuecomment-2327756761 is a valid workaround.