Closed sergeimonakhov closed 11 months ago
Does this happen also when there are no GPUs involved? We would like to narrow down when this is happening exactly?
Without a GPU there is no problem at any memory size.
@vladikr do you have an idea how memory size could be related to gpu?
Hi! Do you have any ideas? I can run additional tests.
@dhiller @vladikr
Hi @D1abloRUS , I think logs with higher verbosity might be helpful. Also, can you reproduce this with plain qemu/outside of Kubevirt?
Hi @xpivarc @vladikr, the problem is related to the OOM killer hitting qemu-kvm; I had to allocate more memory for qemu-kvm. I have compiled a table: there is a correlation between the number of GPUs, the amount of RAM, and how much memory must be left for qemu-kvm so that it does not crash. Do you have any ideas what this might be related to?
There is also another problem: no matter how much memory I leave to the guest, a VM with VFIO does not start if the memory is more than 480gb.
Hi @D1abloRUS, I will try to look into it closely. Let me recap to be sure I understand. The first problem is that our overhead calculation seems to be wrong when multiple GPUs are requested? Why does it not start with 480gb: is it the same issue, and do you have a plain qemu reference?
Hi, I somehow missed this issue; very sorry for my late reply. In general, qemu/libvirt, and consequently KubeVirt, adds a 1GB "fudge factor" to the memory overhead and locks it. Unfortunately, a manual adjustment to this overhead calculation is necessary when multiple VFIO devices are present. This also depends on the assigned device itself; we've seen some devices consume more qemu memory than others.
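The overhead model described above can be sketched roughly as follows. This is a simplified illustration, not KubeVirt's actual formula; every constant except the flat 1 GiB VFIO "fudge factor" mentioned above is a hypothetical placeholder.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def estimate_qemu_overhead(guest_ram: int, vcpus: int, vfio_devices: int) -> int:
    """Rough sketch of a qemu/KubeVirt-style memory overhead estimate.

    Only the flat 1 GiB added for VFIO comes from the discussion above;
    the other constants are illustrative placeholders.
    """
    overhead = 256 * MIB            # hypothetical fixed base for qemu/libvirt
    overhead += guest_ram // 512    # hypothetical page-table-style term that grows with RAM
    overhead += 8 * MIB * vcpus     # hypothetical per-vCPU structures
    if vfio_devices > 0:
        # The flat 1 GiB "fudge factor" added when VFIO devices are
        # assigned; as noted above, this can be too small when multiple
        # devices are present.
        overhead += 1 * GIB
    return overhead
```

Note that with 480 GiB of guest RAM a RAM-proportional term like `guest_ram // 512` is already close to 1 GiB on its own, which hints at why a flat fudge factor can fall short as RAM and device count grow.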
Regarding the 480gb limit - I didn't know that such a limit exists. I'll look into that.
btw @D1abloRUS, when you're allocating >480gb - are you running a single VMI per node? Are you allocating RAM or are these hugepages?
@vladikr hi,
are you running a single VMI per node?
Yes
Are you allocating RAM or are these hugepages?
RAM
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
/close
@kubevirt-bot: Closing this issue.
/reopen /remove-lifecycle rotten
@xpivarc: Reopened this issue.
@kubevirt-bot: Closing this issue.
/reopen /remove-lifecycle rotten
@xpivarc: Reopened this issue.
Same with CentOS 9 KVM (libvirtd (libvirt) 9.3.0). We have a 2-CPU, 4-GPU server with 512 GB RAM. When the VM used 4 VFIO GPUs and 200 GB RAM, it worked fine. When we gave it 480 GB RAM, the VM became unusable after the KVM backup script ran (just a snapshot and a qcow copy).
@kubevirt-bot: Closing this issue.
Hi @sergeimonakhov, have you found a solution to the problem above: unable to start when there are more than 3 GPUs?
I'm having a similar issue! When dedicatedCpuPlacement is enabled, I can't seem to get a VM with more than 1 GPU to start.
https://github.com/kubevirt/kubevirt/issues/12565#issuecomment-2327756761 is a valid workaround.
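For reference, assuming the linked workaround is the cluster-wide `additionalGuestMemoryOverheadRatio` setting in the KubeVirt CR (a hedged guess; check the linked comment for the exact steps), it could be applied like this:

```shell
# Multiply KubeVirt's computed memory overhead for every VM by a ratio.
# The value "2" is an illustrative choice, not a recommendation; this is
# a cluster-wide setting and affects all VMs.
kubectl patch kubevirt kubevirt -n kubevirt --type merge \
  -p '{"spec":{"configuration":{"additionalGuestMemoryOverheadRatio":"2"}}}'
```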
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
The VM does not start if you use more than 128Gi RAM and 1/2/3/N GPUs. Everything works correctly at 128Gi.
What you expected to happen:
The VM works correctly with GPUs and RAM larger than 128Gi.
How to reproduce it (as minimally and precisely as possible): Create a VM with 1 GPU and 129Gi RAM, or with 2/3/4 GPUs and 128Gi RAM.
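The reproduction step above can be sketched as a minimal VMI manifest. The GPU `deviceName` and the container disk image are placeholders, not values from this report; substitute the device-plugin resource name actually exposed on the node.

```shell
# Minimal VMI with one passthrough GPU and 129Gi of RAM.
# "nvidia.com/SOME_GPU" is a placeholder resource name.
kubectl apply -f - <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: gpu-oom-repro
spec:
  domain:
    devices:
      gpus:
        - name: gpu1
          deviceName: nvidia.com/SOME_GPU
      disks:
        - name: containerdisk
          disk:
            bus: virtio
    resources:
      requests:
        memory: 129Gi
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/kubevirt/cirros-container-disk-demo
EOF
```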
Anything else we need to know?: logs:
Environment:
- KubeVirt version (virtctl version): 0.47.1
- Kubernetes version (kubectl version): 1.19.7
- OS (uname -a): Linux node5 5.13.0-28-generic #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux