virtio-win / kvm-guest-drivers-windows

Windows paravirtualized drivers for QEMU\KVM
https://www.linux-kvm.org/page/WindowsGuestDrivers
BSD 3-Clause "New" or "Revised" License

Memballoon - possible cause for slow shutdown. "deflation" issue #1148

Open DaLiV opened 1 month ago

DaLiV commented 1 month ago

Describe the bug

Shutdown of the guest OS is slow (at or after shutdown) when <currentMemory> is less than <memory>.

The problem does not occur (shutdown of the VM takes a few seconds) in these cases:

  1. <currentMemory> is equal to <memory>
  2. the memballoon driver is disabled in Windows
  3. memballoon is disabled in libvirt with "model=none"
  4. prior to shutdown, a "full memory allocation" is done for the VM: virsh setmem VMName --live --size 32G. Sure, that operation also takes time, approx. 90 sec, but a direct shutdown without it takes 180 sec, so it takes half as long. The shutdown done afterwards completes in under 5 seconds.

It seems that memballoon allocates memory at shutdown while the deallocation processes are running.
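The "full memory allocation before shutdown" workaround described above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the domain name `VMName` and the 32 GiB size are taken from the report, and it assumes `virsh` is on the PATH. It does not wait for the allocation to finish (the report puts that at about 90 sec), so a real script would probably need to poll before shutting down.

```python
import subprocess

KIB_PER_GIB = 1024 * 1024


def setmem_command(domain: str, size_gib: int) -> list[str]:
    """Build the `virsh setmem` call that raises the live allocation."""
    # A bare number is passed in KiB, matching the units in the domain XML.
    return ["virsh", "setmem", domain, "--live", "--size",
            str(size_gib * KIB_PER_GIB)]


def shutdown_command(domain: str) -> list[str]:
    """Build the `virsh shutdown` call for the same domain."""
    return ["virsh", "shutdown", domain]


def deflate_then_shutdown(domain: str, max_gib: int, run=subprocess.run) -> None:
    """Give the guest its full memory first, then request shutdown.

    `run` is injectable so the sequence can be checked without a hypervisor.
    No waiting happens between the two calls (see the note above).
    """
    run(setmem_command(domain, max_gib), check=True)
    run(shutdown_command(domain), check=True)
```

Calling `deflate_then_shutdown("VMName", 32)` would issue the same `setmem` request as in the report (expressed in KiB), followed by a normal shutdown request.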

To Reproduce

  1. set in vm.xml ...
    <domain type='kvm'>
    <memory unit='KiB'>33554432</memory>
    <currentMemory unit='KiB'>4194304</currentMemory>
    <memoryBacking>
    <source type='memfd'/>
    <access mode='shared'/>
    </memoryBacking>
  2. start VM
  3. Stop VM
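To make the sizes in step 1 easier to read, the two values can be converted from KiB to GiB. The snippet below is a sketch that uses a cut-down copy of the fragment; the closing `</domain>` tag is added here only so that it parses, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the fragment in step 1 (closing tag added for parsing).
DOMAIN_XML = """
<domain type='kvm'>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
</domain>
"""


def memory_gib(xml_text: str) -> tuple[float, float]:
    """Return (memory, currentMemory) in GiB; assumes unit='KiB' on both."""
    root = ET.fromstring(xml_text)
    max_kib = int(root.findtext("memory"))
    cur_kib = int(root.findtext("currentMemory"))
    return max_kib / 1024**2, cur_kib / 1024**2


max_gib, cur_gib = memory_gib(DOMAIN_XML)
print(f"max={max_gib:g} GiB, current={cur_gib:g} GiB, gap={max_gib - cur_gib:g} GiB")
# prints: max=32 GiB, current=4 GiB, gap=28 GiB
```

So the VM starts with 4 GiB of a possible 32 GiB, leaving a 28 GiB gap between the current and the maximum allocation.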

Expected behavior

Shutdown must not take a long time.

Host:


**VM:**
 - Windows 11
 - memballoon
 - 100.95.104.26200 / virtio-win-0.1.262.iso

**Additional context**
The following can be monitored during shutdown:
watch -n 1 "virsh dommemstat VMName"
"rss" grows there up to MaxMem, but very slowly.
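
For logging the growth instead of watching it, the `virsh dommemstat` output could be parsed into numbers. A minimal sketch, assuming the output is one "name value" pair per line with values in KiB; the function name is hypothetical and the sample text is made up for illustration, not taken from this report.

```python
def parse_dommemstat(text: str) -> dict[str, int]:
    """Parse "name value" lines into a dict, skipping anything else."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


# Made-up sample output, for illustration only.
SAMPLE = """actual 4194304
swap_in 0
rss 2097152
"""

stats = parse_dommemstat(SAMPLE)
print(f"rss = {stats['rss'] / 1024**2:g} GiB")  # prints: rss = 2 GiB
```

Feeding it the real output once per second during shutdown would give a series of "rss" values whose slope shows how slowly the memory is being touched.
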

YanVugenfirer commented 1 month ago

> prior to shutdown do "full memory allocation" to VM

Do you mean to inflate the balloon to the full memory of the VM? This is not a recommended action in any case; it might lead to system failure.

DaLiV commented 1 month ago

4 - I showed how that was done: virsh setmem VMName --live --size 32G. That allocates dynamically from the partial usage of 4 GB up to the full 32 GB defined for this VM. It is a standard, defined command with standard "not recommended" behaviour. That test was done primarily to understand more clearly where the fault might lie. All of the ways 1/2/3/4 point in the same direction: dynamic memory plus its ballooning. The long shutdown time is the first symptom. The timing from (4) leads me to think that half of this time is spent on the same "full memory allocation" running in parallel with "deallocation" at shutdown. Yes, that can prevent OOMs, but at the cost of time, which is multiplied across many VMs ...

A simple example, a host upgrade: you have 10 such VMs running with dynamic allocation and need to shut down all of them.
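
As a rough illustration of how the time adds up, the reported figures can be multiplied out. This sketch assumes the 10 VMs are shut down one after another and that each behaves like the VM in the report (180 sec direct; about 90 sec for setmem plus an upper bound of 5 sec for the shutdown afterwards). The function name is hypothetical.

```python
def total_seconds(vm_count: int, per_vm_seconds: int) -> int:
    """Total time if the VMs are shut down sequentially (an assumption)."""
    return vm_count * per_vm_seconds


DIRECT = 180         # direct shutdown, from the report
WORKAROUND = 90 + 5  # setmem (approx. 90 sec) + shutdown (under 5 sec)

print(total_seconds(10, DIRECT))      # prints: 1800
print(total_seconds(10, WORKAROUND))  # prints: 950
```

Under these assumptions the host upgrade waits about 30 minutes for direct shutdowns versus about 16 minutes with the workaround.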

P.S. Swap usage = 0; nothing goes there, in case someone says "you are swapping at this time", which is not "recommended".

P.P.S. Even if the "slow dynamic memory allocation" could be improved from 90 sec to about 5 sec, that would also be usable (then the "shutdown-crutch-script" could be used as a permanent "solution", as it has a "VM-Importance-order").

P.P.P.S. In the case of dynamic underprovisioning to 2 GB, CPU usage is also constantly high. That is an additional point in this subsystem (but possibly also related to the same part of the code).

xiagao commented 1 month ago

@DaLiV The Win11 guest indeed has this problem with the balloon device; it took almost twice as long as the WS2022 guest. There is already a Jira issue recorded internally. If there is any update, we'll post it here.