Closed by sathnaga 6 years ago
ERROR 107-guest_sanity.hotplug.memory.qemu.qcow2.virtio_scsi.smp2.virtio_net.CentOS.7.4.ppc64le.powerkvm-libvirt.libvirt_mem.positive_test.hot_plug -> LoginTimeoutError: Login timeout expired
@sathnaga I wasn't able to hit this issue with the following setup:
Host:
cpu: Power8
qemu-kvm: 2.10.0.3.dev.gitbf0fd83.el7.centos
kernel: 4.13.0-4.dev.git49564cb.el7.centos.ppc64le
libvirt: 3.6.0.3.dev.gitdd9401b.el7.centos
Guest:
Ubuntu 17.04, kernel 4.10.0-28
The guest is restored to the same point it was at before firing virsh managedsave, including the hotplugged LMBs.
Can you send the logs from the VM (/var/log/libvirt/qemu/
@danielhb I tried with the latest levels again and am able to hit the issue.
Host: 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le qemu-2.11.50-1.dev.gita815ffa.el7.centos.ppc64le libvirt-4.0.0-2.dev.git5e6f8a1.el7.centos.ppc64le
Guest: 3.10.0-693.5.2.el7.ppc64le (CentOS 7.4) and 4.15.0-3.dev.gitd34a158.el7.centos.ppc64le (Host OS)
2018-02-08 07:23:39.147+0000: starting up libvirt version: 4.0.0, package: 2.dev.git5e6f8a1.el7.centos (Unknown, 2018-01-21-15:28:09, baratheon), qemu version: 2.11.50, hostname: ltc-test-ci2.aus.stglabs.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name guest=vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-18-vm1/master-key.aes -machine pseries-2.12,accel=kvm,usb=off,dump-guest-core=off -m size=33554432k,slots=16,maxmem=67108864k -realtime mlock=off -smp 32,sockets=32,cores=1,threads=1 -numa node,nodeid=0,cpus=0-31,mem=32768 -uuid 3e8d0ad8-39d0-408f-ad7f-0ea5c6f31012 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-18-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -drive file=/home/sath/hostos-ppc64le.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b3:b4:b5,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
2018-02-08T07:23:39.273386Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2018-02-08 07:25:47.520+0000: shutting down, reason=saved
2018-02-08T07:25:47.524256Z qemu-system-ppc64: terminating on signal 15 from pid 46977 (<unknown process>)
2018-02-08 07:25:56.185+0000: starting up libvirt version: 4.0.0, package: 2.dev.git5e6f8a1.el7.centos (Unknown, 2018-01-21-15:28:09, baratheon), qemu version: 2.11.50, hostname: ltc-test-ci2.aus.stglabs.ibm.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name guest=vm1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-19-vm1/master-key.aes -machine pseries-2.12,accel=kvm,usb=off,dump-guest-core=off -m size=33554432k,slots=16,maxmem=67108864k -realtime mlock=off -smp 32,sockets=32,cores=1,threads=1 -numa node,nodeid=0,cpus=0-31,mem=32768 -object memory-backend-ram,id=memdimm0,size=8589934592 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0,addr=34359738368 -uuid 3e8d0ad8-39d0-408f-ad7f-0ea5c6f31012 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-19-vm1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 -drive file=/home/sath/hostos-ppc64le.qcow2,format=qcow2,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b3:b4:b5,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
2018-02-08T07:25:56.316388Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
(process:99500): GLib-CRITICAL **: g_source_unref: assertion 'source != NULL' failed
(the GLib-CRITICAL message above is repeated 16 more times)
------- Comment From danielhb@br.ibm.com 2018-02-08 10:04:39 EDT------- I am still not able to reproduce this bug, although my environment is a bit different:
Host: Pegas 1.1 running P9 DD2.0
Guest: Pegas 1.1 kernel 4.14.0-37.el7a.ppc64le
Guest XML:
<domain type='kvm'>
  <name>dhb-memhotplug</name>
  <uuid>7fda2bdb-a32b-4f84-9b89-868774195fc1</uuid>
  <maxMemory slots='16' unit='KiB'>16777216</maxMemory>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64le' machine='pseries-2.12'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/local/bin/qemu-system-ppc64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/home/danielhb/vm_imgs/pegas1.1.qcow2'/>
      <target dev='sda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <model name='spapr-pci-host-bridge'/>
      <target index='0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:84:85:86'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='spapr-vio-serial' port='0'>
        <model name='spapr-vty'/>
      </target>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
  <seclabel type='none' model='selinux'/>
  <seclabel type='none' model='dac'/>
</domain>
This is the output right after the VM is started:
localhost login: root
Password:
Last login: Thu Feb 8 07:17:59 on hvc0
[root@localhost ~]# grep Mem /proc/meminfo
MemTotal: 3633408 kB
MemFree: 3180928 kB
MemAvailable: 3014016 kB
[root@localhost ~]#
[ 71.728923] pseries-hotplug-mem: Attempting to hot-add 16 LMB(s) at index 80000010
[ 71.765142] radix-mmu: Mapped 0xc000000100000000-0xc000000110000000 with 2.00 MiB pages
[ 71.767171] Built 1 zonelists, mobility grouping on. Total pages: 56655
[ 71.767242] Policy zone: Normal
[ 71.767477] radix-mmu: Mapped 0xc000000110000000-0xc000000120000000 with 2.00 MiB pages
[ 71.770559] radix-mmu: Mapped 0xc000000120000000-0xc000000130000000 with 2.00 MiB pages
[ 71.772844] radix-mmu: Mapped 0xc000000130000000-0xc000000140000000 with 2.00 MiB pages
[ 71.775209] radix-mmu: Mapped 0xc000000140000000-0xc000000150000000 with 2.00 MiB pages
[ 71.777407] radix-mmu: Mapped 0xc000000150000000-0xc000000160000000 with 2.00 MiB pages
[ 71.779665] radix-mmu: Mapped 0xc000000160000000-0xc000000170000000 with 2.00 MiB pages
[ 71.781897] radix-mmu: Mapped 0xc000000170000000-0xc000000180000000 with 2.00 MiB pages
[ 71.784211] radix-mmu: Mapped 0xc000000180000000-0xc000000190000000 with 2.00 MiB pages
[ 71.786375] radix-mmu: Mapped 0xc000000190000000-0xc0000001a0000000 with 2.00 MiB pages
[ 71.788558] radix-mmu: Mapped 0xc0000001a0000000-0xc0000001b0000000 with 2.00 MiB pages
[ 71.791183] radix-mmu: Mapped 0xc0000001b0000000-0xc0000001c0000000 with 2.00 MiB pages
[ 71.794673] radix-mmu: Mapped 0xc0000001c0000000-0xc0000001d0000000 with 2.00 MiB pages
[ 71.797431] radix-mmu: Mapped 0xc0000001d0000000-0xc0000001e0000000 with 2.00 MiB pages
[ 71.799781] radix-mmu: Mapped 0xc0000001e0000000-0xc0000001f0000000 with 2.00 MiB pages
[ 71.803004] radix-mmu: Mapped 0xc0000001f0000000-0xc000000200000000 with 2.00 MiB pages
[ 71.805369] pseries-hotplug-mem: Memory at 100000000 (drc index 80000010) was hot-added
[ 71.805502] pseries-hotplug-mem: Memory at 110000000 (drc index 80000011) was hot-added
[ 71.805631] pseries-hotplug-mem: Memory at 120000000 (drc index 80000012) was hot-added
[ 71.805769] pseries-hotplug-mem: Memory at 130000000 (drc index 80000013) was hot-added
[ 71.805907] pseries-hotplug-mem: Memory at 140000000 (drc index 80000014) was hot-added
[ 71.806038] pseries-hotplug-mem: Memory at 150000000 (drc index 80000015) was hot-added
[ 71.806186] pseries-hotplug-mem: Memory at 160000000 (drc index 80000016) was hot-added
[ 71.806324] pseries-hotplug-mem: Memory at 170000000 (drc index 80000017) was hot-added
[ 71.806451] pseries-hotplug-mem: Memory at 180000000 (drc index 80000018) was hot-added
[ 71.806585] pseries-hotplug-mem: Memory at 190000000 (drc index 80000019) was hot-added
[ 71.806716] pseries-hotplug-mem: Memory at 1a0000000 (drc index 8000001a) was hot-added
[ 71.806847] pseries-hotplug-mem: Memory at 1b0000000 (drc index 8000001b) was hot-added
[ 71.806970] pseries-hotplug-mem: Memory at 1c0000000 (drc index 8000001c) was hot-added
[ 71.807097] pseries-hotplug-mem: Memory at 1d0000000 (drc index 8000001d) was hot-added
[ 71.807226] pseries-hotplug-mem: Memory at 1e0000000 (drc index 8000001e) was hot-added
[ 71.807353] pseries-hotplug-mem: Memory at 1f0000000 (drc index 8000001f) was hot-added
$ sudo virsh managedsave dhb-memhotplug
Domain dhb-memhotplug state saved by libvirt
$ sudo virsh start --console dhb-memhotplug
Domain dhb-memhotplug started
Connected to domain dhb-memhotplug
Escape character is ^]
[root@localhost ~]# grep Mem /proc/meminfo
MemTotal: 7827712 kB
MemFree: 7363520 kB
MemAvailable: 7497984 kB
[root@localhost ~]#
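As an aside, the hot-add messages in the console log above follow a regular pattern: 16 LMBs of 256 MiB each, starting at guest physical address 0x100000000, with DRC indexes counting up from 0x80000010. A quick Python sketch (illustrative only; all constants are read off the log itself) reproduces the (address, DRC index) pairs:

```python
# Illustrative sketch: reproduce the (address, drc index) pairs from the
# pseries-hotplug-mem log above. Constants are taken from the log itself.
LMB_SIZE = 0x10000000  # 256 MiB per LMB, the stride between mapped ranges

def lmb_layout(base=0x100000000, first_drc=0x80000010, count=16):
    """Return the guest physical address and DRC index of each hot-added LMB."""
    return [(base + i * LMB_SIZE, first_drc + i) for i in range(count)]

for addr, drc in lmb_layout():
    print(f"Memory at {addr:x} (drc index {drc:x}) was hot-added")
```

Note that 16 LMBs of 256 MiB add up to 4 GiB, which matches MemTotal growing by 4194304 kB (3633408 kB to 7827712 kB) across the managedsave/start cycle.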
Tried about 10 times in the hope that the problem might be intermittent. Still no luck reproducing it - the VM resumes correctly even after doing a hotplug.
I'll try to simulate with larger amounts of memory being hotplugged to see if there is a difference. I'll also try with a Host OS guest like the one used in the bug.
------- Comment From danielhb@br.ibm.com 2018-02-08 14:37:40 EDT------- Thanks for lending the env, Satheesh.
I was able to reproduce the error in Satheesh's env and also in another Power8E environment I have access to (also running Host OS).
I am not too familiar with how virsh managedsave works - at first I thought it was a savevm call from QEMU, but savevm takes much longer than managedsave to execute. I'll see how I can reproduce the exact behavior of virsh managedsave using QEMU only.
------- Comment From danielhb@br.ibm.com 2018-02-13 12:44:21 EDT------- This problem occurs when using HPT guests with newer kernels that have the HPT resize feature. This feature allows the guest kernel to resize its HPT based on the new memory configuration; the resize can happen when hotplugging or unplugging memory.
The problem is on the QEMU side. QEMU isn't setting the new HTAB size properly after the guest kernel resizes it. This does not affect the guest immediately, but any loadvm operation (like the one described here, or even a migration) will trigger the bug, making the guest unresponsive.
This impacts all guests that use HPT (i.e. everyone but P9 Radix).
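For intuition, here is a toy Python model of the failure mode just described. This is NOT actual QEMU code - the class and field names (ToyGuest, cached_shift, etc.) are invented for illustration: the guest resizes its HPT, the hypervisor's cached htab_shift goes stale, and a save done with the stale value can no longer be restored consistently, while refreshing the cache at resize-commit time (what the fix does conceptually) keeps the two in sync.

```python
# Toy model of the HPT-resize save/restore bug. Not actual QEMU code;
# all names here are invented for illustration.

class ToyGuest:
    def __init__(self, htab_shift):
        self.htab_shift = htab_shift  # guest's real HPT size (log2 of bytes)

class ToyHypervisor:
    def __init__(self, guest):
        self.guest = guest
        self.cached_shift = guest.htab_shift  # hypervisor's view of the HPT

    def resize_hpt_commit(self, new_shift, fixed):
        self.guest.htab_shift = new_shift     # the guest side always updates
        if fixed:
            self.cached_shift = new_shift     # the fix: refresh the cache too

    def save(self):
        # saved state is serialized using the (possibly stale) cached value
        return {"htab_shift": self.cached_shift}

    def restore_is_consistent(self, state):
        # a restore only works if the saved shift matches the guest's real HPT
        return state["htab_shift"] == self.guest.htab_shift

for fixed in (False, True):
    hv = ToyHypervisor(ToyGuest(htab_shift=24))
    hv.resize_hpt_commit(26, fixed=fixed)     # memory hotplug triggers a resize
    print("fixed =", fixed, "-> consistent:", hv.restore_is_consistent(hv.save()))
```

In the buggy case the saved image advertises the old HPT size, so the restored guest hangs; with the cache refreshed at commit time the saved and real sizes agree.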
I've sent a fix proposal to the QEMU mailing list.
------- Comment From danielhb@br.ibm.com 2018-02-16 11:25:39 EDT------- The fix was accepted and is already available upstream:
commit 9478956794c11239b7c1c3ef9ce95c883bb839a3
Author: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Date:   Tue Feb 13 15:37:16 2018 -0200
hw/ppc/spapr_hcall: set htab_shift after kvmppc_resize_hpt_commit
I'll move this to FAT. You can test it right now by using QEMU upstream - it is not available on Host OS yet.
And let me emphasize my comment #10: this affects all HPT guests running 4.14+ kernels. I think this is a good candidate to be backported to Pegas.
------- Comment From bssrikanth@in.ibm.com 2018-02-21 00:22:36 EDT------- Satheesh will clone this to Pegas. What's the timeline to get this on Host OS?
Tested and found working in the latest builds.
4.16.0-2.dev.gitb24758c.el7.centos.ppc64le qemu-2.11.91-1.dev.git46a0689.el7.centos.ppc64le libvirt-4.2.0-2.dev.gitbf217de.el7.centos.ppc64le
https://ltc-jenkins.aus.stglabs.ibm.com/job/runAvocadoFVTTest/1336/artifact/avocado-fvt-wrapper/results/job-2018-05-07T01.24-a1877cd/test-results/103-guest_sanity.hotplug.memory.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.CentOS.7.4.ppc64le.powerkvm-libvirt.libvirt_mem.positive_test.memory.hot.plug/debug.log
<maxMemory slots="16" unit="KiB">67108864</maxMemory><cpu><numa><cell cpus="0-31" id="0" memory="33554432" unit="KiB" /></numa></cpu></domain>
02:06:59 DEBUG| Undefine VM virt-tests-vm1
02:06:59 DEBUG| Define VM from /var/tmp/xml_utils_temp_93u5HX.xml
02:06:59 DEBUG| Starting vm 'virt-tests-vm1'
02:07:01 DEBUG| waiting for domain virt-tests-vm1 to start (0.000019 secs)
02:07:02 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:02 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:07:27 DEBUG| Sending command: cat /proc/meminfo
02:07:27 DEBUG| Memtotal on guest: 32449984
02:07:27 DEBUG| Memory device xml: <?xml version='1.0' encoding='UTF-8'?>
<memory model="dimm"><target><size unit="KiB">8388608</size><node>0</node></target></memory>
02:07:27 DEBUG| Command output:
Device attached successfully
02:07:27 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:27 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:07:29 DEBUG| Sending command: grep memory /lib/udev/rules.d/80-hotplug-cpu-mem.rules || echo 'SUBSYSTEM=="memory", ACTION=="add", TEST=="state", ATTR{state}=="offline", ATTR{state}="online"' >> /lib/udev/rules.d/80-hotplug-cpu-mem.rules
02:07:30 DEBUG| Sending command: echo $?
02:07:45 DEBUG| Sending command: cat /proc/meminfo
02:07:45 DEBUG| Sending command: cat /proc/meminfo
02:07:45 DEBUG| Memtotal on guest: 40838592
02:07:45 DEBUG| Running virsh command: managedsave --domain virt-tests-vm1
02:07:45 INFO | Running '/bin/virsh managedsave --domain virt-tests-vm1 '
02:07:50 DEBUG| [stdout]
02:07:51 INFO | Command '/bin/virsh managedsave --domain virt-tests-vm1 ' finished with 0 after 5.41704893112s
02:07:51 DEBUG| [stdout] Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| [stdout]
02:07:51 DEBUG| status: 0
02:07:51 DEBUG| stdout: Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| stderr:
02:07:51 DEBUG| Command output:
Domain virt-tests-vm1 state saved by libvirt
02:07:51 DEBUG| Starting vm 'virt-tests-vm1'
02:07:58 DEBUG| waiting for domain virt-tests-vm1 to start (0.000014 secs)
02:07:58 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:07:58 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:08:00 DEBUG| Running virsh command: save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:00 INFO | Running '/bin/virsh save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save '
02:08:06 DEBUG| [stdout]
02:08:06 INFO | Command '/bin/virsh save virt-tests-vm1 /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save ' finished with 0 after 6.07074785233s
02:08:06 DEBUG| [stdout] Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| [stdout]
02:08:06 DEBUG| status: 0
02:08:06 DEBUG| stdout: Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| stderr:
02:08:06 DEBUG| Command output:
Domain virt-tests-vm1 saved to /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 DEBUG| Running virsh command: restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:06 INFO | Running '/bin/virsh restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save '
02:08:14 DEBUG| [stdout] Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| [stdout]
02:08:14 INFO | Command '/bin/virsh restore /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save ' finished with 0 after 7.98436999321s
02:08:14 DEBUG| status: 0
02:08:14 DEBUG| stdout: Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| stderr:
02:08:14 DEBUG| Command output:
Domain restored from /var/tmp/avocado_ET4mb7/avocado_job_nNtNnP/virt-tests-vm1.save
02:08:14 DEBUG| Attempting to log into 'virt-tests-vm1' (timeout 240s)
02:08:14 DEBUG| Found/Verified IP 192.168.122.2 for VM virt-tests-vm1 NIC 0
02:08:16 INFO | Running 'true'
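For reference, the udev rule the test pushes into the guest above (the 80-hotplug-cpu-mem.rules command at 02:07:29) exists because hot-added memory blocks arrive offline: each block shows up under /sys/devices/system/memory, and its state file must be written with "online" before the guest's MemTotal grows. A minimal Python sketch of that onlining loop follows; the sysfs root is a parameter so the sketch can be exercised against a fake tree (the real path is /sys/devices/system/memory, and writing it needs root inside the guest).

```python
import os

# Sketch of what the udev rule in the log automates: walk the memory-block
# directories under a sysfs root and flip any "offline" block to "online".
# The root is parameterized so this can be tested against a fake tree.
def online_memory_blocks(sysfs_root="/sys/devices/system/memory"):
    onlined = []
    for entry in sorted(os.listdir(sysfs_root)):
        state_file = os.path.join(sysfs_root, entry, "state")
        if not os.path.isfile(state_file):
            continue  # skip non-block entries such as block_size_bytes
        with open(state_file) as f:
            state = f.read().strip()
        if state == "offline":
            with open(state_file, "w") as f:
                f.write("online")  # in a real guest this onlines the block
            onlined.append(entry)
    return onlined
```

Without this step (or the udev rule), the hotplugged DIMM is attached by QEMU but the guest kernel never uses it, which is why the test asserts on MemTotal after installing the rule.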