kimchi-project / gingerbase

Gingerbase: basic host management for WoK

SMT : 'GINSMT0010E: Error occurred in fetching smt status' after I disable a vCPU by command #156

Closed mesmriti closed 7 years ago

mesmriti commented 7 years ago

=== Problem Description ===================================

===========================================================

I disabled a vCPU on S231KP11, which was in SMT2 mode before. When I refresh the HVM browser, the error 'GINSMT0010E: Error occurred in fetching smt status' appears. When I open the SMT edit panel, 'Current SMT Settings' is unable to fetch the data and 'Persisted SMT Settings' has changed to SMT1 automatically. So I think there are two problems here:

Problem 1:

  1. 'Current SMT Settings' is unable to grab the data

Before I disable one vCPU of the system:

```
[root@s231kp11 ~]# lscpu
Architecture:          s390x
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s) per book:    3
Book(s):               2
NUMA node(s):          1
Vendor ID:             IBM/S390
BogoMIPS:              17006.00
Hypervisor:            PR/SM
Hypervisor vendor:     IBM
Virtualization type:   full
Dispatching mode:      horizontal
L1d cache:             128K
L1i cache:             96K
L2d cache:             2048K
L2i cache:             2048K
NUMA node0 CPU(s):     0-39
```

Disable one vCPU:

```
[root@s231kp11 ~]# chcpu -d 7
CPU 7 disabled
```

```
[root@s231kp11 ~]# lscpu
Architecture:          s390x
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                8
On-line CPU(s) list:   0-6
Off-line CPU(s) list:  7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s) per book:    3
Book(s):               2
NUMA node(s):          1
Vendor ID:             IBM/S390
BogoMIPS:              17006.00
Hypervisor:            PR/SM
Hypervisor vendor:     IBM
Virtualization type:   full
Dispatching mode:      horizontal
L1d cache:             128K
L1i cache:             96K
L2d cache:             2048K
L2i cache:             2048K
NUMA node0 CPU(s):     0-39
```

You can see that 'Thread(s) per core' changes from 2 to 1 inside the OS. A defect has already been opened against the Zeus BaseOS image for this problem. Here is the link:

https://bugzilla.linux.ibm.com/show_bug.cgi?id=146672

I raise this question here for two purposes: a. To verify it once the BaseOS image problem is fixed. b. What should the HVM do while BaseOS has not been fixed yet? Leave it blank, as HVM currently does? Or change to SMT1, as the internal OS does?

Problem 2:

  1. For 'Persisted SMT Settings', why does it change to SMT1? The original value is SMT2. For the value of 'Persisted SMT Settings', I guess HVM reads the settings file data on the OS. But even though I have disabled one vCPU, the SMT setting inside /etc/zipl.conf has not changed; it still shows SMT2 mode. I don't know whether you read the persisted data from this file or not. I do know that when changes are made in this file and the system is rebooted, the system changes SMT mode.

```
[root@s231kp11 ~]# cat /etc/zipl.conf
[defaultboot]
default=4.4.0-45.66.el7_2.kvmibm1_1_3.1.s390x
target=/boot
[4.4.0-45.66.el7_2.kvmibm1_1_3.1.s390x]
image=/boot/vmlinuz-4.4.0-45.66.el7_2.kvmibm1_1_3.1.s390x
parameters="rd.zfcp=0.0.9000,0x5001738030bb0151,0x0001000000000000 rd.lvm.lv=s231kp11-zkvmvg/root root=/dev/mapper/s231kp11--zkvmvg-root vconsole.keymap=us elevator=deadline zfcp.no_auto_port_rescan=0 pci=on zfcp.allow_lun_scan=1 LANG=en_US.utf8 rd.zfcp=0.0.9000,0x5001738030bb0141,0x0001000000000000 vconsole.font=latarcyrheb-sun16 crashkernel=512M smt=2"
ramdisk=/boot/initramfs-4.4.0-45.66.el7_2.kvmibm1_1_3.1.s390x.img
[4.4.0-40.60.el7_2.kvmibm1_1_3.2.s390x]
image=/boot/vmlinuz-4.4.0-40.60.el7_2.kvmibm1_1_3.2.s390x
parameters="rd.zfcp=0.0.9000,0x5001738030bb0151,0x0001000000000000 rd.lvm.lv=s231kp11-zkvmvg/root root=/dev/mapper/s231kp11--zkvmvg-root vconsole.keymap=us elevator=deadline zfcp.no_auto_port_rescan=0 pci=on zfcp.allow_lun_scan=1 LANG=en_US.utf8 rd.zfcp=0.0.9000,0x5001738030bb0141,0x0001000000000000 vconsole.font=latarcyrheb-sun16 crashkernel=512M smt=2"
ramdisk=/boot/initramfs-4.4.0-40.60.el7_2.kvmibm1_1_3.2.s390x.img
[4.4.0-40.60.el7_2.kvmibm1_1_3.1.s390x]
image=/boot/vmlinuz-4.4.0-40.60.el7_2.kvmibm1_1_3.1.s390x
parameters="rd.zfcp=0.0.9000,0x5001738030bb0151,0x0001000000000000 rd.lvm.lv=s231kp11-zkvmvg/root root=/dev/mapper/s231kp11--zkvmvg-root vconsole.keymap=us elevator=deadline zfcp.no_auto_port_rescan=0 pci=on zfcp.allow_lun_scan=1 LANG=en_US.utf8 rd.zfcp=0.0.9000,0x5001738030bb0141,0x0001000000000000 vconsole.font=latarcyrheb-sun16 crashkernel=512M smt=2"
ramdisk=/boot/initramfs-4.4.0-40.60.el7_2.kvmibm1_1_3.1.s390x.img
[3.10.0-229.7.2.el7_1.kvmibm1_1_1.20.s390x]
image=/boot/vmlinuz-3.10.0-229.7.2.el7_1.kvmibm1_1_1.20.s390x
parameters="rd.zfcp=0.0.9000,0x5001738030bb0151,0x0001000000000000 rd.lvm.lv=s231kp11-zkvmvg/root root=/dev/mapper/s231kp11--zkvmvg-root vconsole.keymap=us elevator=deadline zfcp.no_auto_port_rescan=0 pci=on zfcp.allow_lun_scan=1 LANG=en_US.utf8 rd.zfcp=0.0.9000,0x5001738030bb0141,0x0001000000000000 vconsole.font=latarcyrheb-sun16 crashkernel=512M smt=2"
ramdisk=/boot/initramfs-3.10.0-229.7.2.el7_1.kvmibm1_1_1.20.s390x.img
[3.10.0-229.7.2.el7_1.kvmibm1_1_1.16.s390x]
image=/boot/vmlinuz-3.10.0-229.7.2.el7_1.kvmibm1_1_1.16.s390x
parameters="rd.zfcp=0.0.9000,0x5001738030bb0151,0x0001000000000000 rd.lvm.lv=s231kp11-zkvmvg/root root=/dev/mapper/s231kp11--zkvmvg-root vconsole.keymap=us elevator=deadline zfcp.no_auto_port_rescan=0 pci=on zfcp.allow_lun_scan=1 LANG=en_US.utf8 rd.zfcp=0.0.9000,0x5001738030bb0141,0x0001000000000000 vconsole.font=latarcyrheb-sun16 crashkernel=512M smt=2"
ramdisk=/boot/initramfs-3.10.0-229.7.2.el7_1.kvmibm1_1_1.16.s390x.img
[root@s231kp11 ~]#
```

You can see 'smt=2', but the HVM 'Persisted SMT Settings' has already changed to SMT1. However, when I restart S231KP11, the HVM persisted setting changes back to SMT2, not the SMT1 it showed before the restart.
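For what it's worth, here is a minimal sketch of how a persisted value could be read from /etc/zipl.conf, assuming the smt= kernel parameter is the source of truth; this is only an illustration, not necessarily how gingerbase computes 'Persisted SMT Settings'.

```python
import re

def persisted_smt(path="/etc/zipl.conf"):
    """Return the smt=<n> value found in the kernel parameters, or None.

    Illustration only: scans every parameters="..." line for smt=<n>.
    """
    with open(path) as conf:
        values = re.findall(r"\bsmt=(\d+)\b", conf.read())
    return int(values[0]) if values else None

# With the zipl.conf shown above this returns 2, i.e. the persisted
# setting is still SMT2 even after 'chcpu -d 7'.
```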

mesmriti commented 7 years ago

Tested the scenario mentioned and concluded that this is a very exceptional case. Since, from the backend point of view, we also check the threads-per-core value when determining the SMT status, this particular scenario throws an exception: the threads-per-core value is 1 while the output of /proc/cmdline still contains smt=2.
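To make the failure condition concrete, here is a minimal sketch of how the two conflicting values could be read; the helper names and parsing details are assumptions for illustration, not the actual gingerbase code.

```python
import re
import subprocess

def threads_per_core():
    """Parse 'Thread(s) per core' from lscpu output (drops to 1 after 'chcpu -d 7')."""
    out = subprocess.check_output(["lscpu"]).decode()
    match = re.search(r"Thread\(s\) per core:\s*(\d+)", out)
    return int(match.group(1)) if match else None

def cmdline_smt():
    """Parse the smt=<n> kernel parameter from /proc/cmdline (still smt=2 here)."""
    with open("/proc/cmdline") as f:
        match = re.search(r"\bsmt=(\d+)\b", f.read())
    return int(match.group(1)) if match else None

# In the reported scenario the two sources disagree:
# threads_per_core() == 1 while cmdline_smt() == 2,
# which is the combination that raises GINSMT0010E.
```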

As part of the fix, we can remove the threads-per-core check when assigning the SMT status and take the status from the /proc/cmdline output instead.
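A minimal sketch of that direction, assuming the status is derived from /proc/cmdline alone; the returned strings are illustrative, not the exact values gingerbase reports.

```python
import re

def smt_status():
    """Determine the SMT status from /proc/cmdline alone, ignoring the
    current threads-per-core value that changes when a vCPU goes offline.

    Illustrative only: the real status strings used by gingerbase may differ.
    """
    with open("/proc/cmdline") as f:
        match = re.search(r"\bsmt=(\d+)\b", f.read())
    return "SMT{}".format(match.group(1)) if match else "unavailable"
```

With this approach, taking a vCPU offline with chcpu no longer affects the reported status, since only the persisted kernel parameter is consulted.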