Closed lxwinspur closed 1 year ago
The BMC dump file:
@anoo1 @geissonator @mzipse FYI
The phenomenon on the GUI is as follows:
Server power operations: The Host status is Power On, but the host is not actually powered on, and the Power icon on the bar is gray
Firmware: Refresh FW not allowed
Inventory and LEDs: NULL
Sensors: NULL
Hardware deconfiguration: NULL
PCIe hardware topology: NULL
@mzipse
But these problems disappear after a few more AC cycles.
This is a very serious problem, please ask IBM experts to take a look, thanks.
@SunnySrivastava1984 was looking at this issue but didn't find the root cause in the original dumps. Tagging him to see if more data is present in the new dump that was uploaded.
Analyzed the dumps and logs. A few things that can be tried and re-checked:
a) The image being used is a "dirty" image, so it is difficult to find out what changed in the image. Please use a released image. I could find many services dumping in the attached logs.
b) The PEL which says "Input power was lost while the system was powered on."
a) The image being used is a "dirty" image, so it is difficult to find out what changed in the image. Please use a released image. I could find many services dumping in the attached logs.
We used the image from the ips branch that IBM released to us, and the test results are the same. I only added the following patch to the ips branch: https://github.com/ibm-openbmc/phosphor-bmc-code-mgmt/commit/d98d4a8a7d291287815e0bd8584dd310f1e7ed84
b) The PEL which says "Input power was lost while the system was powered on."
Sure, I will tell our test team. However, I learned from them that they used the same script to test IBM's FW, and the power loss problem also occurred there, but it did not affect the boot of the Host, and the abnormal information recovered after host power on.
but it did not affect the boot of the Host, and the abnormal information recovered after host power on.
OK, please let me know if your team hits the issue with the modified script as well. It would be really helpful if you could generate a dump specifically at the time of failure.
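One way to capture a dump at the time of failure is to have the test loop stop at the first failed power-on instead of continuing with further AC cycles. The following is only a minimal sketch: `run_until_failure` and its three callbacks (`ac_cycle`, `host_powered_on`, `collect_dump`) are hypothetical names, and how each one is implemented depends on the lab setup. The callbacks are stubbed here so the loop logic can be run on its own.

```python
from typing import Callable, Optional


def run_until_failure(
    ac_cycle: Callable[[], None],
    host_powered_on: Callable[[], bool],
    collect_dump: Callable[[], None],
    max_cycles: int = 100,
) -> Optional[int]:
    """Run AC cycles and stop at the first one where the host fails to power on.

    All three callbacks are hypothetical hooks supplied by the test team.
    Returns the 1-based number of the failing cycle, or None if every cycle passed.
    """
    for attempt in range(1, max_cycles + 1):
        ac_cycle()
        if not host_powered_on():
            # Stop immediately so the BMC is left in the failing state,
            # then collect the dump before anything else touches the system.
            collect_dump()
            return attempt
    return None


# Stubbed demonstration: the host powers on 19 times, then fails on the 20th cycle.
results = iter([True] * 19 + [False])
dumps = []
failed_at = run_until_failure(
    ac_cycle=lambda: None,
    host_powered_on=lambda: next(results),
    collect_dump=lambda: dumps.append("dump"),
)
print(failed_at)
```

Because the loop returns as soon as the check fails, no further AC cycle is issued, so the journal and dump reflect the failing boot rather than a later one.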
Sure, thanks @SunnySrivastava1984. Our test team has modified the script and is testing it; so far everything works fine.
@SunnySrivastava1984 Our test team has modified the script. Unfortunately, this error reappeared on the 20th test. Attached is the error log: AC_log_2023.02.17.tar.gz
Dump analysis. From the attached event log file in the dump (AC_log_2023.02.17/event_logs_2023-02-17_08-51-11.txt), it appears that VPD Manager and IBM panel service crashed on 16th Feb.
"Private Header": {
    "Section Version": "1",
    "Sub-section type": "0",
    "Created by": "0x3400",
    "Created at": "02/16/2023 18:15:19",
    "Committed at": "02/16/2023 18:15:19",
    "Creator Subsystem": "BMC",
    "CSSVER": "",
    "Platform Log Id": "0x500074F7",
    "Entry Id": "0x500074F7",
    "BMC Event Log Id": "1499"
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "SYSTEMD_RESULT": "failed",
    "SYSTEMD_UNIT": "com.ibm.VPD.Manager.service"
However, "AC_log_2023.02.17/BMCDUMP.0000000.00000245.20230217084714/journal-pretty.log" only has entries starting from 17th Feb. Below is the first entry in the journal:
{
    "SYSLOG_PID" : "14968",
    "_SYSTEMD_UNIT" : "dropbear@3030-192.168.1.65:22-192.168.1.111:43094.service",
    "__REALTIME_TIMESTAMP" : "1676623635010323",
    "PRIORITY" : "4",
    "MESSAGE" : "pam_ibmacf(dropbear:auth): ACF service auth failed 0x6: SerialNumberMismatch (serial=UNSET, sRc1=0x6, sRc2=0xffffffff, sRc3=0xFFFFFFFF, sRc4=0xFFFFFFFF)",
    "_SOURCE_REALTIME_TIMESTAMP" : "1676623635010232",
    "_GID" : "0",
    "_UID" : "0",
    "_EXE" : "/usr/sbin/dropbearmulti",
    "_MACHINE_ID" : "7623891df88f4612862fc6c351a09980",
    "_SYSTEMD_CGROUP" : "/system.slice/system-dropbear.slice/dropbear@3030-192.168.1.65:22-192.168.1.111:43094.service",
    "SYSLOG_IDENTIFIER" : "dropbear",
    "SYSLOG_FACILITY" : "10",
    "_SYSTEMD_INVOCATION_ID" : "79585eae9b78494db980eaaa0a6ed18b",
    "MONOTONIC_TIMESTAMP" : "52376949600",
    "_CAP_EFFECTIVE" : "1ffffffffff",
    "_PID" : "14968",
    "SYSLOG_TIMESTAMP" : "Feb 17 08:47:14 ",
    "_SYSTEMD_SLICE" : "system-dropbear.slice",
    "_TRANSPORT" : "syslog",
    "_COMM" : "dropbear",
    "_HOSTNAME" : "p10bmc",
    "_CMDLINE" : "/usr/sbin/dropbear -i -r /var/lib/dropbear/dropbear_rsa_host_key -B -G shellaccess -I 3600",
    "CURSOR" : "s=f2fbdefe184d4736b86cab87e72f1e60;i=d4e1;b=eaa1a929b1b34600a7f93f2e516863ad;m=c31e8d360;t=5f4e15c2fd313;x=65e309375b1e8ad9",
    "_BOOT_ID" : "eaa1a929b1b34600a7f93f2e516863ad"
}
Hence the journal contains no entry for the VPD Manager service or the IBM panel service. Please modify your script so that it stops when it encounters the error and the BMC state is quiesced. We need the dump to be taken in that scenario.
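The gap between the two logs can be checked directly from the timestamps quoted in this comment. This is only a sketch: `journal_time` and `pel_time` are helper names made up here, and it assumes the PEL "Created at" value and the journal timestamp refer to the same clock (treated as UTC). The journal's `__REALTIME_TIMESTAMP` field is in microseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def journal_time(realtime_us: str) -> datetime:
    """Convert a journal __REALTIME_TIMESTAMP (microseconds since the epoch) to UTC."""
    return EPOCH + timedelta(microseconds=int(realtime_us))


def pel_time(created_at: str) -> datetime:
    """Parse a PEL "Created at" string; treating it as UTC is an assumption."""
    parsed = datetime.strptime(created_at, "%m/%d/%Y %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Values copied from the PEL private header and the first journal entry.
pel_created = pel_time("02/16/2023 18:15:19")
first_journal_entry = journal_time("1676623635010323")

# The journal can only show the service failure if it starts at or before the PEL.
journal_covers_failure = first_journal_entry <= pel_created

print(first_journal_entry.isoformat())
print(journal_covers_failure)
```

Under these assumptions the first journal entry falls on 17th Feb at 08:47:15, more than 14 hours after the PEL was created, which is why the service failures are not visible in this journal.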
After updating to the following fix, the problem is solved:
https://github.com/ibm-openbmc/openbmc/commit/2a0c1837053f01c748d838b72185073dd75baf07
Pre-condition:
AC Cycle steps:
After about ten cycles, the host fails to power on at the sixth step. The following is the event log: