ibm-openbmc / openbmc

https://github.com
Other
19 stars 51 forks source link

1030.ips branch: Host Power On fails after several AC cycles #261

Closed lxwinspur closed 1 year ago

lxwinspur commented 1 year ago

Pre-condition:

  1. The server power cable is connected to a network power controller to do AC on/off through the network.
  2. Create an LPAR and install OS.
  3. Enable the option “Automatically start when the managed system is powered on” in the LPAR profile.
  4. Install the HTX tool in LPAR and execute “hcl –bootme on mode:hardf period:1” to shut down LPAR in every 20 minutes.
  5. Enable the system option “Power off when the last logical partition is shutdown”.
  6. Setup system “partition start policy” to “Auto-start always”.

AC Cycle steps:

  1. A script is executed on a client to monitor and control server power status.
  2. LPAR shutdown and then host shutdown.
  3. The script keeps monitoring power status with command “obmcutil status” in BMC console.
  4. When the script detects the host is powered off, send a command to the network power controller to do AC off.
  5. After 30 seconds, send a command to do AC on.
  6. Wait 3 minutes for BMC to be ready, and then send the command “obmcutil poweron” in the BMC console to power on the host.
  7. Then the server boots to runtime and LPAR boots to OS then bootme shutdown works again.
  8. Repeat steps 2-7.

About ten times, it will fail to host power on in the sixth step, the following is the event log:

{
"Private Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2700",
    "Created at":               "02/04/2023 04:25:46",
    "Committed at":             "02/04/2023 04:25:46",
    "Creator Subsystem":        "BMC",
    "CSSVER":                   "",
    "Platform Log Id":          "0x50000112",
    "Entry Id":                 "0x50000112",
    "BMC Event Log Id":         "539"
},
"User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Log Committed by":         "0x2000",
    "Subsystem":                "Power/Cooling",
    "Event Scope":              "Entire Platform",
    "Event Severity":           "Critical Error, Scope of Failure unknown",
    "Event Type":               "Not Applicable",
    "Action Flags": [
                                "Service Action Required",
                                "Report Externally"
    ],
    "Host Transmission":        "Not Sent",
    "HMC Transmission":         "Acked"
},
"Primary SRC": {
    "Section Version":          "1",
    "Sub-section type":         "1",
    "Created by":               "0x2700",
    "SRC Version":              "0x02",
    "SRC Format":               "0x55",
    "Virtual Progress SRC":     "False",
    "I5/OS Service Event Bit":  "False",
    "Hypervisor Dump Initiated":"False",
    "Backplane CCIN":           "2E2D",
    "Terminate FW Error":       "False",
    "Deconfigured":             "False",
    "Guarded":                  "False",
    "Error Details": {
        "Message":              "Input power was lost while the system was powered on."
    },
    "Valid Word Count":         "0x09",
    "Reference Code":           "110000AC",
    "Hex Word 2":               "00000055",
    "Hex Word 3":               "2E2D0010",
    "Hex Word 4":               "00000000",
    "Hex Word 5":               "00000000",
    "Hex Word 6":               "00000000",
    "Hex Word 7":               "00000000",
    "Hex Word 8":               "00000000",
    "Hex Word 9":               "00000000",
    "Callout Section": {
        "Callout Count":        "1",
        "Callouts": [{
            "FRU Type":         "Symbolic FRU",
            "Priority":         "Mandatory, replace all with this type as a unit",
            "Part Number":      "ACMODUL"
        }]
    }
},
"Extended User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Reporting Machine Type":   "9105-22B",
    "Reporting Serial Number":  "783C511",
    "FW Released Ver":          "PL1030_026",
    "FW SubSys Version":        "fw1030.00.ips-2",
    "Common Ref Time":          "00/00/0000 00:00:00",
    "Symptom Id Len":           "20",
    "Symptom Id":               "110000AC_2E2D0010"
},
"Failing MTMS": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Machine Type Model":       "9105-22B",
    "Serial Number":            "783C511"
},
"User Data 0": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "BMCLoad": "1.33 0.33 0.11",
    "BMCState": "NotReady",
    "BMCUptime": "0y 0d 0h 0m 37s",
    "BootState": "",
    "ChassisState": "",
    "FW Version ID": "fw1030.00.ips-25-1030.2249.20221130i-prod (PL1030_026)",
    "HostState": "",
    "Process Name": "/usr/bin/phosphor-chassis-state-manager",
    "System IM": "50001001"
},
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "_PID": "551"
}
}
,{
"Private Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x3400",
    "Created at":               "02/04/2023 04:46:22",
    "Committed at":             "02/04/2023 04:46:22",
    "Creator Subsystem":        "BMC",
    "CSSVER":                   "",
    "Platform Log Id":          "0x50000129",
    "Entry Id":                 "0x50000129",
    "BMC Event Log Id":         "584"
},
"User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Log Committed by":         "0x2000",
    "Subsystem":                "BMC Firmware",
    "Event Scope":              "Entire Platform",
    "Event Severity":           "Critical Error, Scope of Failure unknown",
    "Event Type":               "Not Applicable",
    "Action Flags": [
                                "Service Action Required",
                                "Report Externally",
                                "HMC Call Home"
    ],
    "Host Transmission":        "Not Sent",
    "HMC Transmission":         "Not Sent"
},
"Primary SRC": {
    "Section Version":          "1",
    "Sub-section type":         "1",
    "Created by":               "0x3400",
    "SRC Version":              "0x02",
    "SRC Format":               "0x55",
    "Virtual Progress SRC":     "False",
    "I5/OS Service Event Bit":  "False",
    "Hypervisor Dump Initiated":"False",
    "Backplane CCIN":           "",
    "Terminate FW Error":       "False",
    "Deconfigured":             "False",
    "Guarded":                  "False",
    "Error Details": {
        "Message":              "A critical BMC application has failed on the system, the BMC is in an undefined state"
    },
    "Valid Word Count":         "0x09",
    "Reference Code":           "BD8D3404",
    "Hex Word 2":               "00080055",
    "Hex Word 3":               "00000010",
    "Hex Word 4":               "00000000",
    "Hex Word 5":               "00000000",
    "Hex Word 6":               "00000000",
    "Hex Word 7":               "00000000",
    "Hex Word 8":               "00000000",
    "Hex Word 9":               "00000000",
    "Callout Section": {
        "Callout Count":        "1",
        "Callouts": [{
            "FRU Type":         "Maintenance Procedure Required",
            "Priority":         "Mandatory, replace all with this type as a unit",
            "Procedure":        "BMC0002"
        }]
    }
},
"Extended User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Reporting Machine Type":   "",
    "Reporting Serial Number":  "",
    "FW Released Ver":          "PL1030_026",
    "FW SubSys Version":        "fw1030.00.ips-2",
    "Common Ref Time":          "00/00/0000 00:00:00",
    "Symptom Id Len":           "20",
    "Symptom Id":               "BD8D3404_00000010"
},
"Failing MTMS": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Machine Type Model":       "",
    "Serial Number":            ""
},
"User Data 0": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "BMCLoad": "1.45 0.45 0.16",
    "BMCState": "NotReady",
    "BMCUptime": "0y 0d 0h 1m 2s",
    "BootState": "Unspecified",
    "ChassisState": "Off",
    "FW Version ID": "fw1030.00.ips-25-1030.2249.20221130i-prod (PL1030_026)",
    "HostState": "Off",
    "System IM": ""
},
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "SYSTEMD_RESULT": "failed",
    "SYSTEMD_UNIT": "com.ibm.VPD.Manager.service"
}
}
,{
"Private Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x3400",
    "Created at":               "02/04/2023 04:46:22",
    "Committed at":             "02/04/2023 04:46:22",
    "Creator Subsystem":        "BMC",
    "CSSVER":                   "",
    "Platform Log Id":          "0x5000012A",
    "Entry Id":                 "0x5000012A",
    "BMC Event Log Id":         "585"
},
"User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Log Committed by":         "0x2000",
    "Subsystem":                "BMC Firmware",
    "Event Scope":              "Entire Platform",
    "Event Severity":           "Critical Error, Scope of Failure unknown",
    "Event Type":               "Not Applicable",
    "Action Flags": [
                                "Service Action Required",
                                "Report Externally",
                                "HMC Call Home"
    ],
    "Host Transmission":        "Not Sent",
    "HMC Transmission":         "Not Sent"
},
"Primary SRC": {
    "Section Version":          "1",
    "Sub-section type":         "1",
    "Created by":               "0x3400",
    "SRC Version":              "0x02",
    "SRC Format":               "0x55",
    "Virtual Progress SRC":     "False",
    "I5/OS Service Event Bit":  "False",
    "Hypervisor Dump Initiated":"False",
    "Backplane CCIN":           "",
    "Terminate FW Error":       "False",
    "Deconfigured":             "False",
    "Guarded":                  "False",
    "Error Details": {
        "Message":              "A critical BMC application has failed on the system, the BMC is in an undefined state"
    },
    "Valid Word Count":         "0x09",
    "Reference Code":           "BD8D3404",
    "Hex Word 2":               "00080055",
    "Hex Word 3":               "00000010",
    "Hex Word 4":               "00000000",
    "Hex Word 5":               "00000000",
    "Hex Word 6":               "00000000",
    "Hex Word 7":               "00000000",
    "Hex Word 8":               "00000000",
    "Hex Word 9":               "00000000",
    "Callout Section": {
        "Callout Count":        "1",
        "Callouts": [{
            "FRU Type":         "Maintenance Procedure Required",
            "Priority":         "Mandatory, replace all with this type as a unit",
            "Procedure":        "BMC0002"
        }]
    }
},
"Extended User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Reporting Machine Type":   "",
    "Reporting Serial Number":  "",
    "FW Released Ver":          "PL1030_026",
    "FW SubSys Version":        "fw1030.00.ips-2",
    "Common Ref Time":          "00/00/0000 00:00:00",
    "Symptom Id Len":           "20",
    "Symptom Id":               "BD8D3404_00000010"
},
"Failing MTMS": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Machine Type Model":       "",
    "Serial Number":            ""
},
"User Data 0": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "BMCLoad": "1.45 0.45 0.16",
    "BMCState": "Quiesced",
    "BMCUptime": "0y 0d 0h 1m 2s",
    "BootState": "Unspecified",
    "ChassisState": "Off",
    "FW Version ID": "fw1030.00.ips-25-1030.2249.20221130i-prod (PL1030_026)",
    "HostState": "Off",
    "System IM": ""
},
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "SYSTEMD_RESULT": "failed",
    "SYSTEMD_UNIT": "com.ibm.panel.service"
}
}
lxwinspur commented 1 year ago

the Bmc dump file:

dump.tar.gz

lxwinspur commented 1 year ago

@anoo1 @geissonator @mzipse FYI

lxwinspur commented 1 year ago

The phenomenon on the GUI is as follows:

Overview

Operations

Reource management

Hardware status

Security and access

lxwinspur commented 1 year ago

EventLog_BmcDump.tar.gz

lxwinspur commented 1 year ago

@mzipse

But these problems will disappear after doing AC a few more times.

This is a very serious problem, please ask IBM experts to take a look, thanks.

anoo1 commented 1 year ago

@SunnySrivastava1984 was looking at this issue but didn't find the root cause on the original dumps, tagging him to see if more data is present on this new dump that was uploaded.

SunnySrivastava1984 commented 1 year ago

Analyzed the dumps and logs. Few things that can be tried and re-check a) The image being used is a "dirty" image, its difficult to find out what changed in the image. Please use released image. I could find many services dumping in the logs attached. b) The PEL which says "Input power was lost while the system was powered on."

lxwinspur commented 1 year ago

a) The image being used is a "dirty" image, its difficult to find out what changed in the image. Please use released image. I could find many services dumping in the logs attached.

We have used the image of the ips branch that IBM released to us, and the test results are the same. I just added the following patch to the ips branch https://github.com/ibm-openbmc/phosphor-bmc-code-mgmt/commit/d98d4a8a7d291287815e0bd8584dd310f1e7ed84

b) The PEL which says "Input power was lost while the system was powered on."

Sure, I will tell our test team, But I learned from them that they used the same script to test IBM's FW, and the power loss problem also occurred, but it did not affect the boot of the Host, and the abnormal information recovered after host power on.

SunnySrivastava1984 commented 1 year ago

but it did not affect the boot of the Host, and the abnormal information recovered after host power on.

Ok, Please let me know if your team hits the issue with modified script as well. It would be really helpful if you can generate dump specifically at the time of failure.

lxwinspur commented 1 year ago

but it did not affect the boot of the Host, and the abnormal information recovered after host power on.

Ok, Please let me know if your team hits the issue with modified script as well. It would be really helpful if you can generate dump specifically at the time of failure.

Sure, Thanks @SunnySrivastava1984 Our test team has modified the script and is testing it, so far everything works fine.

lxwinspur commented 1 year ago

@SunnySrivastava1984 Our test team has modified the script, unfortunately, this error reappeared on the 20th test Attached is the error log: AC_log_2023.02.17.tar.gz

SunnySrivastava1984 commented 1 year ago

@SunnySrivastava1984 Our test team has modified the script, unfortunately, this error reappeared on the 20th test Attached is the error log: AC_log_2023.02.17.tar.gz

Dump Analysis. From attached event log file in the dump(AC_log_2023.02.17/event_logs_2023-02-17_08-51-11.txt), It appears that VPD Manager and IBM panel service crashed on 16th Feb. 
 "Private Header": { 4719 "Section Version": "1", 4720 "Sub-section type": "0", 4721 "Created by": "0x3400", 4722 "Created at": "02/16/2023 18:15:19", 4723 "Committed at": "02/16/2023 18:15:19", 4724 "Creator Subsystem": "BMC", 4725 "CSSVER": "", 4726 "Platform Log Id": "0x500074F7", 4727 "Entry Id": "0x500074F7", 4728 "BMC Event Log Id": "1499"

4813 "User Data 1": { 4814 "Section Version": "1", 4815 "Sub-section type": "1", 4816 "Created by": "0x2000", 4817 "SYSTEMD_RESULT": "failed", 4818 "SYSTEMD_UNIT": "com.ibm.VPD.Manager.service"MD_RESULT": "failed",

Where as "AC_log_2023.02.17/BMCDUMP.0000000.00000245.20230217084714/journal-pretty.log” has entry starting from 17th Feb. 
Below is the first entry in journal

{ "SYSLOG_PID" : "14968", "_SYSTEMD_UNIT" : "dropbear@3030-192.168.1.65:22-192.168.1.111:43094.service", "__REALTIME_TIMESTAMP" : "1676623635010323", "PRIORITY" : "4", "MESSAGE" : "pam_ibmacf(dropbear:auth): ACF service auth failed 0x6: SerialNumberMismatch (serial=UNSET, sRc1=0x6, sRc2=0xffffffff, sRc3=0xFFFFFFFF, sRc4=0xFFFFFFFF)", "_SOURCE_REALTIME_TIMESTAMP" : "1676623635010232", "_GID" : "0", "_UID" : "0", "_EXE" : "/usr/sbin/dropbearmulti", "_MACHINE_ID" : "7623891df88f4612862fc6c351a09980", "_SYSTEMD_CGROUP" : "/system.slice/system-dropbear.slice/dropbear@3030-192.168.1.65:22-192.168.1.111:43094.service", "SYSLOG_IDENTIFIER" : "dropbear", "SYSLOG_FACILITY" : "10", "_SYSTEMD_INVOCATION_ID" : "79585eae9b78494db980eaaa0a6ed18b", "MONOTONIC_TIMESTAMP" : "52376949600", "_CAP_EFFECTIVE" : "1ffffffffff", "_PID" : "14968", "SYSLOG_TIMESTAMP" : "Feb 17 08:47:14 ", "_SYSTEMD_SLICE" : "system-dropbear.slice", "_TRANSPORT" : "syslog", "_COMM" : "dropbear", "_HOSTNAME" : "p10bmc", "_CMDLINE" : "/usr/sbin/dropbear -i -r /var/lib/dropbear/dropbear_rsa_host_key -B -G shellaccess -I 3600", "CURSOR" : "s=f2fbdefe184d4736b86cab87e72f1e60;i=d4e1;b=eaa1a929b1b34600a7f93f2e516863ad;m=c31e8d360;t=5f4e15c2fd313;x=65e309375b1e8ad9", "_BOOT_ID" : "eaa1a929b1b34600a7f93f2e516863ad" } 
 Hence there is no entry in journal with respect to VPD manager service or ibm panel service. I would suggest to please modify your script such that the script stops when it encounters the error and the BMC state would be quiesced. We would require the dump to be taken in this scenario.

lxwinspur commented 1 year ago

After updating and using the following solution, the problem is solved

https://github.com/ibm-openbmc/openbmc/commit/2a0c1837053f01c748d838b72185073dd75baf07