ibm-openbmc / openbmc

1030.10.ips: `Input power was lost` appeared probabilistically after AC #278

Closed lxwinspur closed 1 year ago

lxwinspur commented 1 year ago

Pre-condition:

  1. The server power cable is connected to a network power controller so that AC on/off can be done over the network.
  2. Create an LPAR and install an OS.
  3. Enable option “Automatically start when the managed system is powered on” in LPAR profile.

AC Cycle steps:

  1. A script runs on a client to monitor and control the server power status.
  2. Power on the server, wait 6 minutes, then power off the server with the command “obmcutil poweroff” in the BMC console.
  3. When script detects the host is powered off, send command to the network power controller to do AC off.
  4. After 30 seconds, send command to do AC on.
  5. Wait 3 minutes for the BMC to be ready, then send the command “obmcutil poweron” in the BMC console to power on the host.
  6. The server boots to runtime, the LPAR boots to the OS, and the server is then powered off again.
  7. Repeat steps 2-6.
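
For reference, one pass through steps 2-5 could be scripted roughly as below. This is only a sketch of the sequence and its timings, not the actual test script: `bmc_run`, `pdu_set`, and `host_is_off` are hypothetical callables standing in for the BMC console session, the network power controller, and the host status check.

```python
import time
from typing import Callable


def ac_cycle(
    bmc_run: Callable[[str], None],      # hypothetical: runs a command in the BMC console
    pdu_set: Callable[[bool], None],     # hypothetical: True = AC on, False = AC off
    host_is_off: Callable[[], bool],     # hypothetical: True once the host reports powered off
    sleep: Callable[[float], None] = time.sleep,
    poll_s: float = 5.0,                 # polling interval, an arbitrary choice
) -> None:
    """Run one pass of steps 2-5; the host is assumed to be powered on already."""
    sleep(6 * 60)                        # step 2: let the host run for 6 minutes
    bmc_run("obmcutil poweroff")         # step 2: request host power off
    while not host_is_off():             # step 3: wait until the host is reported off
        sleep(poll_s)
    pdu_set(False)                       # step 3: AC off
    sleep(30)                            # step 4: wait 30 seconds
    pdu_set(True)                        # step 4: AC on
    sleep(3 * 60)                        # step 5: wait for the BMC to be ready
    bmc_run("obmcutil poweron")          # step 5: power the host back on
```

Note that in this sequence the AC is cut as soon as the host is reported off, with no extra wait in between.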

Event log:

{
"Private Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2700",
    "Created at":               "03/23/2023 02:23:11",
    "Committed at":             "03/23/2023 02:23:12",
    "Creator Subsystem":        "BMC",
    "CSSVER":                   "",
    "Platform Log Id":          "0x50001AB7",
    "Entry Id":                 "0x50001AB7",
    "BMC Event Log Id":         "2619"
},
"User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Log Committed by":         "0x2000",
    "Subsystem":                "Power/Cooling",
    "Event Scope":              "Entire Platform",
    "Event Severity":           "Critical Error, Scope of Failure unknown",
    "Event Type":               "Not Applicable",
    "Action Flags": [
                                "Service Action Required",
                                "Report Externally"
    ],
    "Host Transmission":        "Acked",
    "HMC Transmission":         "Acked"
},
"Primary SRC": {
    "Section Version":          "1",
    "Sub-section type":         "1",
    "Created by":               "0x2700",
    "SRC Version":              "0x02",
    "SRC Format":               "0x55",
    "Virtual Progress SRC":     "False",
    "I5/OS Service Event Bit":  "False",
    "Hypervisor Dump Initiated":"False",
    "Backplane CCIN":           "2E2F",
    "Terminate FW Error":       "False",
    "Deconfigured":             "False",
    "Guarded":                  "False",
    "Error Details": {
        "Message":              "Input power was lost while the system was powered on."
    },
    "Valid Word Count":         "0x09",
    "Reference Code":           "110000AC",
    "Hex Word 2":               "00080055",
    "Hex Word 3":               "2E2F0010",
    "Hex Word 4":               "00000000",
    "Hex Word 5":               "00000000",
    "Hex Word 6":               "00000000",
    "Hex Word 7":               "00000000",
    "Hex Word 8":               "00000000",
    "Hex Word 9":               "00000000",
    "Callout Section": {
        "Callout Count":        "1",
        "Callouts": [{
            "FRU Type":         "Symbolic FRU",
            "Priority":         "Mandatory, replace all with this type as a unit",
            "Part Number":      "ACMODUL"
        }]
    }
},
"Extended User Header": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Reporting Machine Type":   "9105-42A",
    "Reporting Serial Number":  "783C4E1",
    "FW Released Ver":          "PL1030_045",
    "FW SubSys Version":        "fw1030.10-17.4",
    "Common Ref Time":          "00/00/0000 00:00:00",
    "Symptom Id Len":           "20",
    "Symptom Id":               "110000AC_2E2F0010"
},
"Failing MTMS": {
    "Section Version":          "1",
    "Sub-section type":         "0",
    "Created by":               "0x2000",
    "Machine Type Model":       "9105-42A",
    "Serial Number":            "783C4E1"
},
"User Data 0": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "BMCLoad": "3.41 0.94 0.32",
    "BMCState": "NotReady",
    "BMCUptime": "0y 0d 0h 1m 10s",
    "BootState": "",
    "ChassisState": "",
    "FW Version ID": "fw1030.10-17.4-ips-1030.2307.20230307i-prod (PL1030_045)",
    "HostState": "",
    "Process Name": "/usr/bin/phosphor-chassis-state-manager",
    "System IM": "50001000"
},
"User Data 1": {
    "Section Version": "1",
    "Sub-section type": "1",
    "Created by": "0x2000",
    "_PID": "1805"
}
}
lxwinspur commented 1 year ago

I suspect this is caused by the monitoring of the PSU's AC fault or PGood fault signal after power on: https://github.com/ibm-openbmc/phosphor-power/blob/1050/phosphor-power-supply/psu_manager.cpp#L812-L835

@spinler @mzipse FYI

spinler commented 1 year ago

> When script detects the host is powered off, send command to the network power controller to do AC off.

If I had to guess, your script is cutting AC before the chassis is actually off, or at least before chassis state manager gets a chance to persist the new chassis power state. That code is all in phosphor-state-manager/chassis_state_manager.cpp.
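
One way to rule that out would be to have the script wait on the chassis power state rather than the host state before cutting AC. A rough sketch, assuming it runs on the BMC and that the chassis object is exposed at the usual upstream D-Bus service name and path (adjust if your build differs). The `settle_s` margin is an arbitrary value I picked to give the state manager time to persist the state, not a measured number.

```python
import subprocess
import time

CHASSIS_OFF = "xyz.openbmc_project.State.Chassis.PowerState.Off"


def parse_busctl_string(out: str) -> str:
    """Parse busctl output for a string property, which looks like: s "value"."""
    type_sig, _, value = out.strip().partition(" ")
    if type_sig != "s":
        raise ValueError(f"unexpected busctl output: {out!r}")
    return value.strip('"')


def read_chassis_state() -> str:
    """Read CurrentPowerState via busctl (service and path are assumptions)."""
    out = subprocess.run(
        ["busctl", "get-property",
         "xyz.openbmc_project.State.Chassis",
         "/xyz/openbmc_project/state/chassis0",
         "xyz.openbmc_project.State.Chassis",
         "CurrentPowerState"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_busctl_string(out)


def wait_for_chassis_off(read_state=read_chassis_state, timeout_s=120.0,
                         poll_s=2.0, settle_s=5.0,
                         sleep=time.sleep, clock=time.monotonic) -> bool:
    """Return True once the chassis reports Off (plus a settle margin), else False on timeout."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if read_state() == CHASSIS_OFF:
            sleep(settle_s)   # arbitrary margin before it is considered safe to cut AC
            return True
        sleep(poll_s)
    return False
```

The script would then send the AC off command only after `wait_for_chassis_off()` returns True.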

JerryInspur commented 1 year ago

Hi, we tried adding a 20-second delay before AC off, and the error no longer appears. Why did this happen only on the S1024 with 4 PSUs, and not on the S1022 with 2 PSUs?

spinler commented 1 year ago

@JerryInspur If you wanted, you could watch that LastStateChangeTime property to see what was going on.
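
Something like the following would read it. This is a sketch only: it assumes the property is on the chassis object at the usual upstream D-Bus service name and path, and that the value is a uint64 of milliseconds since the epoch, which is my understanding of the interface definition.

```python
import subprocess
from datetime import datetime, timezone


def parse_busctl_uint64(out: str) -> int:
    """Parse busctl output for a uint64 property, which looks like: t 12345."""
    type_sig, _, value = out.strip().partition(" ")
    if type_sig != "t":
        raise ValueError(f"unexpected busctl output: {out!r}")
    return int(value)


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def last_state_change() -> datetime:
    """Read LastStateChangeTime via busctl on the BMC (service and path are assumptions)."""
    out = subprocess.run(
        ["busctl", "get-property",
         "xyz.openbmc_project.State.Chassis",
         "/xyz/openbmc_project/state/chassis0",
         "xyz.openbmc_project.State.Chassis",
         "LastStateChangeTime"],
        check=True, capture_output=True, text=True,
    ).stdout
    return ms_to_utc(parse_busctl_uint64(out))
```

Comparing that timestamp with the time the script cut AC should show whether the chassis state had actually changed before the power went away.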