open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

Power10 small core cpu checkstop #224

Closed lili-lilili closed 1 year ago

lili-lilili commented 1 year ago

When we test a Power10 small core CPU on a Rainier system, the CPU checkstops with "L3 Dir read UE". With a big core CPU there is no problem. Does a small core CPU need anything special, such as particular attribute settings or configuration?

Here is the event log: { "Private Header": { "Section Version": "1", "Sub-section type": "0", "Created by": "0xE500", "Created at": "05/17/2023 05:48:45", "Committed at": "05/17/2023 05:48:45", "Creator Subsystem": "BMC", "CSSVER": "", "Platform Log Id": "0x50000011", "Entry Id": "0x50000011", "BMC Event Log Id": "243" }, "User Header": { "Section Version": "1", "Sub-section type": "0", "Log Committed by": "0x2000", "Subsystem": "Processor Unit (CPU)", "Event Scope": "Entire Platform", "Event Severity": "Unrecoverable Error", "Event Type": "Not Applicable", "Action Flags": [ "Service Action Required", "Report Externally", "HMC Call Home" ], "Host Transmission": "Not Sent", "HMC Transmission": "Not Sent" }, "Primary SRC": { "Section Version": "1", "Sub-section type": "1", "Created by": "0xE500", "SRC Version": "0x02", "SRC Format": "0x55", "Virtual Progress SRC": "False", "I5/OS Service Event Bit": "False", "Hypervisor Dump Initiated":"False", "Backplane CCIN": "2E2F", "Terminate FW Error": "False", "Deconfigured": "False", "Guarded": "True", "Error Details": { "Message": "Error Signature: 0x20DA0020 0x00020001 0x5074140E" }, "Valid Word Count": "0x09", "Reference Code": "BD13E510", "Hex Word 2": "00000055", "Hex Word 3": "2E2F0010", "Hex Word 4": "00000000", "Hex Word 5": "01000000", "Hex Word 6": "20DA0020", "Hex Word 7": "00020001", "Hex Word 8": "5074140E", "Hex Word 9": "00000000", "Callout Section": { "Callout Count": "1", "Callouts": [{ "FRU Type": "Normal Hardware FRU", "Priority": "Medium Priority", "Location Code": "U78DB.ND0.WZS008L-P0-C24", "Part Number": "F210110", "CCIN": "AB42", "Serial Number": " " }] }, "SRC Details": { "Primary Attention": "system checkstop", "Signature Description": { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(20)[14] L3 Dir read UE", "Attn Type": "checkstop" } } }, "Extended User Header": { "Section Version": "1", "Sub-section type": "0", "Created by": "0x2000", "Reporting Machine Type": "9105-42A", 
"Reporting Serial Number": "783C4C1", "FW Released Ver": "", "FW SubSys Version": "None", "Common Ref Time": "00/00/0000 00:00:00", "Symptom Id Len": "36", "Symptom Id": "BD13E510_20DA0020_00020001_5074140E" }, "Failing MTMS": { "Section Version": "1", "Sub-section type": "0", "Created by": "0x2000", "Machine Type Model": "9105-42A", "Serial Number": "783C4C1" }, "User Data 0": { "Section Version": "1", "Sub-section type": "1", "Created by": "0x2000", "BMCLoad": "2.71 1.82 1.25", "BMCState": "Ready", "BMCUptime": "0y 0d 1h 27m 54s", "BootState": "SecondaryProcInit", "ChassisState": "On", "FW Version ID": "none", "HostState": "Running", "Process Name": "/usr/bin/openpower-hw-diags", "System IM": "50001000" }, "User Data 1": { "Section Version": "1", "Sub-section type": "1", "Created by": "0x2000", "PEL_SUBSYSTEM": "0x13", "SRC6": "551157792", "SRC7": "131073", "SRC8": "1349784590", "_PID": "4887" }, "User Data 2": { "Section Version": "1", "Sub-section type": "1", "Created by": "0x2000", "Data": [ { "Deconfigured": false, "EntityPath": [ 38, 1, 0, 2, 0, 5, 2, 35, 5, 83, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "GuardType": "GARD_Unrecoverable", "Guarded": true, "LocationCode": "Ufcs-P0-C24", "Priority": "M" } ] }, "User Data 3": { "Section Version": "1", "Sub-section type": "4", "Created by": "0xE500", "Hostboot Scratch Registers": { "0x0000283c": "0xaa811504", "0x000000004602f489": "0x0000000000000000" } }, "User Data 4": { "Section Version": "1", "Sub-section type": "5", "Created by": "0xE500", "Scratch Register Error Signature": { "Chip ID": "0x004b003e", "Signature ID": "0x5993000a" } }, "User Data 5": { "Section Version": "1", "Sub-section type": "3", "Created by": "0xE500", "Callout List FFDC": [ { "Callout Type": "Hardware Callout", "Guard": true, "Priority": "medium", "Target": "physical:sys-0/node-0/proc-2/eq-5/fc-0/core-0" } ] }, "User Data 6": { "Section Version": "1", "Sub-section type": "1", "Created by": "0xE500", "Signature List": [ { "Chip Desc": "node 0 
proc 0 (P10 2.0)", "Signature": "PB_EXT_FIR(0)[4] pb_x4_fir_err", "Attn Type": "checkstop" }, { "Chip Desc": "node 0 proc 1 (P10 2.0)", "Signature": "PB_EXT_FIR(0)[7] pb_x7_fir_err", "Attn Type": "checkstop" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(20)[14] L3 Dir read UE", "Attn Type": "checkstop" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(4)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(5)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(6)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(11)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(12)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(13)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(20)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(22)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 2 (P10 2.0)", "Signature": "EQ_L3_FIR(28)[13] L3 DIR read CE", "Attn Type": "recoverable" }, { "Chip Desc": "node 0 proc 3 (P10 2.0)", "Signature": "PB_EXT_FIR(0)[1] pb_x1_fir_err", "Attn Type": "checkstop" } ] }, "User Data 7": { "Section Version": "1", "Sub-section type": "2", "Created by": "0xE500", "Register Dump": [ "node 0 proc 0 (P10 2.0) ****", " GFIR_CS (0x570F001C) 1000 0000 0000 0000", " CFIR_N1_CS (0x03040000) 8000 0000 4000 0000", " CFIR_N1_CS_MASK (0x03040040) 2000 0000 0000 0000", " PB_EXT_FIR (0x030113AE) 0800 0000 0000 0000", " PB_EXT_FIR_MASK (0x030113B1) D400 0000 0000 0000", "node 0 ocmb 0 (Explorer 2.0) ", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 
FFE0 0000 0000", "node 0 ocmb 4 (Explorer 2.0) ", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 2 (Explorer 2.0) ", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 6 (Explorer 2.0) ", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 proc 1 (P10 2.0) ****", " GFIR_CS (0x570F001C) 1000 0000 0000 0000", " CFIR_N1_CS (0x03040000) 8000 0000 4000 0000", " CFIR_N1_CS_MASK (0x03040040) 2000 0000 0000 0000", " PB_EXT_FIR (0x030113AE) 0100 0000 0000 0000", " PB_EXT_FIR_MASK (0x030113B1) B400 0000 0000 0000", "node 0 ocmb 24 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 26 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 28 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 30 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 proc 2 (P10 2.0) ****", " GFIR_CS (0x570F001C) 0000 0000 0400 0000", " CFIR_EQ_CS (0x25040000) 8004 0000 0000 0000", " CFIR_EQ_CS_MASK (0x25040040) 2000 0000 0000 0000", " EQ_L3_FIR (0x25018600) 0007 1000 0000 0000", " EQ_L3_FIR_MASK (0x25018603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x25018607) 3DB4 4100 0000 0000", " GFIR_RE (0x570F001B) 0000 0000 7500 0000", " CFIR_EQ_RE (0x21040001) 8007 0000 0000 0000", " EQ_L3_FIR (0x21018600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x21018603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x21018607) 3DB4 4100 0000 0000", " EQ_L3_FIR (0x21014600) 0005 1008 0000 0000", " EQ_L3_FIR_MASK (0x21014603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x21014607) 3DB4 4100 0000 0000", " EQ_L3_FIR (0x21012600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x21012603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x21012607) 3DB4 4100 0000 0000", " CFIR_EQ_RE (0x22040001) 8000 8000 0000 0000", " EQ_L3_FIR (0x22011600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x22011603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 
(0x22011607) 3DB4 4100 0000 0000", " CFIR_EQ_RE (0x23040001) 8006 0000 0000 0000", " CFIR_EQ_RE_MASK (0x23040041) 0198 1800 0000 0000", " EQ_L3_FIR (0x23018600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x23018603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x23018607) 3DB4 4100 0000 0000", " EQ_L3_FIR (0x23014600) 0005 1008 0000 0000", " EQ_L3_FIR_MASK (0x23014603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x23014607) 3DB4 4100 0000 0000", " CFIR_EQ_RE (0x25040001) 8005 0000 0000 0000", " EQ_L3_FIR (0x25012600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x25012603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x25012607) 3DB4 4100 0000 0000", " CFIR_EQ_RE (0x27040001) 8004 0000 0000 0000", " EQ_L3_FIR (0x27018600) 0005 1000 0000 0000", " EQ_L3_FIR_MASK (0x27018603) 4249 3E29 8000 0000", " EQ_L3_FIR_ACT1 (0x27018607) 3DB4 4100 0000 0000", "node 0 ocmb 32 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 34 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 36 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 38 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 proc 3 (P10 2.0) ****", " GFIR_CS (0x570F001C) 1000 0000 0000 0000", " CFIR_N1_CS (0x03040000) 8000 0000 4000 0000", " CFIR_N1_CS_MASK (0x03040040) 2000 0000 0000 0000", " PB_EXT_FIR (0x030113AE) 4000 0000 0000 0000", " PB_EXT_FIR_MASK (0x030113B1) B400 0000 0000 0000", "node 0 ocmb 56 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 58 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 60 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000", "node 0 ocmb 62 (Explorer 2.0) **", " CHIPLET_OCMB_FIR_MASK (0x08040002) 6627 FFE0 0000 0000" ] } }
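To read the register dump against the signature list, it can help to decode which FIR bits are set and which of those are unmasked. The sketch below takes the `EQ_L3_FIR` and `EQ_L3_FIR_MASK` values dumped for address 0x25018600 on proc 2 and lists the bit positions using IBM MSB-0 numbering (bit 0 is the most significant bit), which is how the `[13]` and `[14]` in the signatures are counted. The helper names `parse_reg` and `set_bits` are made up for this illustration and are not part of any Hostboot or openpower-hw-diags tooling.

```python
def parse_reg(text: str) -> int:
    """Parse a dump value such as '0007 1000 0000 0000' into an integer."""
    return int(text.replace(" ", ""), 16)


def set_bits(value: int, width: int = 64) -> list[int]:
    """Return the set bit positions in IBM MSB-0 numbering (bit 0 = MSB)."""
    return [i for i in range(width) if (value >> (width - 1 - i)) & 1]


# Values copied from the register dump for proc 2, EQ_L3_FIR at 0x25018600.
fir = parse_reg("0007 1000 0000 0000")
mask = parse_reg("4249 3E29 8000 0000")

# A bit only raises an attention if it is set in the FIR and clear in the mask.
active = fir & ~mask & (2**64 - 1)

print(set_bits(fir))     # [13, 14, 15, 19]
print(set_bits(active))  # [13, 14]
```

Bits 13 and 14 are the only set bits that are not masked, which lines up with the "L3 DIR read CE" and "L3 Dir read UE" signatures reported for this chip.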

dcrowell77 commented 1 year ago

I assume you are booting into OPAL? Do you have all power management disabled? What is the last thing you see in the boot console?

This piece of FFDC shows that the IPL got through Hostboot and into the payload (PHYP or OPAL):

"Hostboot Scratch Registers": {
    "0x0000283c": "0xaa811504",  ==> istep 21.4
    "0x000000004602f489": "0x0000000000000000"  ==> Says "hostboot" while we're running, set to zero when we jump to payload
}
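As a rough illustration of how `0xaa811504` reads as istep 21.4: the two low-order bytes are 0x15 (21) and 0x04 (4). The sketch below assumes the major istep is in bits 15:8 and the minor istep in bits 7:0. That layout is inferred only from this one value and the annotation above, not from a register specification, and the meaning of the upper bytes is not covered here. `decode_istep` is a hypothetical helper name.

```python
def decode_istep(scratch: int) -> tuple[int, int]:
    """Extract (major, minor) istep from a scratch register value.

    Assumption: major istep is in bits 15:8 and minor istep in bits 7:0.
    """
    major = (scratch >> 8) & 0xFF
    minor = scratch & 0xFF
    return major, minor


major, minor = decode_istep(0xAA811504)
print(f"istep {major}.{minor}")  # istep 21.4
```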

You should probably engage the OPAL team (https://github.com/open-power/skiboot) to see if they are aware of any issues.

mikey commented 1 year ago

@lili-lilili Do you have the OPAL boot log for the small core case so we can take a look?

lili-lilili commented 1 year ago

> @lili-lilili Do you have the OPAL boot log for the small core case so we can take a look?

The OPAL log is posted here: https://github.com/open-power/skiboot/issues/275

dcrowell77 commented 1 year ago

This is being discussed in the skiboot repo, so I am closing this issue.