It only fails on Boston, not Witherspoon? Does it fail on multiple machines? This smells like a real memory error. I know that the two systems do have slightly different memory configurations, but can't remember off the top of my head what the diffs are. I haven't heard any reports from DVT of issues so it might be related to a specific dimm config too. Can you post the fru inventory off this box? Does it boot fine if you leave this dimm garded?
It progressively guards 3 DIMMs (all of which worked fine this morning, and had maybe 2 months of uptime).
root@bstn004p1:~# ipmitool fru print
FRU Device Description : Builtin FRU Device (ID 0)
Chassis Type : Unknown
Chassis Part Number : 9006-22P
Chassis Serial : C829UAF32B00510
Board Mfg Date : Sun Dec 31 19:00:00 1995
Board Mfg : IBM
Board Product : SYSTEM PLANAR
Board Serial : 0M174S029003
Board Part Number : P9DSU
Product Manufacturer : IBM
Product Name : P9DSU-IBM-2
Product Part Number : SSP-6029U-TR4T-IB001
Product Version : NONE
Product Serial : S283248X7505191
Product Asset Tag : NONE
FRU Device Description : CPU 1 (ID 1)
Board Mfg Date : Sun Dec 31 19:00:00 1995
Board Mfg : IBM
Board Product : PROCESSOR MODULE
Board Serial : YA1934293886
Board Part Number : 02CY086
Board Extra : EC:22
FRU Device Description : CPU 2 (ID 2)
Board Mfg Date : Sun Dec 31 19:00:00 1995
Board Mfg : IBM
Board Product : PROCESSOR MODULE
Board Serial : YA1934305407
Board Part Number : 02CY086
Board Extra : EC:22
FRU Device Description : Backplane (ID 3)
Chassis Type : Unknown
Chassis Part Number : 9006-22P
Chassis Serial : C829UAF32B00510
Board Mfg Date : Sun Dec 31 19:00:00 1995
Board Mfg : IBM
Board Product : SYSTEM PLANAR
Board Serial : 0M174S029003
Board Part Number : P9DSU
FRU Device Description : P1-DIMMA1 (ID 12)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569b648
FRU Device Description : P1-DIMMA2 (ID 13)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569c3f4
FRU Device Description : P1-DIMMB1 (ID 14)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569b77a
FRU Device Description : P1-DIMMB2 (ID 15)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569c35b
FRU Device Description : P1-DIMMC1 (ID 16)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569bf4b
FRU Device Description : P1-DIMMC2 (ID 17)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 357e1308
FRU Device Description : P1-DIMMD1 (ID 18)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569bb5c
FRU Device Description : P1-DIMMD2 (ID 19)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3569c3a7
FRU Device Description : P2-DIMMA1 (ID 20)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMA2 (ID 21)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMB1 (ID 22)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMB2 (ID 23)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMC1 (ID 24)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMC2 (ID 25)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMD1 (ID 26)
Unknown FRU header version 0x00
FRU Device Description : P2-DIMMD2 (ID 27)
Unknown FRU header version 0x00
FRU Device Description : System Firmware (ID 47)
Product Name : OpenPOWER Firmware
Product Version : open-power-p9dsu-v2.1-198-g2184c14
Product Extra : buildroot-2018.08.2-8-gd5fc953
Product Extra : skiboot-v6.1-188-g606a5a3d44e3
Product Extra : hostboot-40a34c9-p36fbe89
Product Extra : occ-d7adf6c
Product Extra : linux-4.19.1-openpower1-p147caa7
Product Extra : petitboot-1.9.1
Product Extra : machine-xml-32ce616
Product Extra : hostbo
FRU Device Description : PSU 1 (ID 60)
Product Manufacturer : SUPERMICRO
Product Name : PWS-1K62A-1R
Product Part Number : PWS-1K62A-1R
Product Version : 1.1
Product Serial : P1K6BCG31LB0726
FRU Device Description : PSU 2 (ID 61)
Product Manufacturer : SUPERMICRO
Product Name : PWS-1K62A-1R
Product Part Number : PWS-1K62A-1R
Product Version : 1.1
Product Serial : P1K6BCG31LB0725
I went away and came back and for the second time I booted with this:
root@bstn004p1:~# opal-gard list
ID | Error | Type | Path
---------------------------------------------------------
00000001 | 90000005 | Fatal | /Sys0/Node0/DIMM1
00000002 | 9000000a | Fatal | /Sys0/Node0/DIMM7
00000003 | 9000000f | Fatal | /Sys0/Node0/DIMM3
=========================================================
and I eventually booted, but I have also seen it not boot in that configuration.
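For comparing gard records between boots, a small parser over the `opal-gard list` text can help. This is only a sketch: the column layout is assumed to match the listing above, and `parse_gard_list` is a hypothetical helper name, not part of opal-gard.

```python
import re

# Sample text copied from the listing above; in practice this would be the
# captured output of `opal-gard list`.
GARD_LIST = """\
 ID       | Error    | Type       | Path
---------------------------------------------------------
 00000001 | 90000005 | Fatal      | /Sys0/Node0/DIMM1
 00000002 | 9000000a | Fatal      | /Sys0/Node0/DIMM7
 00000003 | 9000000f | Fatal      | /Sys0/Node0/DIMM3
=========================================================
"""

# One data row: 8 hex digits | 8 hex digits | type word | target path.
# The header and separator lines do not match and are skipped.
ROW = re.compile(
    r"^\s*([0-9a-fA-F]{8})\s*\|\s*([0-9a-fA-F]{8})\s*\|\s*(\w+)\s*\|\s*(\S+)\s*$"
)


def parse_gard_list(text):
    """Return one dict per gard record found in the listing (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rec_id, error, kind, path = m.groups()
            records.append({"id": rec_id, "error": error, "type": kind, "path": path})
    return records


records = parse_gard_list(GARD_LIST)
# Last path component is the garded unit, e.g. "DIMM1".
garded_dimms = sorted(r["path"].rsplit("/", 1)[-1] for r in records)
print(garded_dimms)
```

Running the same function over the listing from each boot and diffing the resulting paths would show which DIMMs were newly garded.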
-- Stewart Smith OPAL Architect, IBM.
To check if it failed on Witherspoon I'd need a witherspoon to test it - we only have one.
-- Stewart Smith OPAL Architect, IBM.
root@bstn004p1:~# opal-gard list
ID | Error | Type | Path
---------------------------------------------------------
00000001 | 90000014 | Fatal | /Sys0/Node0/DIMM1
00000002 | 90000019 | Fatal | /Sys0/Node0/DIMM7
00000003 | 9000001e | Predictive | /Sys0/Node0/DIMM5
00000004 | 90000025 | Predictive | /Sys0/Node0/DIMM3
=========================================================
is the next thing that worked
This morning, all of my memory worked fine. At least I have 64GB of RAM now?
Also, I get:
[ 16.165513] Memory failure: 0x0: reserved kernel page still referenced by 1 users
[ 16.166347] Memory failure: 0x0: recovery action for reserved kernel page: Failed
[ 17.268941] Memory failure: 0x0: already hardware poisoned
[ 17.269908] Memory failure: 0x1: reserved kernel page still referenced by 1 users
[ 17.270776] Memory failure: 0x1: recovery action for reserved kernel page: Failed
once booted to an OS, so it looks like something is pretty dire.
Here are the opal-prd logs of the successful boot: opal-prd.txt
Nov 19 23:07:14 bstn004p1 opal-prd[2899]: MEM: Failed to offline memory! page addr: 0000000000000400 type: 1: Device or resource busy
Nov 19 23:07:24 bstn004p1 opal-prd[2899]: MEM: Memory error: range 0000000000000400-00000007fffff5c0, type: uncorrectable
Nov 19 23:07:24 bstn004p1 opal-prd[2899]: MEM: Failed to offline memory! page addr: 0000000000010400 type: 1: Device or resource busy
-- Reboot --
Nov 20 00:44:13 bstn004p1 opal-prd[2894]: MEM: Memory error: range 0000000000000800-0000000fffffeb40, type: uncorrectable
Nov 20 00:44:13 bstn004p1 opal-prd[2894]: MEM: Failed to offline memory! page addr: 0000000000000800 type: 1: Device or resource busy
Nov 20 00:44:14 bstn004p1 opal-prd[2894]: MEM: Memory error: range 0000000000000800-0000000fffffeb40, type: uncorrectable
Nov 20 00:44:14 bstn004p1 opal-prd[2894]: MEM: Failed to offline memory! page addr: 0000000000010800 type: 1: Device or resource busy
These are the error parts of the log; this looks pretty strange/wrong.
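To see how large the reported ranges are, the `Memory error: range` lines can be pulled out and measured. A minimal sketch follows; it assumes the two hex values are byte addresses, which the log itself does not confirm, and `memory_error_ranges` is a hypothetical helper name.

```python
import re

# Two of the "Memory error" lines quoted above.
LOG = """\
Nov 19 23:07:24 bstn004p1 opal-prd[2899]: MEM: Memory error: range 0000000000000400-00000007fffff5c0, type: uncorrectable
Nov 20 00:44:13 bstn004p1 opal-prd[2894]: MEM: Memory error: range 0000000000000800-0000000fffffeb40, type: uncorrectable
"""

RANGE = re.compile(
    r"Memory error: range ([0-9a-f]{16})-([0-9a-f]{16}), type: (\w+)"
)


def memory_error_ranges(text):
    """Return (start, end, type) tuples for every reported range (hypothetical helper)."""
    out = []
    for m in RANGE.finditer(text):
        start, end = int(m.group(1), 16), int(m.group(2), 16)
        out.append((start, end, m.group(3)))
    return out


for start, end, kind in memory_error_ranges(LOG):
    # Assumption: the range is expressed in bytes.
    span_gib = (end - start) / 2**30
    print(f"{kind}: {start:#x}-{end:#x} spans about {span_gib:.1f} GiB")
```

Under that byte-address assumption, each reported range covers tens of GiB rather than a single page, which may be part of why the output looks wrong.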
So now it boots with no failures?
Do you have a last known good level?
It booted in that config, but I wouldn't dare try rebooting as I wouldn't expect it to boot twice in a row. There were errors and guarded out DIMMs though.
https://github.com/open-power/sbe/issues/13 breaks compatibility, so going back is not easy. Luckily someone fixed the cronus setup overnight, so hopefully I can go back and try something.
If I go back to hostboot and hcode (there's a codependency) prior to op-build 6acfb3e3e7ebe06708cb03d83d06d46ed1ab2cfb (that is, hw102318a.930 hcode and hostboot fecb93f473161bae5bded405aaca525c78f80a22), then it works and I get all my memory.
Attempting to bisect it down.
git bisect start
# bad: [9d418f5eefe35bd533928cff03822943dcb7852e] Add missing mutex in LPC error path
git bisect bad 9d418f5eefe35bd533928cff03822943dcb7852e
# good: [876b79aacd9b14f4c3561e954daa0285747c9662] Fix for SBE_P9_XIP_CUSTOMIZE_UNSUCCESSFUL during ipl with one EX
git bisect good 876b79aacd9b14f4c3561e954daa0285747c9662
# good: [fecb93f473161bae5bded405aaca525c78f80a22] Fix Centaur workaround in p9c_mss_row_repair
git bisect good fecb93f473161bae5bded405aaca525c78f80a22
# good: [ad52fe4087a24997857752c2526807021c14ef5f] PM: Fixed handling of CME LFIR mask during PM complex reset.
git bisect good ad52fe4087a24997857752c2526807021c14ef5f
# good: [0e15017d11ea9ff2dd705119feeb9ed73ed405dc] Add exp_i2c_scom driver that will be consumed by HB/SBE platforms
git bisect good 0e15017d11ea9ff2dd705119feeb9ed73ed405dc
# bad: [8351efdb3b65ed4fc5472e78efd5db315663e42f] Inband MMIO access to OCMB (skeleton)
git bisect bad 8351efdb3b65ed4fc5472e78efd5db315663e42f
# bad: [e68587e470a3fe50465de722a9d74db1937f5ab3] Support flag parameter for addBusCallout
git bisect bad e68587e470a3fe50465de722a9d74db1937f5ab3
and looking there, I'm now trying just to revert this commit:
commit 40a34c94a981ebfe9e1ff95263663cda0cbaaa42
Author: Stephen Glancy <sglancy@us.ibm.com>
Date: Mon Oct 22 21:33:27 2018 -0500
Fixes LRDIMM eff_config bugs
Change-Id: I74dd2332bda79ab9578d450ba74322fd953b1f46
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/67863
Tested-by: Jenkins Server <pfd-jenkins+hostboot@us.ibm.com>
Reviewed-by: Louis Stermole <stermole@us.ibm.com>
Reviewed-by: STEPHEN GLANCY <sglancy@us.ibm.com>
Dev-Ready: STEPHEN GLANCY <sglancy@us.ibm.com>
Tested-by: HWSV CI <hwsv-ci+hostboot@us.ibm.com>
Reviewed-by: ANDRE A. MARIN <aamarin@us.ibm.com>
Tested-by: Hostboot CI <hostboot-ci+hostboot@us.ibm.com>
Reviewed-by: Jennifer A. Stofer <stofer@us.ibm.com>
Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/68244
Tested-by: Jenkins OP Build CI <op-jenkins+hostboot@us.ibm.com>
Tested-by: FSP CI Jenkins <fsp-CI-jenkins+hostboot@us.ibm.com>
Reviewed-by: Daniel M. Crowell <dcrowell@us.ibm.com>
and trying hostboot master with this one commit reverted: things work fine!
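For anyone repeating this on another box, the bisect log above follows the usual binary search between a known-good and a known-bad commit. The sketch below shows that idea on a simplified linear history; the commit names and the `boots_badly` check are hypothetical stand-ins for "build hostboot at this commit and try to IPL", and real git history is a graph rather than a list.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history of 16 commits, with c11 introducing the failure.
history = [f"c{i}" for i in range(16)]
culprit = "c11"
tested = []


def boots_badly(commit: str) -> bool:
    """Stand-in for a real boot test; records which commits were tried."""
    tested.append(commit)
    return int(commit[1:]) >= int(culprit[1:])


print(first_bad(history, boots_badly), "found after", len(tested), "test boots")
```

Each test boot halves the remaining range, which is why a handful of good/bad marks was enough to land on a single commit to revert.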
The FRUs for memory on a Boston where it does not occur:
FRU Device Description : P1-DIMMA1 (ID 12)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7214
FRU Device Description : P1-DIMMA2 (ID 13)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7216
FRU Device Description : P1-DIMMB1 (ID 14)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee72ff
FRU Device Description : P1-DIMMB2 (ID 15)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee6ec7
FRU Device Description : P1-DIMMC1 (ID 16)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7300
FRU Device Description : P1-DIMMC2 (ID 17)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee94ed
FRU Device Description : P1-DIMMD1 (ID 18)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7304
FRU Device Description : P1-DIMMD2 (ID 19)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7215
FRU Device Description : P2-DIMMA1 (ID 20)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee8e98
FRU Device Description : P2-DIMMA2 (ID 21)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee6a93
FRU Device Description : P2-DIMMB1 (ID 22)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7301
FRU Device Description : P2-DIMMB2 (ID 23)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7302
FRU Device Description : P2-DIMMC1 (ID 24)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee8ea0
FRU Device Description : P2-DIMMC2 (ID 25)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee9128
FRU Device Description : P2-DIMMD1 (ID 26)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee8e30
FRU Device Description : P2-DIMMD2 (ID 27)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 32GiB 64-bit ECC RDIMM
Product Part Number : M393A4K40BB2-CTD
Product Version : 00
Product Serial : 34ee7211
and another one where it does:
FRU Device Description : P1-DIMMA1 (ID 12)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3446d454
FRU Device Description : P1-DIMMA2 (ID 13)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3447045f
FRU Device Description : P1-DIMMB1 (ID 14)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470ada
FRU Device Description : P1-DIMMB2 (ID 15)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3446e9f6
FRU Device Description : P1-DIMMC1 (ID 16)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3447045a
FRU Device Description : P1-DIMMC2 (ID 17)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470210
FRU Device Description : P1-DIMMD1 (ID 18)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 344709a4
FRU Device Description : P1-DIMMD2 (ID 19)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470245
FRU Device Description : P2-DIMMA1 (ID 20)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470212
FRU Device Description : P2-DIMMA2 (ID 21)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470a41
FRU Device Description : P2-DIMMB1 (ID 22)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470244
FRU Device Description : P2-DIMMB2 (ID 23)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 344704d7
FRU Device Description : P2-DIMMC1 (ID 24)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3447059f
FRU Device Description : P2-DIMMC2 (ID 25)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 34470533
FRU Device Description : P2-DIMMD1 (ID 26)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3447011e
FRU Device Description : P2-DIMMD2 (ID 27)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3446fe73
The commonality between systems where we fail seems to be the 16GB DIMMs.
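A quick way to check that commonality across more machines is to reduce each `ipmitool fru print` dump to a slot-to-part-number map and compare. This is a rough sketch that assumes the field names and DIMM slot naming seen in the dumps above; `dimm_part_numbers` is a hypothetical helper.

```python
import re
from collections import Counter

# Shortened sample in the same shape as the dumps above: one populated slot
# and one slot that reports no FRU data.
FRU_SNIPPET = """\
FRU Device Description : P1-DIMMA1 (ID 12)
Product Manufacturer : Samsung Electronics
Product Name : DDR4-2666 16GiB 64-bit ECC RDIMM
Product Part Number : M393A2K40BB2-CTD
Product Version : 00
Product Serial : 3446d454
FRU Device Description : P2-DIMMA1 (ID 20)
Unknown FRU header version 0x00
"""

SLOT = re.compile(r"(P\d-DIMM[A-Z]\d) \(ID \d+\)")


def dimm_part_numbers(text):
    """Map DIMM slot name to part number for slots that report one (hypothetical helper)."""
    parts = {}
    slot = None
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            m = SLOT.match(value)
            slot = m.group(1) if m else None  # non-DIMM FRUs reset the slot
        elif slot and key == "Product Part Number":
            parts[slot] = value
    return parts


parts = dimm_part_numbers(FRU_SNIPPET)
print(parts)
print(Counter(parts.values()))
```

Running this over the dumps from failing and working systems and comparing the part-number counts would confirm whether M393A2K40BB2-CTD is present only on the failing ones.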
We're currently working to recreate this issue on one of our lab systems.
So, Hostboot master appears broken on Boston. It won't get past istep15.1. BMC2.02 and op-build 2251a24d7c555878e0fc9fe04bed856375c98e89 (i.e. hostboot hostboot-40a34c9-p36fbe89/hbicore.bin) fails to IPL on Boston, and I don't seem to be able to get back to any recent op-build that works: 2184c1452e4134a8ea68e26312cccf205a600c28 also fails.