open-power / hostboot

System initialization firmware for Power systems
Apache License 2.0

Fails to boot on p9dsu with 16GB DIMMs #150

Closed by ghost 5 years ago

ghost commented 5 years ago

So, Hostboot master appears broken on Boston: it won't get past istep15.1. With BMC2.02 and op-build 2251a24d7c555878e0fc9fe04bed856375c98e89 (i.e. hostboot hostboot-40a34c9-p36fbe89/hbicore.bin), it fails to IPL on Boston.

10.12460|ERRL|Dumping errors reported prior to registration   
 10.12793|================================================     
 10.12794|Error reported by prdf (0xE500) PLID 0x90000005      
 10.12794|  PRD Signature            : 0x240004 0x18A0000E     
 10.15353|  Signature Description    : pu.mca:k0:n0:s0:p00:c4 (MCAECCFIR[14]) Mainline read UE                                 
 10.17354|  UserData1   : 0x0024000400000103                                                                                   
 10.17354|  UserData2   : 0x18a0000e88047008                   
 10.17354|------------------------------------------------     
 10.18357|  Callout type             : Hardware Callout        
 10.20356|  CPU id                   : 2                       
 10.22360|  Target                   : Physical:/Sys0/Node0/DIMM5                                                              
 10.22361|  Deconfig State           : NO_DECONFIG             
 10.22361|  GARD Error Type          : GARD_Fatal                                                                              
 10.22362|  Priority                 : SRCI_PRIORITY_MED       
 10.22362|------------------------------------------------     
 10.22362|                                                     
 10.22363|------------------------------------------------                                                                     
 10.22363|  System checkstop occurred during IPL on previous boot                                                              
 10.22364|------------------------------------------------                                                                     
 10.22364|                                                                                                                     
 10.22364|------------------------------------------------     
 10.22365|  Hostboot Build ID: hostboot-40a34c9-p36fbe89/hbicore.bin                                                           
 10.22365|================================================

... and I don't seem to be able to get back to any recent op-build that works; 2184c1452e4134a8ea68e26312cccf205a600c28 also fails.

dcrowell77 commented 5 years ago

It only fails on Boston, not Witherspoon? Does it fail on multiple machines? This smells like a real memory error. I know that the two systems do have slightly different memory configurations, but can't remember off the top of my head what the diffs are. I haven't heard any reports from DVT of issues so it might be related to a specific dimm config too. Can you post the fru inventory off this box? Does it boot fine if you leave this dimm garded?

ghost commented 5 years ago

It progressively guards 3 DIMMs (all of which worked fine this morning, and had maybe 2 months of uptime).

root@bstn004p1:~# ipmitool fru print
FRU Device Description : Builtin FRU Device (ID 0)
 Chassis Type          : Unknown
 Chassis Part Number   : 9006-22P
 Chassis Serial        : C829UAF32B00510
 Board Mfg Date        : Sun Dec 31 19:00:00 1995               
 Board Mfg             : IBM
 Board Product         : SYSTEM PLANAR                          
 Board Serial          : 0M174S029003
 Board Part Number     : P9DSU
 Product Manufacturer  : IBM
 Product Name          : P9DSU-IBM-2
 Product Part Number   : SSP-6029U-TR4T-IB001
 Product Version       : NONE
 Product Serial        : S283248X7505191
 Product Asset Tag     : NONE

FRU Device Description : CPU 1 (ID 1)
 Board Mfg Date        : Sun Dec 31 19:00:00 1995
 Board Mfg             : IBM
 Board Product         : PROCESSOR MODULE
 Board Serial          : YA1934293886
 Board Part Number     : 02CY086
 Board Extra           : EC:22

FRU Device Description : CPU 2 (ID 2)
 Board Mfg Date        : Sun Dec 31 19:00:00 1995
 Board Mfg             : IBM
 Board Product         : PROCESSOR MODULE
 Board Serial          : YA1934305407
 Board Part Number     : 02CY086
 Board Extra           : EC:22

FRU Device Description : Backplane (ID 3)
 Chassis Type          : Unknown
 Chassis Part Number   : 9006-22P
 Chassis Serial        : C829UAF32B00510
 Board Mfg Date        : Sun Dec 31 19:00:00 1995
 Board Mfg             : IBM
 Board Product         : SYSTEM PLANAR
 Board Serial          : 0M174S029003
 Board Part Number     : P9DSU

FRU Device Description : P1-DIMMA1 (ID 12)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569b648

FRU Device Description : P1-DIMMA2 (ID 13)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569c3f4

FRU Device Description : P1-DIMMB1 (ID 14)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569b77a

FRU Device Description : P1-DIMMB2 (ID 15)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569c35b

FRU Device Description : P1-DIMMC1 (ID 16)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569bf4b

FRU Device Description : P1-DIMMC2 (ID 17)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 357e1308

FRU Device Description : P1-DIMMD1 (ID 18)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569bb5c

FRU Device Description : P1-DIMMD2 (ID 19)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3569c3a7

FRU Device Description : P2-DIMMA1 (ID 20)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMA2 (ID 21)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMB1 (ID 22)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMB2 (ID 23)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMC1 (ID 24)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMC2 (ID 25)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMD1 (ID 26)
 Unknown FRU header version 0x00

FRU Device Description : P2-DIMMD2 (ID 27)
 Unknown FRU header version 0x00

FRU Device Description : System Firmware (ID 47)
 Product Name          : OpenPOWER Firmware
 Product Version       : open-power-p9dsu-v2.1-198-g2184c14
 Product Extra         :        buildroot-2018.08.2-8-gd5fc953
 Product Extra         :        skiboot-v6.1-188-g606a5a3d44e3
 Product Extra         :        hostboot-40a34c9-p36fbe89
 Product Extra         :        occ-d7adf6c
 Product Extra         :        linux-4.19.1-openpower1-p147caa7
 Product Extra         :        petitboot-1.9.1
 Product Extra         :        machine-xml-32ce616
 Product Extra         :        hostbo

FRU Device Description : PSU 1 (ID 60)
 Product Manufacturer  : SUPERMICRO
 Product Name          : PWS-1K62A-1R
 Product Part Number   : PWS-1K62A-1R
 Product Version       : 1.1
 Product Serial        : P1K6BCG31LB0726

FRU Device Description : PSU 2 (ID 61)
 Product Manufacturer  : SUPERMICRO
 Product Name          : PWS-1K62A-1R
 Product Part Number   : PWS-1K62A-1R
 Product Version       : 1.1
 Product Serial        : P1K6BCG31LB0725

I went away and came back and for the second time I booted with this:

root@bstn004p1:~# opal-gard list
 ID       | Error    | Type       | Path
---------------------------------------------------------
 00000001 | 90000005 | Fatal      | /Sys0/Node0/DIMM1
 00000002 | 9000000a | Fatal      | /Sys0/Node0/DIMM7
 00000003 | 9000000f | Fatal      | /Sys0/Node0/DIMM3
=========================================================

and I eventually booted, but I have also seen it not boot in that configuration.
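To compare gard records across boots, the table printed by `opal-gard list` can be turned into structured records. The following is a minimal sketch, assuming the column layout shown in the output above; the function names `parse_gard_list` and `garded_dimms` are illustrative and not part of any OpenPOWER tool.

```python
import re

# Sample copied from the `opal-gard list` output above.
SAMPLE = """\
 ID       | Error    | Type       | Path
---------------------------------------------------------
 00000001 | 90000005 | Fatal      | /Sys0/Node0/DIMM1
 00000002 | 9000000a | Fatal      | /Sys0/Node0/DIMM7
 00000003 | 9000000f | Fatal      | /Sys0/Node0/DIMM3
=========================================================
"""

# One data row: 8 hex digits | 8 hex digits | type word | target path.
# The header and separator lines do not match and are skipped.
ROW = re.compile(
    r"^\s*([0-9a-fA-F]{8})\s*\|\s*([0-9a-fA-F]{8})\s*\|\s*(\w+)\s*\|\s*(\S+)\s*$"
)

def parse_gard_list(text):
    """Return one dict per gard record found in the table text."""
    records = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            gard_id, error, gard_type, path = m.groups()
            records.append(
                {"id": gard_id, "error": error, "type": gard_type, "path": path}
            )
    return records

def garded_dimms(records):
    """Return the sorted DIMM names (last path component) that are garded."""
    return sorted(r["path"].rsplit("/", 1)[-1] for r in records if "/DIMM" in r["path"])

records = parse_gard_list(SAMPLE)
print(garded_dimms(records))  # ['DIMM1', 'DIMM3', 'DIMM7']
```

Running the same parser on the output of a later boot and diffing the two DIMM lists shows which records are new.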

-- Stewart Smith OPAL Architect, IBM.

ghost commented 5 years ago

To check if it failed on Witherspoon I'd need a Witherspoon to test it on - we only have one.

-- Stewart Smith OPAL Architect, IBM.

ghost commented 5 years ago

root@bstn004p1:~# opal-gard list
 ID       | Error    | Type       | Path
---------------------------------------------------------
 00000001 | 90000014 | Fatal      | /Sys0/Node0/DIMM1
 00000002 | 90000019 | Fatal      | /Sys0/Node0/DIMM7
 00000003 | 9000001e | Predictive | /Sys0/Node0/DIMM5
 00000004 | 90000025 | Predictive | /Sys0/Node0/DIMM3
=========================================================

is the next thing that worked.

This morning, all of my memory worked fine. At least I have 64GB of RAM now?

Also, I get:

[   16.165513] Memory failure: 0x0: reserved kernel page still referenced by 1 users
[   16.166347] Memory failure: 0x0: recovery action for reserved kernel page: Failed
[   17.268941] Memory failure: 0x0: already hardware poisoned
[   17.269908] Memory failure: 0x1: reserved kernel page still referenced by 1 users
[   17.270776] Memory failure: 0x1: recovery action for reserved kernel page: Failed

once booted to an OS, so it looks like something is pretty dire.

ghost commented 5 years ago

Here's the opal-prd logs of the successful boot: opal-prd.txt

ghost commented 5 years ago

Nov 19 23:07:14 bstn004p1 opal-prd[2899]: MEM: Failed to offline memory! page addr: 0000000000000400 type: 1: Device or resource busy
Nov 19 23:07:24 bstn004p1 opal-prd[2899]: MEM: Memory error: range 0000000000000400-00000007fffff5c0, type: uncorrectable
Nov 19 23:07:24 bstn004p1 opal-prd[2899]: MEM: Failed to offline memory! page addr: 0000000000010400 type: 1: Device or resource busy
-- Reboot --
Nov 20 00:44:13 bstn004p1 opal-prd[2894]: MEM: Memory error: range 0000000000000800-0000000fffffeb40, type: uncorrectable
Nov 20 00:44:13 bstn004p1 opal-prd[2894]: MEM: Failed to offline memory! page addr: 0000000000000800 type: 1: Device or resource busy
Nov 20 00:44:14 bstn004p1 opal-prd[2894]: MEM: Memory error: range 0000000000000800-0000000fffffeb40, type: uncorrectable
Nov 20 00:44:14 bstn004p1 opal-prd[2894]: MEM: Failed to offline memory! page addr: 0000000000010800 type: 1: Device or resource busy

These are the error parts of the log; this looks pretty strange/wrong.
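The reported ranges are easier to judge once converted to sizes. The following is a minimal sketch, assuming the `range` values in the opal-prd lines are byte addresses (the log itself does not say so); the start and end values are copied from the log above.

```python
GIB = 1 << 30

# (start, end) pairs copied from the opal-prd "Memory error: range" lines above.
RANGES = [
    (0x0000000000000400, 0x00000007fffff5c0),
    (0x0000000000000800, 0x0000000fffffeb40),
]

def span_gib(start, end):
    """Size of the range from start to end in GiB (assumes byte addresses)."""
    return (end - start) / GIB

for start, end in RANGES:
    print(f"{start:#x}-{end:#x}: {span_gib(start, end):.2f} GiB")
```

Under that assumption the two ranges span roughly 32 GiB and 64 GiB, far more than a single page, which may be part of why the entries look wrong.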

dcrowell77 commented 5 years ago

This morning, all of my memory worked fine. At least I have 64GB of RAM now?

So now it boots with no failures?

dcrowell77 commented 5 years ago

Do you have a last known good level?

ghost commented 5 years ago

It booted in that config, but I wouldn't dare try rebooting as I wouldn't expect it to boot twice in a row. There were errors and guarded out DIMMs though.

https://github.com/open-power/sbe/issues/13 breaks compatibility, so going back is not easy. Luckily someone fixed the cronus setup overnight, so hopefully I can go back and try something.

ghost commented 5 years ago

If I go back to hostboot and hcode (there's a codependency) prior to op-build 6acfb3e3e7ebe06708cb03d83d06d46ed1ab2cfb (that is, hw102318a.930 hcode and hostboot fecb93f473161bae5bded405aaca525c78f80a22), then it works and I get all my memory.

Attempting to bisect it down.

ghost commented 5 years ago

git bisect start
# bad: [9d418f5eefe35bd533928cff03822943dcb7852e] Add missing mutex in LPC error path
git bisect bad 9d418f5eefe35bd533928cff03822943dcb7852e
# good: [876b79aacd9b14f4c3561e954daa0285747c9662] Fix for SBE_P9_XIP_CUSTOMIZE_UNSUCCESSFUL during ipl with one EX
git bisect good 876b79aacd9b14f4c3561e954daa0285747c9662
# good: [fecb93f473161bae5bded405aaca525c78f80a22] Fix Centaur workaround in p9c_mss_row_repair
git bisect good fecb93f473161bae5bded405aaca525c78f80a22
# good: [ad52fe4087a24997857752c2526807021c14ef5f] PM: Fixed handling of CME LFIR mask during PM complex reset.
git bisect good ad52fe4087a24997857752c2526807021c14ef5f
# good: [0e15017d11ea9ff2dd705119feeb9ed73ed405dc] Add exp_i2c_scom driver that will be consumed by HB/SBE platforms
git bisect good 0e15017d11ea9ff2dd705119feeb9ed73ed405dc
# bad: [8351efdb3b65ed4fc5472e78efd5db315663e42f] Inband MMIO access to OCMB (skeleton)
git bisect bad 8351efdb3b65ed4fc5472e78efd5db315663e42f
# bad: [e68587e470a3fe50465de722a9d74db1937f5ab3] Support flag parameter for addBusCallout
git bisect bad e68587e470a3fe50465de722a9d74db1937f5ab3

and looking there, I'm now trying just to revert this commit:

commit 40a34c94a981ebfe9e1ff95263663cda0cbaaa42
Author: Stephen Glancy <sglancy@us.ibm.com>
Date:   Mon Oct 22 21:33:27 2018 -0500

    Fixes LRDIMM eff_config bugs

    Change-Id: I74dd2332bda79ab9578d450ba74322fd953b1f46
    Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/67863
    Tested-by: Jenkins Server <pfd-jenkins+hostboot@us.ibm.com>
    Reviewed-by: Louis Stermole <stermole@us.ibm.com>
    Reviewed-by: STEPHEN GLANCY <sglancy@us.ibm.com>
    Dev-Ready: STEPHEN GLANCY <sglancy@us.ibm.com>
    Tested-by: HWSV CI <hwsv-ci+hostboot@us.ibm.com>
    Reviewed-by: ANDRE A. MARIN <aamarin@us.ibm.com>
    Tested-by: Hostboot CI <hostboot-ci+hostboot@us.ibm.com>
    Reviewed-by: Jennifer A. Stofer <stofer@us.ibm.com>
    Reviewed-on: http://rchgit01.rchland.ibm.com/gerrit1/68244
    Tested-by: Jenkins OP Build CI <op-jenkins+hostboot@us.ibm.com>
    Tested-by: FSP CI Jenkins <fsp-CI-jenkins+hostboot@us.ibm.com>
    Reviewed-by: Daniel M. Crowell <dcrowell@us.ibm.com>

and trying hostboot master with this one commit reverted: things work fine!
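The bisect log above narrows the range by repeatedly testing a commit between the last known-good and first known-bad one. The following is a simplified sketch of that idea on a linear history; real `git bisect` works on the commit graph, and the commit labels and the `boots_badly` test here are hypothetical stand-ins for flashing a build and attempting an IPL.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the failure persists once introduced.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # failure is at mid or earlier
        else:
            lo = mid  # failure was introduced after mid
    return commits[hi]

# Hypothetical linear history; the labels are placeholders, not real commits.
history = ["good-base", "c1", "c2", "culprit", "c4", "c5", "bad-tip"]
bad_from = history.index("culprit")
tested = []

def boots_badly(commit):
    """Hypothetical test: record the commit and report whether it fails."""
    tested.append(commit)
    return history.index(commit) >= bad_from

print(first_bad(history, boots_badly))  # culprit
print(tested)                           # ['culprit', 'c1', 'c2']
```

Since the failure in this thread was intermittent (the same configuration sometimes booted), a single boot per step could mislabel a commit as good; repeating the test before marking a commit good is one way to reduce that risk.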

ghost commented 5 years ago

The FRUs for memory on a Boston where it does not occur:

FRU Device Description : P1-DIMMA1 (ID 12)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7214

FRU Device Description : P1-DIMMA2 (ID 13)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7216

FRU Device Description : P1-DIMMB1 (ID 14)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee72ff

FRU Device Description : P1-DIMMB2 (ID 15)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee6ec7

FRU Device Description : P1-DIMMC1 (ID 16)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7300

FRU Device Description : P1-DIMMC2 (ID 17)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee94ed

FRU Device Description : P1-DIMMD1 (ID 18)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7304

FRU Device Description : P1-DIMMD2 (ID 19)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7215

FRU Device Description : P2-DIMMA1 (ID 20)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee8e98

FRU Device Description : P2-DIMMA2 (ID 21)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee6a93

FRU Device Description : P2-DIMMB1 (ID 22)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7301

FRU Device Description : P2-DIMMB2 (ID 23)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7302

FRU Device Description : P2-DIMMC1 (ID 24)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee8ea0

FRU Device Description : P2-DIMMC2 (ID 25)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee9128

FRU Device Description : P2-DIMMD1 (ID 26)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee8e30

FRU Device Description : P2-DIMMD2 (ID 27)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 32GiB 64-bit ECC RDIMM
 Product Part Number   : M393A4K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34ee7211

and another one where it does:

FRU Device Description : P1-DIMMA1 (ID 12)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3446d454

FRU Device Description : P1-DIMMA2 (ID 13)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3447045f

FRU Device Description : P1-DIMMB1 (ID 14)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470ada

FRU Device Description : P1-DIMMB2 (ID 15)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3446e9f6

FRU Device Description : P1-DIMMC1 (ID 16)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3447045a

FRU Device Description : P1-DIMMC2 (ID 17)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470210

FRU Device Description : P1-DIMMD1 (ID 18)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 344709a4

FRU Device Description : P1-DIMMD2 (ID 19)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470245

FRU Device Description : P2-DIMMA1 (ID 20)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470212

FRU Device Description : P2-DIMMA2 (ID 21)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470a41

FRU Device Description : P2-DIMMB1 (ID 22)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470244

FRU Device Description : P2-DIMMB2 (ID 23)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 344704d7

FRU Device Description : P2-DIMMC1 (ID 24)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3447059f

FRU Device Description : P2-DIMMC2 (ID 25)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 34470533

FRU Device Description : P2-DIMMD1 (ID 26)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3447011e

FRU Device Description : P2-DIMMD2 (ID 27)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD    
 Product Version       : 00
 Product Serial        : 3446fe73

The commonality between systems where we fail seems to be the 16GB DIMMs.
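A comparison like this can be automated by grouping the DIMM entries of each `ipmitool fru print` dump by part number. The following is a minimal sketch, assuming the `key : value` layout shown in the inventories above; `parse_fru` and `dimm_part_counts` are illustrative names, and the sample is the first two DIMM entries from the failing system.

```python
from collections import Counter

# First two DIMM entries from the failing system's inventory above.
SAMPLE = """\
FRU Device Description : P1-DIMMA1 (ID 12)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3446d454

FRU Device Description : P1-DIMMA2 (ID 13)
 Product Manufacturer  : Samsung Electronics
 Product Name          : DDR4-2666 16GiB 64-bit ECC RDIMM
 Product Part Number   : M393A2K40BB2-CTD
 Product Version       : 00
 Product Serial        : 3447045f
"""

def parse_fru(text):
    """Split FRU inventory text into one dict per FRU device."""
    devices = []
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # blank lines and lines without a "key : value" pair
        key, value = key.strip(), value.strip()
        if key == "FRU Device Description":
            devices.append({key: value})  # start of a new device block
        elif devices:
            devices[-1][key] = value
    return devices

def dimm_part_counts(devices):
    """Count populated DIMMs per (part number, product name)."""
    counts = Counter()
    for dev in devices:
        if "DIMM" in dev["FRU Device Description"] and "Product Part Number" in dev:
            counts[(dev["Product Part Number"], dev.get("Product Name", "?"))] += 1
    return counts

print(dimm_part_counts(parse_fru(SAMPLE)))
```

Entries that only report `Unknown FRU header version` carry no part number and are left out of the counts.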

stermole commented 5 years ago

We're currently working to recreate this issue on one of our lab systems.