openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Replace snap-01 failing memory #1109

Firefishy closed this issue 2 months ago

Firefishy commented 4 months ago

snap-01 has a failing DIMM that is throwing corrected ECC errors.

CPU_SrcID#1_MC#1_Chan#0_DIMM#0

I think this is the one marked with **** in this hardware linkage table:

memory stick 'P1-DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
memory stick 'P1-DIMMA2' is located at 'P0_Node0_Channel0_Dimm1'
memory stick 'P1-DIMMB1' is located at 'P0_Node0_Channel1_Dimm0'
memory stick 'P1-DIMMB2' is located at 'P0_Node0_Channel1_Dimm1'
memory stick 'P1-DIMMC1' is located at 'P0_Node0_Channel2_Dimm0'
memory stick 'P1-DIMMC2' is located at 'P0_Node0_Channel2_Dimm1'

memory stick 'P1-DIMMD1' is located at 'P0_Node1_Channel0_Dimm0'
memory stick 'P1-DIMMD2' is located at 'P0_Node1_Channel0_Dimm1'
memory stick 'P1-DIMME1' is located at 'P0_Node1_Channel1_Dimm0'
memory stick 'P1-DIMME2' is located at 'P0_Node1_Channel1_Dimm1'
memory stick 'P1-DIMMF1' is located at 'P0_Node1_Channel2_Dimm0'
memory stick 'P1-DIMMF2' is located at 'P0_Node1_Channel2_Dimm1'

memory stick 'P2-DIMMA1' is located at 'P1_Node0_Channel0_Dimm0'
memory stick 'P2-DIMMA2' is located at 'P1_Node0_Channel0_Dimm1'
memory stick 'P2-DIMMB1' is located at 'P1_Node0_Channel1_Dimm0'
memory stick 'P2-DIMMB2' is located at 'P1_Node0_Channel1_Dimm1'
memory stick 'P2-DIMMC1' is located at 'P1_Node0_Channel2_Dimm0'
memory stick 'P2-DIMMC2' is located at 'P1_Node0_Channel2_Dimm1'

memory stick 'P2-DIMMD1' is located at 'P1_Node1_Channel0_Dimm0' ****
memory stick 'P2-DIMMD2' is located at 'P1_Node1_Channel0_Dimm1'
memory stick 'P2-DIMME1' is located at 'P1_Node1_Channel1_Dimm0'
memory stick 'P2-DIMME2' is located at 'P1_Node1_Channel1_Dimm1'
memory stick 'P2-DIMMF1' is located at 'P1_Node1_Channel2_Dimm0'
memory stick 'P2-DIMMF2' is located at 'P1_Node1_Channel2_Dimm1'
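
The lookup above can be sketched in a few lines of Python. This is a minimal sketch, not a tool used by the Ops team: it assumes, following the tentative reading above, that `SrcID` maps to the `P` index, `MC` to `Node`, `Chan` to `Channel` and `DIMM` to `Dimm`, and that the slot letters A to F run across the two nodes in the order the table lists them. The function name `edac_to_slot` is hypothetical.

```python
import re

# Assumed layout of the EDAC label, e.g. "CPU_SrcID#1_MC#1_Chan#0_DIMM#0".
EDAC_LABEL = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")

CHANNELS_PER_NODE = 3    # Channel0..Channel2 per node in the linkage table
SLOT_LETTERS = "ABCDEF"  # two nodes x three channels per socket


def edac_to_slot(label: str) -> tuple[str, str]:
    """Return (bank_locator, slot_name) for an EDAC DIMM label.

    Hypothetical helper: the field mapping is an assumption taken from
    the linkage table, not from vendor documentation.
    """
    match = EDAC_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised EDAC label: {label!r}")
    socket, node, channel, dimm = (int(group) for group in match.groups())

    letter_index = node * CHANNELS_PER_NODE + channel
    if channel >= CHANNELS_PER_NODE or letter_index >= len(SLOT_LETTERS):
        raise ValueError(f"label outside the known table: {label!r}")

    bank_locator = f"P{socket}_Node{node}_Channel{channel}_Dimm{dimm}"
    # Slot names are 1-based for both the socket and the DIMM position.
    slot_name = f"P{socket + 1}-DIMM{SLOT_LETTERS[letter_index]}{dimm + 1}"
    return bank_locator, slot_name


if __name__ == "__main__":
    print(edac_to_slot("CPU_SrcID#1_MC#1_Chan#0_DIMM#0"))
```

Under these assumptions the reported label resolves to the row marked in the table.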

DMI lists the memory as:

Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MT/s
        Manufacturer: Micron Technology
        Serial Number: F0E34EF7
        Asset Tag: P2-DIMMD1_AssetTag (date:20/01)
        Part Number: 36ASF4G72PZ-2G6E1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: 0000
        Module Manufacturer ID: Bank 1, Hex 0x2C
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 32 GB
        Cache Size: None
        Logical Size: None
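
For anyone repeating this lookup, here is a minimal sketch of how the DMI record for a given bank locator could be pulled out of saved `dmidecode` type 17 output. The function names and the shortened `SAMPLE` text are illustrative assumptions; the parser only relies on the "Handle" / "Memory Device" / "Key: Value" layout visible in the record above.

```python
from typing import Optional

# Shortened excerpt of the record above, used only for illustration.
SAMPLE = """\
Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Size: 32 GB
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Part Number: 36ASF4G72PZ-2G6E1
        Serial Number: F0E34EF7
"""


def parse_memory_devices(dmi_text: str) -> list[dict[str, str]]:
    """Split DMI type 17 text into one dict of fields per Memory Device."""
    devices: list[dict[str, str]] = []
    current: Optional[dict[str, str]] = None
    for line in dmi_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Handle "):
            current = None  # a new record starts; wait for its type line
        elif stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ": " in stripped:
            key, value = stripped.split(": ", 1)
            current[key] = value
    return devices


def find_by_bank_locator(
    devices: list[dict[str, str]], bank_locator: str
) -> Optional[dict[str, str]]:
    """Return the first device whose Bank Locator matches, else None."""
    for device in devices:
        if device.get("Bank Locator") == bank_locator:
            return device
    return None


if __name__ == "__main__":
    dimm = find_by_bank_locator(
        parse_memory_devices(SAMPLE), "P1_Node1_Channel0_Dimm0"
    )
    if dimm is not None:
        print(dimm["Locator"], dimm["Part Number"], dimm["Serial Number"])
```
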

Firefishy commented 4 months ago

I have ordered 2x replacement DIMMs. They should arrive in Catford shortly.

Firefishy commented 4 months ago

As soon as practical, I would like to reboot the system to ensure that ADDDC is enabled in the BIOS.

After a successful boot with ADDDC enabled, I would then like to upgrade the BIOS from revision 3.2 to the latest revision, 4.2. snap-02 has already been upgraded.

Firefishy commented 4 months ago

I don't want to jinx it, but it looks like the memory errors have stopped for now. Note to reader: these were corrected ECC errors, not uncorrected ECC errors.

We scheduled a 1-hour maintenance window today, during which I performed the following:

All options above were first tested on the twin snap-02.
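
To keep an eye on the corrected versus uncorrected distinction noted above, the Linux EDAC counters can be read from sysfs. The sketch below is an assumption about how one might check them, not the monitoring actually used on snap-01; it reads the standard `ce_count` and `ue_count` files for each memory controller, and the function names are hypothetical.

```python
from pathlib import Path


def read_ecc_counts(
    edac_root: str = "/sys/devices/system/edac/mc",
) -> dict[str, dict[str, int]]:
    """Return corrected/uncorrected ECC error counts per memory controller."""
    counts: dict[str, dict[str, int]] = {}
    for mc_dir in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            counts[mc_dir.name] = {
                "corrected": int((mc_dir / "ce_count").read_text().strip()),
                "uncorrected": int((mc_dir / "ue_count").read_text().strip()),
            }
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
    return counts


def has_uncorrected_errors(counts: dict[str, dict[str, int]]) -> bool:
    """True if any memory controller reports an uncorrected ECC error."""
    return any(entry["uncorrected"] > 0 for entry in counts.values())


if __name__ == "__main__":
    current = read_ecc_counts()
    for name, entry in current.items():
        print(name, entry)
    print("uncorrected errors present:", has_uncorrected_errors(current))
```

On a host without EDAC loaded, `read_ecc_counts()` simply returns an empty dict.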

Firefishy commented 4 months ago

We discussed the RAM replacement at the 11 July 2024 Ops call. We will aim to replace the memory in the server in the next 3 months. The server is no longer throwing errors, so the replacement is not an urgent priority.

Firefishy commented 4 months ago

In the event the RAM starts throwing errors again, we will treat it as urgent.

Firefishy commented 4 months ago

2x DIMMs are in-stock @ Catford.

Unfortunately it is not possible to tell which revision is installed in snap-01. The stock is 2 different revisions.

Firefishy commented 4 months ago

I've been able to identify the full RAM model and revision, 36ASF4G72PZ-2G6E1QG, from photos. Unfortunately neither of the DIMMs I ordered is an exact match.

Exact match: https://www.ebay.nl/itm/155164317853

Firefishy commented 3 months ago

I have ordered the exact memory module. It will arrive in Catford in a few days.

Firefishy commented 3 months ago

Matching memory module has arrived in Catford stock.

Firefishy commented 2 months ago

Memory ready and maintenance window scheduled for today: https://community.openstreetmap.org/t/openstreetmap-maintenance-26-september-2024/118989

Firefishy commented 2 months ago

Memory replaced.