intel / ipmctl

BSD 3-Clause "New" or "Revised" License

ipmctl-03.00.00.0468 shows AEP memory as Non-functional and fails to upgrade AEP firmware #206

Open jlin127 opened 1 year ago

jlin127 commented 1 year ago

I am trying to use two AEP memory modules on an H3C R4700 G3 platform, but they cannot be used. All pmemchk reports are attached.

ndctl version: 71.1
ipmctl version: 03.00.00.0468

ipmctl show -dimm

DimmID | Capacity    | LockState | HealthState    | FWVersion
0x0001 | 126.742 GiB | Disabled  | Non-functional | 01.00.00.4178
0x0101 | 126.742 GiB | Disabled  | Non-functional | 01.00.00.4178
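When many modules are installed, it can help to filter this table programmatically. The following is a minimal sketch, not part of ipmctl, that parses pipe-delimited output like the table above and lists the modules whose HealthState is not "Healthy". The function names `parse_pipe_table` and `unhealthy_dimms` are hypothetical, and the assumption that a good module reports exactly "Healthy" should be checked against your ipmctl version.

```python
def parse_pipe_table(text):
    """Parse a pipe-delimited table (header row first) into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and rule lines made only of '=' or '-' characters.
        if not line or set(line) <= set("=-"):
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_dimms(text):
    """Return the DimmID of every row whose HealthState is not 'Healthy' (assumed good value)."""
    return [row["DimmID"] for row in parse_pipe_table(text)
            if row["HealthState"] != "Healthy"]


# Sample copied from the 'ipmctl show -dimm' output in this report.
SAMPLE = """\
DimmID | Capacity | LockState | HealthState | FWVersion
0x0001 | 126.742 GiB | Disabled | Non-functional | 01.00.00.4178
0x0101 | 126.742 GiB | Disabled | Non-functional | 01.00.00.4178
"""

if __name__ == "__main__":
    print(unhealthy_dimms(SAMPLE))
```

In practice the text would come from the captured output of `ipmctl show -dimm` rather than an embedded sample.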

ipmctl show -topology

DimmID | MemoryType                  | Capacity   | PhysicalID | DeviceLocator
0x0001 | Logical Non-Volatile Device | 0.000 GiB  | 0x003c     | CPU0_A1
0x0101 | Logical Non-Volatile Device | 0.000 GiB  | 0x0045     | CPU0_D1
N/A    | DDR4                        | 0.000 GiB  | 0x003b     | CPU0_A0
N/A    | DDR4                        | 16.000 GiB | 0x003e     | CPU0_B0
N/A    | DDR4                        | 16.000 GiB | 0x0041     | CPU0_C0
N/A    | DDR4                        | 0.000 GiB  | 0x0044     | CPU0_D0
N/A    | DDR4                        | 16.000 GiB | 0x0047     | CPU0_E0
N/A    | DDR4                        | 16.000 GiB | 0x004a     | CPU0_F0

ipmctl show -dimm -sensor

DimmID | Type                        | CurrentValue
0x0001 | Health                      | Fatal failure
0x0001 | MediaTemperature            | 43C
0x0001 | ControllerTemperature       | 46C
0x0001 | PercentageRemaining         | 100%
0x0001 | LatchedDirtyShutdownCount   | 1
0x0001 | PowerOnTime                 | 844073s
0x0001 | UpTime                      | 422022s
0x0001 | PowerCycles                 | 113
0x0001 | FwErrorCount                | 0
0x0001 | UnlatchedDirtyShutdownCount | 42
0x0101 | Health                      | Fatal failure
0x0101 | MediaTemperature            | 47C
0x0101 | ControllerTemperature       | 48C
0x0101 | PercentageRemaining         | 100%
0x0101 | LatchedDirtyShutdownCount   | 0
0x0101 | PowerOnTime                 | 439924s
0x0101 | UpTime                      | 422025s
0x0101 | PowerCycles                 | 32
0x0101 | FwErrorCount                | 0
0x0101 | UnlatchedDirtyShutdownCount | 7

Then I tried to upgrade the AEP firmware. That also failed.

ipmctl load -source ./fw_ekvb0_1.2.0.5446_rel.bin -dimm

Starting update on 2 PMem module(s)...

Load FW failed: Error 2 - Command not run

Attachment: pmemchk-log.zip

sscargal commented 1 year ago

Q) Where did the Optane modules come from?
Q) Are the Optane modules new or used? If they were installed in another system, we may need to factory reset them.

The first task is to resolve the "Non-functional" status. This means:

           ·   Non-functional: The PMem module is detected and manageable, though some commands
               and capabilities may be limited. The PMem module has limited communication or
               another error preventing complete functionality. Common causes include:

               ·   DDRT memory interface training failure

               ·   Expected region mapping to SPA range unable to be found

I'll break down the investigation into sub-sections.

Optane Modules

Please ensure the Optane modules were sourced from a known supplier or otherwise originate from a good source. If not, restoring their functionality could be challenging.

If the modules were originally installed in another host, or hosts, a factory reset may be required.

CPU Support

The Intel Optane Modules are Apache Pass (Optane 100). These modules are supported only by the Intel Xeon Cascade Lake (aka 2nd Generation Intel® Xeon® Scalable Processors). I can't determine what CPUs you have in this system as the BIOS is returning an unknown ID:

From dmidecode

Processor Information
        Socket Designation: Processor 2
        Type: Central Processor
        Family: <OUT OF SPEC>
        Manufacturer: Not Specified
        ID: 00 00 00 00 00 00 00 00
        Version: Not Specified
        Voltage: Unknown
        External Clock: Unknown
        Max Speed: 4000 MHz
        Current Speed: Unknown
        Status: Unpopulated
        Upgrade: Socket LGA3647-1
        L1 Cache Handle: Not Provided
        L2 Cache Handle: Not Provided
        L3 Cache Handle: Not Provided
        Serial Number: Not Specified
        Asset Tag: Not Specified
        Part Number: Not Specified
        Characteristics: None

The Socket is "LGA3647" which supports Skylake, Cascade Lake, and Xeon Phi. If you have a Skylake CPU, it will not support Optane, which would cause the issue.

lscpu can only show what the BIOS presents

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          48
On-line CPU(s) list:             0-47
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      Genuine Intel(R) CPU 0000%@
BIOS Model name:                 Genuine Intel(R) CPU 0000%@
CPU family:                      6
Model:                           85
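Family 6, model 85 alone does not settle the question, because Skylake-SP and Cascade Lake share that model number and differ only in stepping. The sketch below shows one way to classify the stepping once it is known (for example from /proc/cpuinfo). The function name `classify_model_85` is hypothetical, and the stepping ranges are an assumption based on commonly published listings (4 for Skylake-SP, 5 to 7 for Cascade Lake, 10 and 11 for Cooper Lake); verify them against Intel's documentation before relying on the result.

```python
def classify_model_85(stepping):
    """Guess the Xeon Scalable generation for CPU family 6, model 85.

    The stepping ranges below are assumptions taken from commonly
    published listings, not from this thread.
    """
    if stepping == 4:
        return "Skylake-SP"
    if 5 <= stepping <= 7:
        return "Cascade Lake"
    if stepping in (10, 11):
        return "Cooper Lake"
    return "unknown"


if __name__ == "__main__":
    for stepping in (4, 5, 7, 11):
        print(stepping, classify_model_85(stepping))
```

Under these assumptions, only a result of "Cascade Lake" would indicate a CPU that supports the Optane 100 modules discussed here.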

ipmctl

The Optane 100 modules were intended to be used with ipmctl version 1.x.x, Optane 200 with ipmctl version 2.x.x, and Optane 300 with ipmctl version 3.x.x. You are using 3.x.x, which may not fully support the 1st generation modules. Try ipmctl version 1.x.x to see if this improves your manageability of the modules.
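As a small illustration of the pairing described above, this sketch extracts the major version from an ipmctl version string and looks up the module generation it was intended for. The mapping is taken only from the statement in this comment, and the names `GENERATION_BY_MAJOR`, `ipmctl_major`, and `intended_generation` are hypothetical.

```python
# Mapping as stated in this comment; treat it as a rule of thumb.
GENERATION_BY_MAJOR = {1: "Optane 100", 2: "Optane 200", 3: "Optane 300"}


def ipmctl_major(version):
    """Return the major version from strings like '03.00.00.0468' or 'ipmctl-01.00.00.3072'."""
    number = version.rsplit("-", 1)[-1]  # drop an optional 'ipmctl-' prefix
    return int(number.split(".")[0])


def intended_generation(version):
    """Look up the module generation an ipmctl release line was intended for."""
    return GENERATION_BY_MAJOR.get(ipmctl_major(version), "unknown")


if __name__ == "__main__":
    print(intended_generation("03.00.00.0468"))
    print(intended_generation("ipmctl-01.00.00.3072"))
```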

BIOS Support

Look in the BIOS and you should find one or more sub-menus under 'Advanced -> Memory' that allow you to correctly configure the platform in either Memory Mode or App Direct. Your OEM/BIOS vendor determines the exact location, so please take a look at their BIOS manual. This is orthogonal to ipmctl and should be used as the first approach. You'll find much of the same functionality that ipmctl provides in the BIOS, so I recommend you use this interface.

Some OEMs also distribute ipmctl in the UEFI.

Suggested Next Actions

  1. Identify the CPU model number and generation: Skylake or Cascade Lake.
     - If Skylake: STOP. This CPU does not support Optane PMem.
     - If Cascade Lake: CONTINUE, but try to resolve why the CPU ID isn't working. This is a HW issue, so you'll need to work with your server support team to resolve it.
  2. Use the Optane Memory BIOS options to configure the memory in AppDirect or Memory Mode.
     - If Memory Mode: no further action is required in the Linux OS. The capacity of PMem will be shown. The capacity of DRAM will not.
     - If AppDirect: use ndctl to create the namespaces (devdax, fsdax, sector).

HTH

jlin127 commented 1 year ago

@sscargal Thanks.

Q) Where did the Optane modules come from?
Q) Are the Optane modules new or used? If they were installed in another system, we may need to factory reset them.

CPU Support

The CPU is Cascade Lake; CPU-Z reports the CPUID as 050655h. Only one CPU is installed in the dual-socket motherboard. Does that matter?

dmidecode

Processor Information
        Socket Designation: Processor 1
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel(R) Corporation
        ID: 55 06 05 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 85, Stepping 5
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (FXSAVE and FXSTOR instructions supported)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Multi-threading)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Genuine Intel(R) CPU 0000%@
        Voltage: 1.6 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 2200 MHz
        Status: Populated, Enabled
        Upgrade: Socket LGA3647-1
        L1 Cache Handle: 0x0062
        L2 Cache Handle: 0x0063
        L3 Cache Handle: 0x0064
        Serial Number: Not Specified
        Asset Tag: UNKNOWN
        Part Number: Not Specified
        Core Count: 24
        Core Enabled: 24
        Thread Count: 48
        Characteristics:
                64-bit capable
                Multi-Core
                Hardware Thread
                Execute Protection
                Enhanced Virtualization
                Power/Performance Control

cat /proc/cpuinfo

.....
processor : 47
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Genuine Intel(R) CPU 0000%@
stepping : 5
microcode : 0x3000012
cpu MHz : 2058.056
cache size : 33792 KB
physical id : 0
siblings : 48
core id : 29
cpu cores : 24
apicid : 59
initial apicid : 59
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts pku ospke md_clear flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass mds swapgs taa itlb_multihit mmio_stale_data retbleed
bogomips : 4400.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
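The CPUID value from CPU-Z, the ID bytes from dmidecode, and the family/model/stepping fields from /proc/cpuinfo all describe the same signature. As a cross-check, the sketch below decodes the CPUID leaf 1 EAX value using the standard x86 bit layout and shows how the first four DMI ID bytes map to it. The helper names `decode_cpuid_signature` and `signature_from_dmi_id` are hypothetical, and reading the first four ID bytes as a little-endian EAX is the usual convention rather than something stated in this thread.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into type, family, model, and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    cpu_type = (eax >> 12) & 0x3
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is used only for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"type": cpu_type, "family": family, "model": model, "stepping": stepping}


def signature_from_dmi_id(id_field):
    """Read the first four bytes of a dmidecode 'ID' field as a little-endian EAX."""
    first_four = bytes(int(byte, 16) for byte in id_field.split()[:4])
    return int.from_bytes(first_four, "little")


if __name__ == "__main__":
    eax = signature_from_dmi_id("55 06 05 00 FF FB EB BF")
    print(hex(eax), decode_cpuid_signature(eax))
```

For the values posted here, both 050655h and the dmidecode ID bytes decode to Type 0, Family 6, Model 85, Stepping 5, which matches the Signature line reported by dmidecode.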

ipmctl

I tried ipmctl versions 1.x.x and 2.x.x before; the modules were still in Non-functional status and could not be used. I will try ipmctl-01.00.00.3072 and ipmctl-01.00.00.3547 again.

BIOS Support

The BIOS manual shows that the Optane modules are supported. The file "ipmctlshow-system" in pmemchk-log.zip shows "CurrentVolatileMode=1LM". How can I set it to 2LM? I want to use the modules in Memory Mode, but they are in Non-functional status.

sscargal commented 1 year ago

Thanks for the background info, it helps.

Given the PMem modules were previously installed in another host, this could be one reason for the current problems. The configuration of the Regions and interleaving is stored on the PMem in the Platform Config Data (PCD), and the BIOS tries to reconstruct this during POST. In the pmemchk output, the ipmctl tool is core dumping when trying to collect this information, so I can't see how many modules should be part of the interleave set. If you don't put all the PMem modules back in the same slots with the other members of their interleave set, the memory training will fail and result in the 'Non-functional' state. Given you don't know how many PMem modules were installed per socket, I recommend trying to Factory Reset them.

The "CurrentVolatileMode=1LM" shows in the file "ipmctlshow-system" of the pmemchk-log.zip, and how can I set it to 2LM?

I found 3 BIOS manuals under the "Config and Deploy" for your server documentation. Specifically, the H3C Servers Purley Platform Text-Mode BIOS User Guide has what you need in the "Intel(R) Optane(TM) DC Persistent Memory Configuration submenu" on Page 86.

I see a 'Secure Erase' option in the Security menu (Page 101). You can try that to see if the DIMMs will Factory Reset.

Page 103 shows the 'Regions submenu screen' where you can 'Delete Goal' and 'Create Goal'. You want to create a new goal for 'Memory Mode (2LM)' to switch from the current AppDirect mode. I don't see any staged goals, so the 'Delete Goal' option may not be visible or may not do anything.

StevenPontsler commented 1 year ago

Thanks for the follow up Steve.

I would also think a secure erase/factory reset would be the thing to try.

jlin127 commented 1 year ago

OK, I will reboot into the BIOS to check whether Secure Erase/Factory Reset is possible.

StevenPontsler commented 1 year ago

@jlin127 - Have you been able to get your modules working? If so, please leave a comment describing what fixed your situation and close this thread.

jlin127 commented 1 year ago

@StevenPontsler Sorry, there are some tasks running on the server now, and they will take more than three days to finish. I will report back as soon as possible after trying the Factory Reset.