intel / ipmctl

BSD 3-Clause "New" or "Revised" License

Unable to use PMem? #189

Closed. Haroldll closed this issue 2 years ago

Haroldll commented 2 years ago

I have installed an Intel PMem module in my server, but it cannot be used.

```
ipmctl show -dimm

 DimmID | Capacity    | LockState | HealthState    | FWVersion
 0x1101 | 126.422 GiB | Disabled  | Non-functional | 01.02.00.5355
```

HealthState is Non-functional.

```
ipmctl show -memoryresources

 MemoryType   | DDR         | PMemModule  | Total
 Volatile     | 175.750 GiB | 0.000 GiB   | 175.750 GiB
 AppDirect    | -           | 0.000 GiB   | 0.000 GiB
 Cache        | 0.000 GiB   | -           | 0.000 GiB
 Inaccessible | 0.250 GiB   | 126.422 GiB | 126.672 GiB
 Physical     | 176.000 GiB | 126.422 GiB | 302.422 GiB
```

```
ipmctl show -topology

 DimmID | MemoryType                  | Capacity   | PhysicalID | DeviceLocator
 0x1101 | Logical Non-Volatile Device | 0.000 GiB  | 0x003c     | DIMM20
 N/A    | DDR4                        | 16.000 GiB | 0x0026     | DIMM1
 N/A    | DDR4                        | 16.000 GiB | 0x0028     | DIMM3
 N/A    | DDR4                        | 16.000 GiB | 0x002a     | DIMM5
 N/A    | DDR4                        | 16.000 GiB | 0x002d     | DIMM7
 N/A    | DDR4                        | 16.000 GiB | 0x002f     | DIMM9
 N/A    | DDR4                        | 16.000 GiB | 0x0031     | DIMM11
 N/A    | DDR4                        | 16.000 GiB | 0x0034     | DIMM13
 N/A    | DDR4                        | 16.000 GiB | 0x0036     | DIMM15
 N/A    | DDR4                        | 16.000 GiB | 0x0038     | DIMM17
 N/A    | DDR4                        | 0.000 GiB  | 0x003b     | DIMM19
 N/A    | DDR4                        | 16.000 GiB | 0x003d     | DIMM21
 N/A    | DDR4                        | 16.000 GiB | 0x003f     | DIMM23
```

```
ipmctl show -system -capabilities

PlatformConfigSupported=1
Alignment=1.000 GiB
AllowedVolatileMode=Memory Mode
CurrentVolatileMode=1LM
AllowedAppDirectMode=App Direct
```

But I can't create a goal; the following command fails:

```
ipmctl create -goal PersistentMemoryType=AppDirect

No functional PMem modules in the system.
```
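If more detail would help, I can also collect output along these lines (just a sketch; assuming the installed ipmctl and ndctl accept these options):

```bash
# Show all attributes of the affected module, including ManageabilityState
ipmctl show -a -dimm 0x1101

# Show what the kernel NFIT/libnvdimm stack sees, including idle devices and health
ndctl list -DHi
```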

sscargal commented 2 years ago

@Haroldll

Unrelated to your question: you have a failed DDR4 module (DIMM19), which explains why you have 176GB of memory (11x16GB) instead of 192GB (12x16GB). You'll want to replace that DDR module at your earliest convenience.
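If you want to confirm the failed slot from the OS, something along these lines should show what SMBIOS reports for that locator (the DIMM19 label is taken from your topology output; adjust the grep context lines as needed):

```bash
# Dump SMBIOS type 17 (Memory Device) entries and show the one for DIMM19
dmidecode -t 17 | grep -B 8 -A 12 "Locator: DIMM19"
```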

It's plausible that the combination of the failed DDR module and having only one PMem module put you into an unsupported configuration, which would fail memory training during POST and thus leave the PMem module "Non-functional." Check your motherboard vendor's documentation to see which memory population configurations are supported. If this is a two-socket system, each socket should be populated with the same number of PMem and DDR modules; an unbalanced population may or may not be supported by the BIOS.

I want to collect some data from your system to see if we can get a better explanation for the root cause.

  1. Download pmemchk. You can either `git clone https://github.com/sscargal/pmemchk` or go to the project page, click the green 'Code' button, and select 'Download ZIP'.
  2. Run the collector only: `./pmemchk -C | tee pmemchk.out`
  3. Compress the resulting output directory with `tar -czf pmemchk.tar.gz <directory>`
  4. Attach the pmemchk.out and tar.gz files to this issue (the full sequence is sketched after this list).
  5. If you encounter any issues, please let me know. This tool is still in early development, but it is intended to help with situations such as this.
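Putting those steps together, a minimal sketch of the collection sequence (assuming a git clone into the current directory; replace `<directory>` with whatever output directory pmemchk creates):

```bash
# Get the collector (or download and extract the ZIP instead)
git clone https://github.com/sscargal/pmemchk
cd pmemchk

# Run the collector only, keeping a copy of the console output
./pmemchk -C | tee pmemchk.out

# Compress the output directory that pmemchk creates
tar -czf pmemchk.tar.gz <directory>
```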

Thank you

StevenPontsler commented 2 years ago

It looks like there is likely a problem at a lower level.

What OS are you using?

Is the DIMM an Intel Optane Persistent Memory 100 or 200 series module? If it is a 200 series, you will need to upgrade to a 2.x version of ipmctl.

Does the DIMM show a healthy health state in the BIOS screens?
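Regarding the first question, a quick sketch for checking the ipmctl release and the module generation (if I remember correctly, 100 series part numbers start with NMA and 200 series with NMB):

```bash
# ipmctl release in use
ipmctl version

# Part numbers of the installed memory devices (PMem modules report an Intel part number)
dmidecode -t 17 | grep -E "Locator:|Part Number:"
```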

Haroldll commented 2 years ago

@sscargal When I run ./pmemchk -C | tee pmemchk.out, only pmemchk.out is generated; no output directory appears, and the pmemchk.out file is empty.

Regarding the bad memory problem: I have another server that is a little different from this one. See:

```
ipmctl show -dimm

 DimmID | Capacity    | LockState | HealthState    | FWVersion
 0x1101 | 126.422 GiB | Disabled  | Non-functional | 01.02.00.5355
```

```
ipmctl show -memoryresources

 MemoryType   | DDR         | PMemModule  | Total
 Volatile     | 191.000 GiB | 0.000 GiB   | 191.000 GiB
 AppDirect    | -           | 0.000 GiB   | 0.000 GiB
 Cache        | 0.000 GiB   | -           | 0.000 GiB
 Inaccessible | 1.000 GiB   | 126.422 GiB | 127.422 GiB
 Physical     | 192.000 GiB | 126.422 GiB | 318.422 GiB
```

```
ipmctl show -topology

 DimmID | MemoryType | Capacity   | PhysicalID | DeviceLocator
 N/A    | DDR4       | 32.000 GiB | 0x0026     | DIMM1
 N/A    | DDR4       | 32.000 GiB | 0x0028     | DIMM3
 N/A    | DDR4       | 32.000 GiB | 0x0029     | DIMM4
 N/A    | DDR4       | 32.000 GiB | 0x0030     | DIMM9
 N/A    | DDR4       | 32.000 GiB | 0x0032     | DIMM11
 N/A    | DDR4       | 32.000 GiB | 0x0033     | DIMM12
```

The new PMem module could not be found here.

```
ipmctl show -system -capabilities

PlatformConfigSupported=1
Alignment=1.000 GiB
AllowedVolatileMode=Memory Mode
CurrentVolatileMode=1LM
AllowedAppDirectMode=App Direct
```

```
egrep -i "zone_device|hugepage|nfit|nvdimm|pmem|nd|btt|dax|memory_hotplug" /boot/config-$(uname -r)

CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_HAS_ZONE_DEVICE=y
CONFIG_ZONE_DEVICE=y
CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=m
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ACPI_NFIT=m
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_ND_BLK=m
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=m
CONFIG_BTT=y
CONFIG_ND_PFN=m
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_NVDIMM_KEYS=y
CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_PMEM_COMPAT=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_ARCH_HAS_PMEM_API=y
```
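For completeness, whether the relevant drivers are actually loaded can be checked with something like:

```bash
# List the NVDIMM/PMem-related kernel modules that are currently loaded
lsmod | grep -E "nfit|nvdimm|nd_pmem|nd_btt|dax"
```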

Haroldll commented 2 years ago

@StevenPontsler My OS is CentOS 8.2, kernel 4.18.0-193.14.2.el8_2.x86_64. The module is an Intel Optane Persistent Memory series 100, part number NMA1XXD128GPSU4. Installed packages: ipmctl-02.00.00.3885-1.el8.x86_64 and ndctl-71.1-2.el8.x86_64.

BMC: (screenshot attached) BIOS: (screenshots attached)

sscargal commented 2 years ago

Thanks for the update. I'll investigate the pmemchk issue on CentOS as it works fine on Fedora.

These are the descriptions of the Health and Manageability states:

HealthState
(Default) Overall PMem module health. One of:

- Healthy
- Noncritical: Maintenance may be required.
- Critical: Features or performance are degraded due to failure.
- Fatal: Critical internal state failure (DPA Failure, Internal Buffer Failure, AIT Failure, etc.) is non-recoverable and data loss has occurred or is imminent. In this case, the firmware will disable the media, and access to user data and operations that require use of the media will fail.
- Non-functional: The PMem module is detected and manageable, though some commands and capabilities may be limited. The PMem module has limited communication or another error preventing complete functionality. Common causes include:
  - DDRT memory interface training failure
  - Expected region mapping to SPA range unable to be found
- Unmanageable: The PMem module has an incompatible firmware API version or hardware revision, or is unresponsive (possibly due to a communication interface failure or a firmware/hardware error).

...

ManageabilityState
Ability of the PMem module host software to manage the PMem module. Manageability is determined by the interface format code, the vendor identifier, the device identifier, and the firmware API version. One of:

- Manageable: The PMem module is manageable by the software.
- Unmanageable: The PMem module is not supported by this version of the software.

Q) What server Manufacturer and Model are you using? Please provide dmidecode output.

Q) Are these newly purchased PMem modules? What's the history of these modules? Did you move them from a previous working system?

I recommend updating the BIOS if there's one available for your server/motherboard. The BIOS version (2.20.1276) may be incompatible with the PMem Firmware (01.02.00.5355) such that the PMem Firmware is too new or too old. OEM server vendors commonly distribute the BIOS and PMem firmware together to ensure compatibility.
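If it's useful as a reference point, the BIOS version and release date currently reported to the OS can be read with dmidecode's standard string keywords:

```bash
# Query the SMBIOS BIOS information fields directly
dmidecode -s bios-vendor
dmidecode -s bios-version
dmidecode -s bios-release-date
```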

Haroldll commented 2 years ago

Q) What server Manufacturer and Model are you using? Please provide dmidecode

A) The Manufacturer is ZTE Corporation. The models are ZXCLOUD R5300 G4 and ZXCLOUD R5500 G4.

R5500 (dmidecode):

```
dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 3.2.1 present.
SMBIOS implementations newer than version 3.2.0 are not fully supported by this version of dmidecode.
Table at 0x000E9690.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: ZTE
        Version: 03.15.0100_70562
        Release Date: 03/04/2020
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 32 MB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                Print screen service is supported (int 5h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 3.15

Handle 0x0036, DMI type 17, 40 bytes
Memory Device
        Array Handle: 0x0034
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: No Module Installed
        Form Factor: DIMM
        Set: None
        Locator: DIMM14
        Bank Locator: Cpu1_Channel3_Dimm1
        Type: Logical non-volatile device
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MT/s
        Manufacturer: Intel
        Serial Number: 0000192C
        Asset Tag: DIMM14_AssetTag
        Part Number:
        Rank: 1
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
```

R5300 (dmidecode):

```
Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
        Vendor: ZTE
        Version: 03.19.0100_8717837
        Release Date: 04/09/2021
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 32 MB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                Print screen service is supported (int 5h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 3.19
```

Q) Are these newly purchased PMem modules? What's the history of these modules? Did you move them from a previous working system?

A) The PMem modules are newly purchased; the servers are more than a year old.

Also, my server CPUs are Intel(R) Xeon(R) Silver 4114 and Intel(R) Xeon(R) Silver 4110. Is PMem perhaps not supported by these CPUs?

Regarding the broken memory module from my first post, I noticed a pattern: the failure appears when there is another memory module next to the newly inserted PMem module. If there is no other module in that memory bank and only the PMem module is inserted, no memory corruption occurs, but in that case the newly inserted module cannot be seen with ipmctl show -topology.

One more question: only one PMem module is installed in each server. Is this supported?

sscargal commented 2 years ago

> Also, my server CPUs are Intel(R) Xeon(R) Silver 4114 and Intel(R) Xeon(R) Silver 4110. Is PMem perhaps not supported by these CPUs?

Correct. Those are Skylake CPUs, which are not supported with Intel Optane Persistent Memory. You need Cascade Lake (the x2xx series Xeon CPUs) to use Intel Optane 100 Series modules. This link takes you to the Advanced Search feature of ark.intel.com, where you can find all the CPUs that support PMem. You can add your own filters to find the right CPU based on the number of cores, TDP, maximum memory, etc. The only Silver CPU that supports the PMem 100 Series is the Intel® Xeon® Silver 4215 Processor.
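To double-check the CPU generation from the OS, something like this works (Cascade Lake Xeon Scalable parts have a "2" as the second digit of the model number, e.g. 4215, while Skylake parts such as the 4110 and 4114 have a "1"):

```bash
# Show the installed CPU model name
lscpu | grep "Model name"
```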

Skylake and Cascade Lake share the same Purley platform (Motherboard design), but only Cascade Lake has all the logic to support PMem along with a supporting BIOS release.

This would certainly explain the issue.

Haroldll commented 2 years ago

Thanks, it's resolved.