sscargal / pmemchk

MIT License

[Optane] Module Shuffle - Identify PMem modules installed in the wrong slot(s) #162

Open sscargal opened 2 years ago

sscargal commented 2 years ago

See also #115

If a motherboard is replaced, the PMem and DDR modules have to be removed and reinstalled. If the PMem modules are not returned to their original sockets and DIMM slots, the regions may report an 'Error' state, e.g.:

# ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0000   | 0x0000000000000000 | AppDirect            | 2024.000 GiB | 2024.000 GiB | Error
 0x0000   | 0x0000000000000000 | AppDirect            | 2024.000 GiB | 2024.000 GiB | Error
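A minimal Python sketch of how this table could be checked programmatically. It assumes the pipe-delimited text layout shown above; `parse_region_table` and `regions_in_error` are hypothetical helper names, not part of pmemchk or ipmctl.

```python
# Sketch only: parses the text table printed by `ipmctl show -region`.
# Assumes the default pipe-delimited layout shown in this issue.

SAMPLE = """\
# ipmctl show -region
 SocketID | ISetID             | PersistentMemoryType | Capacity     | FreeCapacity | HealthState
==================================================================================================
 0x0000   | 0x0000000000000000 | AppDirect            | 2024.000 GiB | 2024.000 GiB | Error
 0x0000   | 0x0000000000000000 | AppDirect            | 2024.000 GiB | 2024.000 GiB | Error
"""


def parse_region_table(text):
    """Return one dict per region row, keyed by the column headers."""
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips the shell prompt line and the ==== separator
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def regions_in_error(rows):
    """Return the rows whose HealthState is anything other than 'Healthy'."""
    return [row for row in rows if row.get("HealthState") != "Healthy"]


if __name__ == "__main__":
    for region in regions_in_error(parse_region_table(SAMPLE)):
        print(region["SocketID"], region["ISetID"], region["HealthState"])
```

In the sample above, both regions also report an all-zero ISetID, which may be a further hint that the interleave set could not be assembled.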

pmemchk can currently detect this situation and will report the following:

=======================================================================
Starting analysis of the data
=======================================================================
[ PASSED   ] linux_check_full_file_systems : All file systems are <75% full.
[ PASSED   ] linux_check_memmap : No 'memmap' entries found in the Linux boot command line (/proc/cmdline)
[ PASSED   ] optane_check_dimm_fwupdatestatus : All PMem modules have no staged Firmware Update
[ PASSED   ] optane_check_cpu_is_supported : CPU Model '8358' supports Intel Optane Persistent Memory Modules.
[ PASSED   ] optane_check_dimm_extended_adr_enabled : All PMem modules have ExtendedAdrEnabled switch disabled
[ PASSED   ] optane_check_media_temperature_injection_enabled : All PMem modules have thier MediaTemperature Injection Disabled
[ PASSED   ] optane_check_dimm_media_temperature_injections_count : All PMem modules have 0 Media Temperature Injection Counter
[ INFO     ] optane_check_dimm_ppc_extended_adr_enabled : One or more PMem modules have PpcExtendedAdrEnabled=False (Disabled)
[ PASSED   ] optane_check_dimm_software_triggers_enabled_details : All PMem modules have software trigger disabled
[ PASSED   ] optane_check_dimm_ait_dram_enabled : One or more PMem modules have AitDramEnabled=False (Disabled)
[ PASSED   ] optane_check_dimm_arsstatus : All PMem module ARSStatus is Completed
[ PASSED   ] optane_check_dimm_boot_status : All PMem modules have successful BootStatus
[ PASSED   ] optane_check_dimm_capacity : All PMem Modules have the same capacity.
  [ INFO     ] optane_check_dimm_configurationstatus : Multiple (1) ConfigurationStatus DIMM states found!
  [ INFO     ] optane_check_dimm_configurationstatus : Failed - Broken interleave = 0x0001,0x0011,0x0101,0x0111,0x0201,0x0211,0x0301,0x0311,0x1001,0x1011,0x1101,0x1111,0x1201,0x1211,0x1301,0x1311
[ FAILED   ] optane_check_dimm_configurationstatus : One or more PMem modules reported ConfigurationStatus != 'Valid'
[ PASSED   ] optane_check_dimm_error_injection_enabled : All PMem modules have error injection disabled
[ PASSED   ] optane_check_dimm_fw_error_count : All PMem Modules have less than 10 FWError Counts
[ PASSED   ] optane_check_dimm_firmware_version : All PMem Modules have the same firmware.
[ PASSED   ] optane_check_dimm_health_status : All PMem Modules are Healthy
[ PASSED   ] optane_check_dimm_lockstate : All PMem Module Security LockState is good
[ PASSED   ] optane_check_dimm_manageabilitystate : One or more PMem modules reported ManageabilityState as Unmanageable
[ INFO     ] optane_check_dimm_memoryboostfeature : Memory Bandwidth Boot Feature is Disabled
  [ INFO     ] optane_check_dimm_overwrite_status : 1 Overwrite Status found:
  [ INFO     ] optane_check_dimm_overwrite_status : Unknown = 0x0001,0x0011,0x0101,0x0111,0x0201,0x0211,0x0301,0x0311,0x1001,0x1011,0x1101,0x1111,0x1201,0x1211,0x1301,0x1311
[ PASSED   ] optane_check_dimm_overwrite_status : All PMem modules have Overwrite Status == '0
[ PASSED   ] optane_check_dimm_packagesparesavailable : All PMem modules have 1 PackageSparesAvailable
[ PASSED   ] optane_check_dimm_packagesparingcapable : All PMem modules have PackageSparingCapable = True
[ PASSED   ] optane_check_dimm_packagesparingenabled : All PMem modules have PackageSparingEnabled = True
[ PASSED   ] optane_check_dimm_partnumber : All PMem modules have the same Part Number
[ PASSED   ] optane_check_dimm_percentage_remaining : All PMem Modules have 100% life remaining
[ PASSED   ] optane_check_dimm_poision_error_injection_and_clear_counter : All PMem modules have equal PosionErrorInjectionCounter and PoisionErrorCounter
[ PASSED   ] optane_check_dimm_population : Detected 16 PMem modules. Population looks good.
[ PASSED   ] optane_check_dimm_reservedcapacity : No PMem modules have ReservedCapacity configured
[ PASSED   ] optane_check_dimm_show_goal : There are no pending goals
[ PASSED   ] optane_check_dimm_skuviolation : No PMem modules reported a SKU Violation
[ PASSED   ] optane_check_dimm_software_trigger_counter : All PMem modules have Software Trigger Counter == 0
[ PASSED   ] optane_check_dimm_thermalthrottlelosspercent : All PMem modules have zero ThermalThrottleLossPercent events
[ PASSED   ] optane_check_dimm_viral_policy : All PMem modules have Viral Policy disabled
[ PASSED   ] optane_check_dimm_viral_state : All PMem modules are not ViralState == 0
[ INFO     ] optane_check_masterpassphraseenabled : Master Passphrase is Disabled
[ PASSED   ] optane_check_region_capacity : All Regions have the same Capacity
[ PASSED   ] optane_check_region_freecapacity : All Regions have the same FreeCapacity
  [ INFO     ] optane_check_region_health : Multiple (1) Region Health states found!
  [ INFO     ] optane_check_region_health : Error = 0x0000,0x0000
[ FAILED   ] optane_check_region_health : One or more regions reported a non-Healthy status. See previous errors.
[ PASSED   ] optane_check_region_persistentmemorytype : All Regions have the same PersistentMemoryType configuration
=======================================================================
Data analysis completed
=======================================================================

Specifically, the ConfigurationStatus check reports 'Failed - Broken interleave' for all 16 modules:

  [ INFO     ] optane_check_dimm_configurationstatus : Multiple (1) ConfigurationStatus DIMM states found!
  [ INFO     ] optane_check_dimm_configurationstatus : Failed - Broken interleave = 0x0001,0x0011,0x0101,0x0111,0x0201,0x0211,0x0301,0x0311,0x1001,0x1011,0x1101,0x1111,0x1201,0x1211,0x1301,0x1311

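Using the Label Storage Area (LSA) and Platform Configuration Data (PCD), we should be able to work out which slot each module originally occupied and piece the interleave sets back together.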
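As a first step toward that, the DimmIDs from the 'Broken interleave' line can be decoded into their current physical locations. The sketch below is illustrative only: it assumes each DimmID follows the ACPI NFIT device handle layout (socket, memory controller, channel, and slot in successive 4-bit fields), which should be verified against `ipmctl show -topology` on the affected system. The function names are hypothetical, and reading the LSA/PCD to recover the original positions is not shown.

```python
# Sketch only: decode DimmIDs into socket / iMC / channel / slot.
# ASSUMPTION: DimmID uses the NFIT device handle nibble layout
# (bits 15:12 socket, 11:8 memory controller, 7:4 channel, 3:0 slot).
from collections import defaultdict

LINE = (
    "  [ INFO     ] optane_check_dimm_configurationstatus : "
    "Failed - Broken interleave = 0x0001,0x0011,0x0101,0x0111,"
    "0x0201,0x0211,0x0301,0x0311,0x1001,0x1011,0x1101,0x1111,"
    "0x1201,0x1211,0x1301,0x1311"
)


def parse_dimm_ids(line):
    """Extract the comma-separated DimmIDs that follow the last '='."""
    _, _, id_list = line.rpartition("=")
    return [token.strip() for token in id_list.split(",") if token.strip()]


def decode_dimm_id(dimm_id):
    """Split a DimmID such as '0x1311' into its assumed location fields."""
    value = int(dimm_id, 16)
    return {
        "socket": (value >> 12) & 0xF,
        "imc": (value >> 8) & 0xF,
        "channel": (value >> 4) & 0xF,
        "slot": value & 0xF,
    }


def group_by_socket(dimm_ids):
    """Group DimmIDs by the socket they are currently installed in."""
    groups = defaultdict(list)
    for dimm_id in dimm_ids:
        groups[decode_dimm_id(dimm_id)["socket"]].append(dimm_id)
    return dict(groups)


if __name__ == "__main__":
    for socket, ids in sorted(group_by_socket(parse_dimm_ids(LINE)).items()):
        print(f"socket {socket}: {', '.join(ids)}")
```

Comparing this current layout with the interleave information recorded in each module's PCD would then, in principle, show which modules have moved.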