intel / ipmctl

BSD 3-Clause "New" or "Revised" License

ipmctl show -dimm segfaults on bad SMBIOS table #181

Open kjacque opened 2 years ago

kjacque commented 2 years ago

Version: Intel(R) Optane(TM) Persistent Memory Command Line Interface Version 02.00.00.3885
OS: CentOS Linux release 8.4.2105

While trying to debug a problem with some faulty DIMMs, I hit a segfault when running ipmctl show -dimm. It reproduces every time, at least in the state this machine is currently in.

[root@wolf-143 ~]# ipmctl show -dimm
Segmentation fault (core dumped)

Running in verbose mode, I see the following errors immediately before the segfault:

NVM_DBG_LOGGER NVDIMM-WARN:NvmDimmConfig.c::FillSmbiosInfo:592: Failed to retrieve the device locator from SMBIOS table (0x0)

NVM_DBG_LOGGER NVDIMM-WARN:NvmDimmConfig.c::FillSmbiosInfo:599: Failed to retrieve the bank locator from SMBIOS table (0x0)

NVM_DBG_LOGGER NVDIMM-WARN:NvmDimmConfig.c::FillSmbiosInfo:606: Failed to retrieve the manufacturer string from SMBIOS table (0x0)

Here's the call stack pulled from gdb:

Starting program: /usr/bin/ipmctl show -dimm
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7951e92 in FillSmbiosInfo () from /lib64/libipmctl.so.4
(gdb) bt
#0  0x00007ffff7951e92 in FillSmbiosInfo () from /lib64/libipmctl.so.4
#1  0x00007ffff79524f5 in GetDimmInfo () from /lib64/libipmctl.so.4
#2  0x00007ffff7953989 in GetDimm () from /lib64/libipmctl.so.4
#3  0x00007ffff789eb30 in GetAllDimmList () from /lib64/libipmctl.so.4
#4  0x00007ffff78b8004 in ShowDimms () from /lib64/libipmctl.so.4
#5  0x00007ffff789e513 in ExecuteCmd () from /lib64/libipmctl.so.4
#6  0x00007ffff78ae1f6 in UefiMain () from /lib64/libipmctl.so.4
#7  0x00007ffff79722f9 in nvm_run_cli () from /lib64/libipmctl.so.4
#8  0x00007ffff707d493 in __libc_start_main () from /lib64/libc.so.6
#9  0x000055555555480e in _start ()

StevenPontsler commented 2 years ago

Can you post the verbose output? Or email it to me directly (my address is in my profile).

If you know which DIMMs are present, can you try the command for each individual DIMM? For example: ipmctl show -dimm 0x0001
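
The per-DIMM check suggested above could be scripted roughly as follows. This is only a sketch: the helper name `probe_dimms` and the ID list in the usage comment are hypothetical, and it assumes a Linux host where a process killed by SIGSEGV is reported by Python's `subprocess` as a negative return code.

```python
import signal
import subprocess


def probe_dimms(dimm_ids, base_cmd=("ipmctl", "show", "-dimm")):
    """Run base_cmd once per DIMM ID and classify how each run ended."""
    results = {}
    for dimm_id in dimm_ids:
        proc = subprocess.run(
            [*base_cmd, dimm_id],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if proc.returncode == -signal.SIGSEGV:
            # On POSIX, a negative return code means "killed by that signal".
            results[dimm_id] = "segfault"
        elif proc.returncode == 0:
            results[dimm_id] = "ok"
        else:
            results[dimm_id] = f"exit {proc.returncode}"
    return results


# Hypothetical usage; substitute the DIMM IDs actually present:
# print(probe_dimms(["0x0001", "0x0101"]))
```

Running each DIMM in its own process means one crashing module does not hide the results for the others.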

kjacque commented 2 years ago

Alas, our datacenter power-cycled the machine, and the DIMM came back up normally. I didn't manage to capture the full output beforehand.

I did save some output from the ipmctl start -diagnostic command though, if that can help you track down the source of the segfault:

[root@wolf-143 ~]# ipmctl start -diagnostic

--Test = Quick
   State = Warning
   --SubTest = Manageability
      State = Warning
      Message.1 = The quick health check detected that PMem module 0x0101 is not manageable because firmware API version N/A is not supported.
   --SubTest = Boot status
      State = Ok
   --SubTest = Health
      State = Ok

--Test = Config
   State = Failed
   --SubTest = PMem module specs
      State = Failed
      Message.1 = The platform configuration check detected that PMem module with physical ID 0x0101 is present in the system but failed to initialize.
   --SubTest = Duplicate PMem module
      State = Ok
   --SubTest = System Capability
      State = Ok
   --SubTest = Namespace LSA
      State = Ok
   --SubTest = PCD
      State = Ok

--Test = Security
   State = Ok
   Message = The security check succeeded.
   --SubTest = Encryption status
      State = Ok
   --SubTest = Inconsistency
      State = Ok

--Test = FW
   State = Ok
   Message = The firmware consistency and settings check succeeded.
   --SubTest = FW Consistency
      State = Ok
   --SubTest = Viral Policy
      State = Ok
   --SubTest = Threshold check
      State = Ok
   --SubTest = System Time
      State = Ok
[root@wolf-143 ~]#

When I was poking around in dmidecode, it looked like the SMBIOS tables weren't showing much of anything for physical ID 0x0101.
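
For context on why near-empty SMBIOS data can trip up a caller: in the SMBIOS layout, fields such as the device locator, bank locator, and manufacturer hold a 1-based index into a set of NUL-terminated strings that follows the structure, and index 0 means "no string". The sketch below shows one way to guard such a lookup. It is not ipmctl's code and not the fix for this issue (the thread does not state the root cause); the function names and the "N/A" default are assumptions for illustration.

```python
def smbios_string(strings_area: bytes, index: int):
    """Return the 1-based string `index` from an SMBIOS string-set, or None.

    Index 0 means "no string"; an index past the end of the set also
    yields None instead of reading beyond the table.
    """
    if index == 0:
        return None
    # The string-set is a run of NUL-terminated strings ended by an extra NUL.
    end = strings_area.find(b"\x00\x00")
    body = strings_area if end == -1 else strings_area[:end]
    strings = body.split(b"\x00") if body else []
    if index > len(strings):
        return None
    return strings[index - 1].decode("ascii", errors="replace")


def locator_or_default(strings_area: bytes, index: int, default: str = "N/A"):
    """Hypothetical wrapper: substitute a placeholder when the lookup fails."""
    value = smbios_string(strings_area, index)
    return value if value is not None else default
```

The point of the wrapper is that a failed lookup yields a placeholder value the caller can print, not a missing pointer that is later dereferenced.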

StevenPontsler commented 2 years ago

Thanks. Glad to hear that the DIMM came back up in a good state.

We will try to figure out the cause of the segfault.

nolanhergert commented 2 years ago

Found the root cause, thanks @kjacque!

Can you collect the nlogs and provide them to @sscargal? They might provide additional information on why the PMem module failed.

ipmctl dump -destination out -dict <dict.txt> -dimm

The dict.txt should be in the firmware image distributed to customers. If you have any questions, please talk with @sscargal.

kjacque commented 2 years ago

@nolanhergert I've given up access to the DIMMs but I'll forward the information to our datacenter sysadmins in case they see something similar again. Thanks for the quick response. :)

StevenPontsler commented 2 years ago

@kjacque -- is there more we can do or can the issue be closed?

kjacque commented 2 years ago

Okay to close.