intel / ipmctl

Firmware update stuck #183

Open dbshch opened 2 years ago

dbshch commented 2 years ago

I ran into some severe problems when using DCPMM, so I'm trying to update the firmware of the PMs. The update has been stuck for 4 hours and still hasn't finished. This looks like issue #130, but I believe I'm using the latest ipmctl.

Server: Huawei 2288h v5
BIOS: the latest, v799
Firmware: from 01.02.00.5355 to 5417
OS: Ubuntu 21.04
ipmctl: from the Ubuntu repo, I think it is 02.00.00.3852. The "ipmctl version" command is also stuck now.

The firmware was downloaded from Huawei's support site. It was released on the same date as the v799 BIOS, so I think the BIOS should work with this firmware.

Every ipmctl command now hangs as well (ndctl commands still work). Even "ipmctl version -v" only prints logs like this, repeating every second:

NVM_DBG_LOGGER NVDIMM-VERB:Exiting Dimm.c::FwCmdIdDimm(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering NvmDimmConfig.c::SetFisTransportAttributes()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting NvmDimmConfig.c::SetFisTransportAttributes(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering Dimm.c::PopulateDimmBsrAndBootStatusBitmask()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Dimm.c::FwCmdGetBsr()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::OpenNvmDimmProtocol()
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::GetDriverHandle()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::GetDriverHandle(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering Utility.c::CheckConfigProtocolVersion()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::CheckConfigProtocolVersion(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Exiting Utility.c::OpenNvmDimmProtocol(): 0x0
NVM_DBG_LOGGER NVDIMM-VERB:Entering NvmDimmConfig.c::GetFisTransportAttributes()
NVM_DBG_LOGGER NVDIMM-VERB:Exiting NvmDimmConfig.c::GetFisTransportAttributes(): 0x0
NVM_DBG_LOGGER NVDIMM-DBG:Dimm.c::PassThru:7337: Calling 0xfd:0x3 over ddrt sp on DCPMM 0x101

Can I interrupt/kill the updating process and try updating with ndctl?

Separately, as mentioned above, I have run into some severe problems when using DCPMM, and I'm not sure whether they are related to this issue. The server reports "System memory MRC fatal error detected". Over the last 2 days, data written to the PM (mounted as fsdax) has been lost after the server restarts. I have also seen some very strange performance behavior, but I'm not sure whether it is caused by the server and PM problems or is expected.

dbshch commented 2 years ago

The process finished after 5 hours for 4x128G PMs. Is this normal behavior?

nolanhergert commented 2 years ago

You're right, it looks exactly like that other issue.

Yeah, see if ndctl has the same behavior. It's using a slightly different pathway, so it might perform a lot better.

I think your particular BIOS implementation is running our payload transactions in SMM for some reason, which is subject to throttling. 5 hours is about right, if you assume one second per 64 bytes for a ~300KB firmware image. If you force it to use the large payload mailbox, you'll still have that 1 second penalty per transaction, but you'll get a lot more data through per second and it should complete much faster.

ipmctl load -ddrt -lpmb -v -source <fw.bin> -dimm

Let me know what you find out. We didn't default to this behavior because our reference BIOS didn't throttle ddrt small payload transactions, so they completed a few seconds faster than using large payload.
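
As a rough check of the 5-hour figure, here is a minimal sketch of the arithmetic in Python. It assumes "~300KB" means 300 KiB, a fixed 1 second per 64-byte transfer, and that the four modules are updated one after another. The sequential update and the function name `estimate_update_seconds` are illustrative assumptions, not something taken from ipmctl.

```python
import math


def estimate_update_seconds(image_bytes, payload_bytes,
                            seconds_per_transfer=1.0, dimm_count=1):
    """Estimate update time when every transfer pays a fixed delay.

    Hypothetical helper: assumes DIMMs are updated sequentially.
    """
    transfers_per_dimm = math.ceil(image_bytes / payload_bytes)
    return transfers_per_dimm * seconds_per_transfer * dimm_count


IMAGE_BYTES = 300 * 1024      # "~300KB" from the thread, taken as 300 KiB
SMALL_PAYLOAD_BYTES = 64      # small payload size quoted in the thread

per_dimm = estimate_update_seconds(IMAGE_BYTES, SMALL_PAYLOAD_BYTES)
total = estimate_update_seconds(IMAGE_BYTES, SMALL_PAYLOAD_BYTES, dimm_count=4)

print(f"per DIMM: {per_dimm / 3600:.2f} h, 4 DIMMs: {total / 3600:.2f} h")
# prints "per DIMM: 1.33 h, 4 DIMMs: 5.33 h"
```

Under these assumptions, the total lands close to the 5 hours reported for 4 modules. Because the delay is per transfer, passing a larger `payload_bytes` shrinks the estimate proportionally, which is why the large payload mailbox should finish much sooner.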

Maybe start a discussion with Huawei and ask them to check their implementation relative to Intel's reference BIOS in this regard?

As to your other questions, @sscargal might have some better insight.

sscargal commented 2 years ago

@dbshch I agree with Nolan that opening a support ticket with Huawei is the correct next step. We can't provide ODM/OEM support through this GitHub channel.

I would start by understanding and resolving the MRC issues first, to rule out hardware problems. Then you can look at the performance issue(s). There are dedicated Intel Optane Persistent Memory support channels that can provide more general support than this ipmctl-specific GitHub community, though OEM/ODM-specific issues need to be addressed directly with the vendor.