intel / ipmctl


I can't create provisioning goal #193

Closed zktsd813 closed 2 years ago

zktsd813 commented 2 years ago

With the newest ipmctl version, I tried to provision our Optane modules in AppDirect mode, but it doesn't work.

# ipmctl show -memoryresources
 MemoryType   | DDR         | PMemModule  | Total
 Volatile     | 320.000 GiB | 0.000 GiB   | 320.000 GiB
 AppDirect    | -           | 0.000 GiB   | 0.000 GiB
 Cache        | 0.000 GiB   | -           | 0.000 GiB
 Inaccessible | 0.000 GiB   | 506.969 GiB | 506.969 GiB
 Physical     | 320.000 GiB | 506.969 GiB | 826.969 GiB

# ipmctl create -goal PersistentMemoryType=AppDirect
The following configuration will be applied:
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0010 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0110 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0210 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0310 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
y
Created following region configuration goal
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0010 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0110 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0210 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0310 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
A reboot is required to process new memory allocation goals.

After the reboot, there is no change in the memory resources.

# ipmctl show -system pcat
   CreatorRevision: 0x20091013
   ---TableType=0x0
      Length: 16 bytes
      TypeEquals: PlatformCapabilityInfoTable
      PMemModuleMgmtSWConfigInputSupport: 0x1 (Yes)
      MemoryModeCapabilities: 0x7 (1LM, 2LM, AppDirect)
      CurrentMemoryMode: 0x10
         -Current Volatile Memory Mode: 1LM
         -Allowed Persistent Memory Mode: None
         -Allowed Volatile Memory Mode: 1LM or 2LM
      MaxPMInterleaveSets: 0x28
         -Per CPU Die: 0x8
         -Per PMem module: 0x2

OS: Ubuntu 20.04.2 LTS
CPU: Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz, two-socket system.

It seems that there is no allowed persistent memory mode. How can I fix this?
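For anyone who wants to check for this symptom from a script, here is a minimal sketch that parses the `ipmctl show -memoryresources` table and reports whether all PMem capacity is sitting in the Inaccessible row. The sample text is copied from the output above; the function names `parse_memory_resources` and `pmem_unprovisioned` are made up for illustration and are not part of ipmctl.

```python
# Sample output copied from the report above. On a live system this text
# would come from running `ipmctl show -memoryresources`.
SAMPLE = """\
 MemoryType   | DDR         | PMemModule  | Total
 Volatile     | 320.000 GiB | 0.000 GiB   | 320.000 GiB
 AppDirect    | -           | 0.000 GiB   | 0.000 GiB
 Cache        | 0.000 GiB   | -           | 0.000 GiB
 Inaccessible | 0.000 GiB   | 506.969 GiB | 506.969 GiB
 Physical     | 320.000 GiB | 506.969 GiB | 826.969 GiB
"""


def parse_memory_resources(text):
    """Return {memory_type: {column: GiB as float, or None for '-'}}."""
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [c.strip() for c in lines[0].split("|")]
    table = {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        row = {}
        for name, cell in zip(header[1:], cells[1:]):
            # "-" means the column does not apply to this memory type.
            row[name] = None if cell == "-" else float(cell.split()[0])
        table[cells[0]] = row
    return table


def pmem_unprovisioned(table):
    """True when PMem is physically present but all of it is inaccessible."""
    physical = table["Physical"]["PMemModule"]
    inaccessible = table["Inaccessible"]["PMemModule"]
    return physical > 0 and inaccessible == physical


table = parse_memory_resources(SAMPLE)
print("All PMem inaccessible:", pmem_unprovisioned(table))
```

Running the same check before and after the reboot would show whether the goal was actually applied.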

nolanhergert commented 2 years ago

Interesting, I've never seen that before! Since ipmctl is allowing you to create the goal, the PCAT table value of "None" looks a little funny but is not actually limiting you. It is "AppDirect" on my system.

I would run ipmctl start -diagnostic and potentially ipmctl show -pcd and see what they say about why BIOS is not provisioning those modules.

My guess is that there's a BIOS setting you need to change to allow 1LM provisioning or you don't have the modules in a POR configuration. There might be a knob for the latter located at "Socket Configuration → Memory Configuration → Enforce Population POR", but likely in both cases you'll need to ask your hardware vendor for assistance. Let me know what you find out!
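As a rough illustration of the PCAT observation above, this sketch pulls the key/value lines out of the `ipmctl show -system pcat` text and flags the case where AppDirect is listed as a capability but the allowed persistent memory mode is "None". The sample is an excerpt of the output posted earlier; `pcat_fields` and `appdirect_supported_but_not_allowed` are hypothetical helper names, and the check is only a hint that the BIOS state deserves a look, not a definitive diagnosis.

```python
# Excerpt of the PCAT output posted earlier in this thread.
PCAT = """\
      MemoryModeCapabilities: 0x7 (1LM, 2LM, AppDirect)
      CurrentMemoryMode: 0x10
         -Current Volatile Memory Mode: 1LM
         -Allowed Persistent Memory Mode: None
         -Allowed Volatile Memory Mode: 1LM or 2LM
"""


def pcat_fields(text):
    """Collect 'Key: value' lines into a dict, ignoring leading dashes."""
    fields = {}
    for ln in text.splitlines():
        key, sep, value = ln.strip().lstrip("-").partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def appdirect_supported_but_not_allowed(fields):
    """True when the platform lists AppDirect but allows no persistent mode."""
    supported = "AppDirect" in fields.get("MemoryModeCapabilities", "")
    allowed = fields.get("Allowed Persistent Memory Mode", "None")
    return supported and allowed == "None"


fields = pcat_fields(PCAT)
print("Mismatch:", appdirect_supported_but_not_allowed(fields))
```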

sscargal commented 2 years ago

@zktsd813 What OEM/ODM server are you using?

For a two-socket system, I would expect the PMem modules to be physically installed on both sockets, two on each socket. The output from creating the goal shows all four PMem modules listed on Socket0 only, so, as Nolan alluded to, this could be outside the validated configuration, and the BIOS may therefore refuse to train the memory correctly. If true, you should see an error/message early in POST and/or in the platform manager logs (BMC, iDRAC, iLO, etc).

I also see the PMem is "Inaccessible" which is similar to the issue discussed in #153. There's a recommendation/suggested action in the last note of that issue from spawnflagger. See if that helps.
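The socket observation above can be checked mechanically. This is a small sketch, assuming the goal table format shown in the original post, that groups DimmIDs by SocketID and lists sockets with no PMem modules. The helper names and the `expected_sockets` parameter are illustrative only, and whether a given population is a validated (POR) configuration still has to come from the server vendor.

```python
from collections import defaultdict

# Goal table copied from the `ipmctl create -goal` output in the first post.
GOAL = """\
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0010 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0110 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0210 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
 0x0000   | 0x0310 | 0.000 GiB  | 126.000 GiB    | 0.000 GiB
"""


def dimms_by_socket(text):
    """Map socket number -> list of DimmID strings found in the table."""
    sockets = defaultdict(list)
    for ln in text.splitlines():
        cells = [c.strip() for c in ln.split("|")]
        # Data rows start with a hex SocketID; header and rule lines do not.
        if len(cells) >= 2 and cells[0].startswith("0x"):
            sockets[int(cells[0], 16)].append(cells[1])
    return dict(sockets)


def empty_sockets(sockets, expected_sockets):
    """Return socket numbers (0-based) that have no PMem modules listed."""
    return [s for s in range(expected_sockets) if s not in sockets]


layout = dimms_by_socket(GOAL)
print("Modules per socket:", layout)
print("Sockets without PMem:", empty_sockets(layout, expected_sockets=2))
```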

zktsd813 commented 2 years ago

@nolanhergert Thanks, I have checked our PMem. Our PMem modules pass all the tests:

--Test = Quick
   State = Ok
   Message = The quick health check succeeded.
   --SubTest = Manageability
      State = Ok
   --SubTest = Boot status
      State = Ok
   --SubTest = Health
      State = Ok
      Message.1 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0010. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.2 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0110. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.3 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0210. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.4 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0310. ACPI NFIT NVDIMM State Flags Error Bit 6 Set

--Test = Config
   State = Ok
   Message = The platform configuration check succeeded.
   --SubTest = PMem module specs
      State = Ok
   --SubTest = Duplicate PMem module
      State = Ok
   --SubTest = System Capability
      State = Ok
   --SubTest = Namespace LSA
      State = Ok
   --SubTest = PCD
      State = Ok

--Test = Security
   State = Ok
   Message = The security check succeeded.
   --SubTest = Encryption status
      State = Ok
   --SubTest = Inconsistency
      State = Ok

--Test = FW
   State = Ok 
   Message = The firmware consistency and settings check succeeded.
   --SubTest = FW Consistency
      State = Ok
   --SubTest = Viral Policy
      State = Ok
   --SubTest = Threshold check
      State = Ok
   --SubTest = System Time
      State = Ok

Also, I checked the BIOS, and it shows that the Enforce POR option is enabled.
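Note that every State above is Ok even though the Health subtest carries four messages about unmapped regions, so checking only the State lines would miss them. A minimal sketch, assuming the message wording shown in this output, that extracts the affected module IDs (the regex and the `unmapped_modules` name are illustrative, not an ipmctl interface):

```python
import re

# Excerpt of the `ipmctl start -diagnostic` output posted above.
DIAG = """\
--Test = Quick
   State = Ok
   Message = The quick health check succeeded.
   --SubTest = Health
      State = Ok
      Message.1 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0010. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.2 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0110. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.3 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0210. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
      Message.4 = The quick health check detected that the platform FW did not map a region to SPA on PMem module 0x0310. ACPI NFIT NVDIMM State Flags Error Bit 6 Set
"""

# Matches the wording of the messages above and captures the module ID.
UNMAPPED = re.compile(
    r"did not map a region to SPA on PMem module (0x[0-9A-Fa-f]+)"
)


def unmapped_modules(text):
    """Return the IDs of modules the platform FW did not map to SPA."""
    return UNMAPPED.findall(text)


print("Unmapped modules:", unmapped_modules(DIAG))
```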

nolanhergert commented 2 years ago

I would try disabling the Enforce POR knob if you haven't already and see if that fixes your issue. If not, then maybe you need to enable BIOS logging and see what shows up.

sscargal commented 2 years ago

I agree with Nolan that this is likely to be a BIOS setting. One such setting to check is Advanced -> Memory Configuration -> Volatile Memory Mode = 1LM/2LM/Auto. You want to set this to 'Auto'. If it's currently set to 2LM (Memory Mode), the BIOS will enforce this configuration regardless of what configuration is written to the PMem modules, i.e., what you requested with ipmctl create -goal ...

Volatile Memory Mode
Value: 1LM/2LM/Auto
Help Text: Selects whether 1LM or 2LM memory mode. If 2LM Volatile Memory Mode, BIOS will try to configure 2LM but if BIOS is unable to configure 2LM, volatile memory mode will fall back to 1LM. 1LM+2LM will enable the 'DDR Cache' option. When 1LM + 2LM option is selected, the UEFI FW will use the DDR Cache Size option to determine the DDR Cache Side for each populated channel. Any remaining DDR will be mapped as 1LM memory. 

You could reset the BIOS to factory defaults, which should allow the BIOS to read and implement the goal configuration written to the PMem modules.

Would you mind running my pmemchk tool to see if it detects anything? At a minimum, it'll collect some data we can use to help troubleshoot. Though it does not collect BIOS information.

$ git clone https://github.com/sscargal/pmemchk
$ cd pmemchk
$ sudo ./pmemchk

This will collect data and analyze it. The collected data will be written to a new directory, and the analyzer will print PASS | FAIL | INFO messages to STDOUT. An example is in the README. If you encounter issues or errors, please report them.

You'll need to tar.gz the output directory and attach it to this issue, please. Other than 'messages', there should be no user-identifiable data collected.
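If it helps, the archiving step can be done with Python's standard `tarfile` module. This is only a sketch: `archive_output` is a made-up helper, and the directory name in the usage comment is a placeholder, since the actual name of the directory pmemchk creates is not given here.

```python
import tarfile
from pathlib import Path


def archive_output(out_dir):
    """Create <out_dir>.tar.gz next to out_dir and return the archive path."""
    out_dir = Path(out_dir)
    archive = out_dir.with_name(out_dir.name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps only the directory name inside the archive,
        # not the full path on the local machine.
        tar.add(out_dir, arcname=out_dir.name)
    return archive


# Usage (placeholder path; substitute the directory pmemchk actually created):
# archive_output("./pmemchk_output_dir")
```

It is worth reviewing the 'messages' file before attaching the archive, since that is the one file noted above as possibly containing identifiable data.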

zktsd813 commented 2 years ago

Thank you all. I fixed this issue by resetting the BIOS settings to factory defaults.