intel / ipmctl

BSD 3-Clause "New" or "Revised" License

Can Ipmctl show LSS and AIT DRAM Status? #33

Open insanecoderr opened 6 years ago

insanecoderr commented 6 years ago

Some questions for ipmctl:

1. Can ipmctl show LSS and AIT DRAM Status? As part of the SMART health info, LSS and AIT DRAM Status are important parameters. However, ipmctl seems unable to show this info. Will it be available in a newer version, or did I miss it?

2. Can ipmctl show detailed SMART health info? Detailed SMART info would help with problem tracking, so I think it would be useful.

3. How do I clear the poison indicator? Once I know the location of the poison indicator, how can I clear it? Does ipmctl support this, or should I use a raw mailbox command?

juston-li commented 6 years ago
  1. You can find LSS details in show -a -dimm under LastShutdownStatus and LastShutdownTime. You can also find AitDramEnabled in the output of the same command. AitDram is also one of the possible HealthStateReasons reported by show -dimm if the DIMM isn't healthy.

  2. Other SMART health info can be found using show -sensors: Health, MediaTemperature, ControllerTemperature, PercentageRemaining, DirtyShutdowns, PowerOnTime, UpTime, PowerCycles, FwErrorCount, UnlatchedDirtyShutdowns.

  3. Should be 'ipmctl set -dimm 0x0001 Clear=1 Poison=[address]'

insanecoderr commented 6 years ago

Thanks for your reply. It works.

1. Moreover, what does the LastShutdownStatus mean? The output is "PM S5, PMIC Power Loss, FW Flush Complete, Write Data Flush Complete, PM Idle". Could you please tell me what each field of the LastShutdownStatus represents? How can I judge whether the shutdown was clean?

2. In 'ipmctl set -dimm 0x0001 Clear=1 Poison=[address]', is [address] the SPA of the NVDIMM? And how can I determine the location of the poison?

3. Can ipmctl report the ADR status, i.e. whether ADR completed or not?

juston-li commented 6 years ago
  1. Those fields are described in the FW interface spec from Intel. They should also be available in the testing branch when I publish the manpages later today.

    Whether or not the shutdown was clean can be found with show -sensors, using the DirtyShutdowns (latched) and UnlatchedDirtyShutdowns counts.

  2. I was thinking in the context of error injection in which you select the address to poison. @stellarhopper Could you help answer this?

  3. In the LSS details there's PM ADR Command Received. I believe that just means the FW was notified about ADR, though. You can again consult the DirtyShutdown count: it would increment if ADR failed.
stellarhopper commented 6 years ago

Regarding 2, you can determine the poison location by running ndctl list -M --namespace=<namespaceX.Y>. This will return something like:

# ndctl list -M --namespace=namespace4.0
[
  {
    "dev":"namespace4.0",
    "mode":"fsdax",
    "map":"dev",
    "size":63508480,
    "uuid":"8f3a0d0d-9869-4a07-82c8-61b4e7c6d79b",
    "blockdev":"pmem4",
    "badblock_count":1,
    "badblocks":[
      {
        "offset":32,
        "length":1
      }
    ]
  }
]

The badblocks section there contains the block offset in the namespace (always in terms of 512B blocks) and the number of blocks. You can multiply those by 512 to get the byte offset and length within the namespace. If you want to use the ipmctl command to clear the errors at this point, you will have to add the namespace byte offset to the namespace base SPA, which can be obtained from the 'resource' attribute in sysfs or the corresponding libndctl API (root only).

Alternatively, you can also clear errors by doing block-sized, block-aligned O_DIRECT writes to the block device. For example:
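The arithmetic described above can be sketched in Python as follows. This is only an illustration: the function name `badblock_to_spa` and the base SPA value are hypothetical, and the real base has to be read from the namespace's 'resource' attribute as root. It also assumes the badblock offset is relative to that base, with no additional data offset applied.

```python
from typing import Tuple

SECTOR_SIZE = 512  # ndctl reports badblocks in 512-byte blocks


def badblock_to_spa(base_spa: int, offset_blocks: int, length_blocks: int) -> Tuple[int, int]:
    """Return (start SPA, length in bytes) for one badblocks entry.

    base_spa is the namespace base SPA (assumed to be already known).
    """
    start = base_spa + offset_blocks * SECTOR_SIZE
    return start, length_blocks * SECTOR_SIZE


# Hypothetical base SPA, for illustration only.
base = 0x100000000
start, nbytes = badblock_to_spa(base, offset_blocks=32, length_blocks=1)
print(hex(start), nbytes)  # 0x100004000 512
```

With the entry from the listing above (offset 32, length 1), the byte offset within the namespace is 32 * 512 = 16384 (0x4000).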

# dd if=/dev/zero of=/dev/pmem4 seek=32 bs=512 count=1 oflag=direct
1+0 records in
1+0 records out
512 bytes copied, 0.00101778 s, 503 kB/s

# ndctl list -M --namespace=namespace4.0
[
  {
    "dev":"namespace4.0",
    "mode":"fsdax",
    "map":"dev",
    "size":63508480,
    "uuid":"8f3a0d0d-9869-4a07-82c8-61b4e7c6d79b",
    "blockdev":"pmem4"
  }
]
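
As a rough sketch of how the two steps above fit together, the following Python snippet parses `ndctl list -M` JSON output and builds the matching `dd` command for each bad block. The helper name `dd_commands` is hypothetical, and the snippet only prints the commands rather than running them, since writing zeros overwrites whatever data was in those blocks.

```python
import json
from typing import List


def dd_commands(ndctl_json: str) -> List[str]:
    """Build one dd command per badblocks entry found in ndctl's JSON output."""
    commands = []
    for ns in json.loads(ndctl_json):
        blockdev = ns.get("blockdev")
        # Namespaces without media errors have no "badblocks" key.
        for bb in ns.get("badblocks", []):
            commands.append(
                f"dd if=/dev/zero of=/dev/{blockdev} "
                f"seek={bb['offset']} bs=512 count={bb['length']} oflag=direct"
            )
    return commands


sample = """
[
  {
    "dev":"namespace4.0",
    "blockdev":"pmem4",
    "badblock_count":1,
    "badblocks":[ { "offset":32, "length":1 } ]
  }
]
"""
for cmd in dd_commands(sample):
    print(cmd)
```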
insanecoderr commented 6 years ago

@juston-li

  1. Can I judge whether the shutdown was clean only according to the "lastshutdownstatus"?

@stellarhopper regarding 2, if I know the SPA of the bad blocks through Cscript, an error injection tool provided by Intel, how many poison indicators will a single ipmctl command clear at a time? As the mailbox command clears 4 poison indicators at a time, I assume ipmctl would act the same.

stellarhopper commented 6 years ago

@insanecoderr I'm not familiar with the Cscript tool, but the ars_cap command includes a clear_err_unit field, which describes how many bytes of poison can be cleared with one command. For clearing through the pmem drivers using 'dd', the clear-error is repeated as needed to clear a full sector.

I expect the ipmctl version does something similar. The clear_err_unit is typically 256B, so assuming a 'poison indicator' is a cache line (64B), 4 units makes sense.
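
To make the numbers concrete, here is a small Python sketch of that arithmetic. The 256B clear_err_unit is the typical value mentioned above and the 64B cache line is the same assumption; the function name is hypothetical, and real hardware may report a different unit through ars_cap.

```python
CACHE_LINE = 64        # assumed size of one 'poison indicator', in bytes
CLEAR_ERR_UNIT = 256   # typical clear_err_unit, in bytes


def clear_commands_needed(length_bytes: int, unit: int = CLEAR_ERR_UNIT) -> int:
    """Number of clear-error commands needed to cover length_bytes (rounded up)."""
    return -(-length_bytes // unit)


print(CLEAR_ERR_UNIT // CACHE_LINE)   # 4 cache lines per clear command
print(clear_commands_needed(512))     # 2 commands for one 512B sector
```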

insanecoderr commented 6 years ago

I will try some experiments later, thanks.

Can I judge whether the shutdown was clean only according to the "lastshutdownstatus"?

juston-li commented 6 years ago

Possibly. It's just a bit complicated, as there are a number of LSS details and combinations that would indicate clean or dirty. We'll look into making it simpler to determine if the shutdown was clean.

For now, using the DirtyShutdown counts and checking for increments is probably easiest.
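
A minimal sketch of that check in Python, assuming the counter values have already been collected per DIMM before and after a power cycle. The function name, the dictionary layout, and the sample values are hypothetical; this does not parse ipmctl output.

```python
from typing import Dict, List


def dimms_with_new_dirty_shutdowns(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return the DIMM IDs whose dirty shutdown count increased."""
    # A DIMM missing from 'before' has no baseline, so it is not flagged.
    return sorted(dimm for dimm, count in after.items() if count > before.get(dimm, count))


# Hypothetical samples taken before and after a power cycle.
before = {"0x0001": 3, "0x0011": 5}
after = {"0x0001": 4, "0x0011": 5}
print(dimms_with_new_dirty_shutdowns(before, after))  # ['0x0001']
```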

insanecoderr commented 6 years ago

got it

insanecoderr commented 6 years ago

@juston-li

Which version of ipmctl supports "show -sensors" with UnlatchedDirtyShutdowns?

juston-li commented 6 years ago

The commit is in https://github.com/intel/ipmctl/commit/6edf6800ec99e2290488bf5dbf1c787d011d1ecb which hasn't been released to master. Likely to be released next week.

You can see in the commit, though, that on older versions DirtyShutdowns was actually reporting UnlatchedDirtyShutdowns, not latched. That commit adds latched and differentiates between the two.

insanecoderr commented 6 years ago

@stellarhopper I failed to clear the poison indicator using the SPA calculated from “ndctl list -M”: “Clear injected poison of address (0x...)on DIMM 0x0001:Error(3) - Command failure”

insanecoderr commented 6 years ago

@stellarhopper The calculation is: 512*offset + namespace_start_SPA. Maybe I didn't get the right namespace_start_SPA. Can you show me how to get it in detail? Thanks.

gldiviney commented 5 years ago

Is this still an issue in the current build?