Open insanecoderr opened 6 years ago
You can find LSS details in show -a -dimm under LastShutdownStatus and LastShutdownTime. You can also find AitDramEnabled in the same command. AitDram is also one of the possible HealthStateReasons reported in show -dimm if the DIMM isn't healthy.
Other SMART health info can be found using show -sensors: Health, MediaTemperature, ControllerTemperature, PercentageRemaining, DirtyShutdowns, PowerOnTime, UpTime, PowerCycles, FwErrorCount, UnlatchedDirtyShutdowns.
Should be 'ipmctl set -dimm 0x0001 Clear=1 Poison=[address]'
Thanks for your reply. It works.
1. Moreover, what does the LastShutdownStatus mean? The output is "PM S5, PMIC Power Loss, FW Flush Complete, Write Data Flush Complete, PM Idle". Could you please tell me what each field of LastShutdownStatus represents? How can I judge whether the shutdown was clean?
2. In 'ipmctl set -dimm 0x0001 Clear=1 Poison=[address]', is [address] the SPA of the NVDIMM? And how can I determine the location of the poison?
3. Can ipmctl report the ADR status, i.e. whether ADR completed or not?
Those fields are described in the FW interface spec from Intel. They should also be available in the testing branch when I publish the manpages later today.
Whether or not the shutdown was clean can be found with show -sensors, using the DirtyShutdowns (latched) and UnlatchedDirtyShutdowns counts.
I was thinking in the context of error injection in which you select the address to poison. @stellarhopper Could you help answer this?
3. How do I clear the poison indicator? Knowing the location of the poison indicator, how can I clear it? Does ipmctl support this, or should I use a raw mailbox command?
Regarding 2, you can determine the poison location by running:
ndctl list -M --namespace=<namespaceX.Y>
This will return something like:
# ndctl list -M --namespace=namespace4.0
[
{
"dev":"namespace4.0",
"mode":"fsdax",
"map":"dev",
"size":63508480,
"uuid":"8f3a0d0d-9869-4a07-82c8-61b4e7c6d79b",
"blockdev":"pmem4",
"badblock_count":1,
"badblocks":[
{
"offset":32,
"length":1
}
]
}
]
The badblocks section there contains the block offset within the namespace (always in units of 512B blocks) and the number of blocks. You can multiply those by 512 to get the byte offset and byte length within the namespace. If you want to use the ipmctl command to clear the errors at this point, you will have to add the namespace byte offset to the namespace base SPA, which can be obtained from the 'resource' attribute in sysfs or the corresponding libndctl API (root only).
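The offset arithmetic described above can be sketched as a small shell helper. This is only an illustration: the function name badblock_spa and the base SPA value in the usage comment are made up for the example, and the sysfs path in the comment is an assumption about where the 'resource' attribute lives on your kernel.

```shell
# badblock_spa BASE_SPA BLOCK_OFFSET
#   BASE_SPA      namespace base SPA (hex with 0x prefix, or decimal)
#   BLOCK_OFFSET  "offset" from the ndctl badblocks list, in 512-byte blocks
# Prints base + offset * 512 as a hex address.
badblock_spa() {
    printf '0x%x\n' $(( $1 + $2 * 512 ))
}

# Hypothetical usage (as root; the path is an assumption, check your system):
#   base=$(cat /sys/bus/nd/devices/namespace4.0/resource)
#   badblock_spa "$base" 32
```

With a made-up base of 0x100000000 and the offset of 32 from the listing above, the helper prints 0x100004000, since 32 * 512 = 0x4000.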
Alternatively, you can also clear errors by doing block-sized, block-aligned O_DIRECT writes to the block device. For example:
# dd if=/dev/zero of=/dev/pmem4 seek=32 bs=512 count=1 oflag=direct
1+0 records in
1+0 records out
512 bytes copied, 0.00101778 s, 503 kB/s
# ndctl list -M --namespace=namespace4.0
[
{
"dev":"namespace4.0",
"mode":"fsdax",
"map":"dev",
"size":63508480,
"uuid":"8f3a0d0d-9869-4a07-82c8-61b4e7c6d79b",
"blockdev":"pmem4"
}
]
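For several bad blocks, the dd invocation shown above can be generated from each offset/length pair in the badblocks list. The helper below is a sketch with a made-up name (clear_badblock_cmd); it only prints the command so you can review it first, because running it overwrites those blocks with zeros.

```shell
# clear_badblock_cmd DEVICE BLOCK_OFFSET BLOCK_COUNT
# Prints (does not run) a dd command that rewrites the given 512-byte
# blocks with zeros using direct I/O, mirroring the example above.
clear_badblock_cmd() {
    printf 'dd if=/dev/zero of=%s seek=%s bs=512 count=%s oflag=direct\n' \
        "$1" "$2" "$3"
}

# Example: the single bad block from the listing above.
clear_badblock_cmd /dev/pmem4 32 1
```

Re-running 'ndctl list -M' afterwards, as in the example above, shows whether the badblocks entry is gone.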
@juston-li
@stellarhopper Regarding 2, if I know the SPA of the bad blocks through Cscript, an error injection tool provided by Intel, how many poison indicators will a single ipmctl command clear at a time? Since the mailbox command clears 4 poison indicators at a time, I assume ipmctl behaves the same way.
@insanecoderr I'm not familiar with the Cscript tool, but the ars_cap command includes a clear_err_unit field, which describes how many bytes of poison can be cleared with one command. For clearing through the pmem drivers using 'dd', the clear-error is repeated as needed to clear a full sector.
I expect the ipmctl version does something similar. The clear_err_unit is typically 256B, so assuming a 'poison indicator' is a cache line (64B), 4 units makes sense.
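The arithmetic behind that estimate, as a small sketch. The function names are made up, and the 256B and 64B figures are just the typical values mentioned above; the real clear_err_unit should be read from the ars_cap result on your platform.

```shell
# indicators_per_clear CLEAR_ERR_UNIT LINE_SIZE
# Number of cache-line-sized poison indicators covered by one clear command.
indicators_per_clear() {
    echo $(( $1 / $2 ))
}

# clears_per_sector SECTOR_SIZE CLEAR_ERR_UNIT
# Number of clear commands needed to cover one sector (rounded up).
clears_per_sector() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

indicators_per_clear 256 64   # 256B unit / 64B cache line
clears_per_sector 512 256     # 512B sector / 256B unit
```

With those typical values, one clear covers 4 cache lines, and a 512B sector needs 2 clears.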
I will try some experiments later, thanks.
Can I judge whether the shutdown was clean based only on the LastShutdownStatus?
Possibly. It's just a bit complicated, as there are a number of LSS details and combinations that would indicate clean or dirty. We'll look into making it simpler to determine if the shutdown was clean.
For now, using the DirtyShutdowns counts and checking for increments is probably easiest.
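Checking for an increment can be as simple as comparing two readings of the counter. The sketch below assumes you have already recorded the count before and after the power cycle (for example by copying it from the show -sensors output); the function name and the sample values are made up, and parsing the ipmctl output is deliberately left out.

```shell
# dirty_shutdown_occurred BEFORE AFTER
# Succeeds (exit status 0) when the counter grew between the two readings.
dirty_shutdown_occurred() {
    [ "$2" -gt "$1" ]
}

# Hypothetical readings taken before and after a power cycle.
if dirty_shutdown_occurred 3 4; then
    echo "dirty shutdown detected"
fi
```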
got it
@juston-li
Which version of ipmctl supports "show -sensors" with UnlatchedDirtyShutdowns?
The commit is https://github.com/intel/ipmctl/commit/6edf6800ec99e2290488bf5dbf1c787d011d1ecb, which hasn't been released to master yet. It is likely to be released next week.
You can see in the commit, though, that on older versions DirtyShutdowns was actually reporting UnlatchedDirtyShutdowns, not latched. That commit adds latched and differentiates between the two.
@stellarhopper I failed to clear the poison indicator using the SPA calculated from "ndctl list -M": "Clear injected poison of address (0x...) on DIMM 0x0001: Error(3) - Command failure"
@stellarhopper My calculation is: 512 * offset + namespace_start_SPA. Maybe I didn't get the right namespace_start_SPA. Can you show me in detail how to get it? Thanks.
Is this still an issue in the current build?
Some questions for ipmctl:
1. Can ipmctl show LSS and AIT DRAM status? As part of the SMART health info, LSS and AIT DRAM status are important parameters. However, ipmctl seems unable to show this info. Will it be available in newer versions, or did I miss it?
2. Can ipmctl show detailed SMART health info? Detailed SMART info would help with problem tracking, so I think it would be useful.
3. How do I clear the poison indicator? Knowing the location of the poison indicator, how can I clear it? Does ipmctl support this, or should I use a raw mailbox command?