Open five-mao opened 6 months ago
Add related pictures:
The pool stores the last shutdown_count it was opened with. If the server crashes while the pool was opened, and on the next boot shutdown_count is increased, pmdk will report that as a possible ADR failure. So yes, data loss is possible.
AFAIR, $ pmempool check --repair pool.obj
will prompt you whether you want to zero the shutdown count in the pool. If data loss occurred as a result of the ADR failure, then this may not be safe.
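The check described above can be pictured with a small model. This is only a sketch of the idea, not PMDK's implementation: the names `PoolShutdownState`, `open_pool`, `close_pool`, and `repair` are hypothetical, and the "dirty" flag is an assumption standing in for "the pool was open at the time".

```python
from dataclasses import dataclass


@dataclass
class PoolShutdownState:
    """Hypothetical stand-in for the shutdown metadata stored in a pool."""
    recorded_count: int  # device shutdown_count seen when the pool was last opened
    dirty: bool          # assumed flag: True while the pool is open


def open_pool(state: PoolShutdownState, device_count: int) -> str:
    """Return "adr_failure" or "ok", updating the state as an open would."""
    if state.dirty and device_count != state.recorded_count:
        # The pool was open when the device counted another unsafe shutdown.
        return "adr_failure"
    state.recorded_count = device_count
    state.dirty = True
    return "ok"


def close_pool(state: PoolShutdownState) -> None:
    """A clean close clears the dirty flag."""
    state.dirty = False


def repair(state: PoolShutdownState) -> None:
    """Model of zeroing the stored state, as the repair prompt offers to do."""
    state.recorded_count = 0
    state.dirty = False
```

In this model a crash with the pool open followed by a higher counter yields "adr_failure" on the next open, and the pool opens again after `repair`. The model says nothing about whether the data written before the crash actually reached the media, which is why zeroing the state may not be safe.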
Thanks Piotr. @five-mao, does this answer your question?
1. Is there a detailed guide or usage example for running pmempool check --repair pool.obj? 2. What special precautions should be taken when running the repair?
(Sent by email in reply to Piotr Balcer's comment of Tue, Mar 5, 2024, 10:54 PM, on pmem/pmdk Issue #6036: "an ADR failure was deteched, the pool might be corrupted".)
Hi. After I used the pmempool check --repair command to repair the shutdown_count and restarted the node again, the ADR failure still exists. Is there any way to persistently clear the ADR failure?
@five-mao pmempool check --repair should zero out the related files, and they should get reinitialized on the next open of the pool. So it really should have helped. We have seen issues related to shutdown state when the user had, e.g., a version of ndctl that didn't match the kernel, or when PMDK didn't have permission to read the related sysfs files.
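On the earlier question about a usage example, one cautious sequence is to look before changing anything: a plain check, then a repair in dry-run mode, then the real repair. The sketch below only builds the command lines. The helper name `pmempool_check_argv` is hypothetical, and it assumes the `--verbose`, `--repair`, and `--dry-run` options described in the pmempool-check man page are available in the installed version (dry run is documented as unsupported on Device DAX).

```python
import shlex


def pmempool_check_argv(pool_path, repair=False, dry_run=False):
    """Build an argv list for `pmempool check` (hypothetical helper)."""
    argv = ["pmempool", "check", "--verbose"]
    if repair:
        argv.append("--repair")
        if dry_run:
            # Assumed to be meaningful only together with --repair.
            argv.append("--dry-run")
    argv.append(pool_path)
    return argv


if __name__ == "__main__":
    steps = [
        pmempool_check_argv("pool.obj"),                             # 1. report only
        pmempool_check_argv("pool.obj", repair=True, dry_run=True),  # 2. show planned repairs
        pmempool_check_argv("pool.obj", repair=True),                # 3. repair, answering prompts
    ]
    for argv in steps:
        print(shlex.join(argv))
```

The commands are printed rather than executed so they can be reviewed first. As noted earlier in the thread, accepting the prompt to zero the shutdown state may not be safe if data was actually lost, so the repair is best run with the application stopped and, where possible, after verifying or backing up the data.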
The versions of ndctl and the kernel are as follows:
PMDK package version(s): 1.12.1-rc1
kernel: 5.15.67-11.cl9.x86_64 #1 SMP Tue Jul 18 10:37:37 CST 2023 x86_64 x86_64 x86_64 GNU/Linux
ndctl version: 71.1
Do they match?
BTW, is there a fault-injection method that can simulate an unsafe shutdown, so that the repair function can be exercised? Using "ndctl inject-smart nmem6 --unsafe-shutdown" does not work for me, as shown below. Why?

[root@node15 ~]# ndctl list -DH
[
  {
    "dev":"nmem1",
    "id":"8089-a2-2134-000042c6",
    "handle":4368,
    "phys_id":49,
    "security":"disabled",
    "health":{
      "health_state":"ok",
      "temperature_celsius":45.0,
      "controller_temperature_celsius":44.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":false,
      "alarm_enabled_ctrl_temperature":false,
      "alarm_enabled_spares":false,
      "shutdown_state":"clean",
      "shutdown_count":0
    }
  },
  {
    "dev":"nmem0",
    "id":"8089-a2-2134-0000477f",
    "handle":272,
    "phys_id":33,
    "security":"disabled",
    "health":{
      "health_state":"ok",
      "temperature_celsius":44.0,
      "controller_temperature_celsius":42.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":false,
      "alarm_enabled_ctrl_temperature":false,
      "alarm_enabled_spares":false,
      "shutdown_state":"clean",
      "shutdown_count":0
    }
  }
]
[root@node15 ~]# ndctl inject-smart nmem0 --unsafe-shutdown
Error: nmem0: smart inject unsafe_shutdown command failed: No such device or address (-6)
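When comparing several DIMMs, it can help to reduce the `ndctl list -DH` output to the two fields that matter here. The following is a minimal sketch: `shutdown_summary` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the output above, keeping only the fields it reads.

```python
import json


def shutdown_summary(ndctl_json):
    """Map each DIMM name to (shutdown_state, shutdown_count).

    Expects the JSON text printed by `ndctl list -DH`; fields that are
    missing come back as None.
    """
    dimms = json.loads(ndctl_json)
    if isinstance(dimms, dict):
        # Tolerate a single object instead of a list.
        dimms = [dimms]
    summary = {}
    for dimm in dimms:
        health = dimm.get("health", {})
        summary[dimm["dev"]] = (
            health.get("shutdown_state"),
            health.get("shutdown_count"),
        )
    return summary


# Trimmed sample based on the output quoted in this thread.
SAMPLE = """
[
  {"dev": "nmem1", "health": {"shutdown_state": "clean", "shutdown_count": 0}},
  {"dev": "nmem0", "health": {"shutdown_state": "clean", "shutdown_count": 0}}
]
"""

if __name__ == "__main__":
    for dev, (state, count) in sorted(shutdown_summary(SAMPLE).items()):
        print(dev, state, count)
```

On a live system the input could come from capturing the command's standard output, which is not shown here.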
After the pmempool check --repair command is executed and the node is restarted (with the reboot command), the ADR failure problem still occurs. I can see that the value of shutdown_count before the restart is 41. After the reboot is performed twice, shutdown_count grows to 43.
the ADR failure problem still occurs. I can see that the value of shutdown_count before the restart is 41. After the reboot is performed twice, shutdown_count grows to 43.
If I understand correctly, shutdown_count is the number of unsafe shutdowns, which means the system is still in bad shape, since it cannot shut down safely. You have to look into it.
@stellarhopper please correct me if I am wrong.
Is there any way to persistently clear the ADR failure?
No. It would mean you effectively ignore the issues you have in your system.
Have you at least tried to open the pmemobj pool after pmempool check --repair? Is it OK before the reboot?
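The reasoning about the growing counter can be checked by comparing shutdown_count snapshots taken before and after a reboot. This is a minimal sketch: `unsafe_shutdowns_between` is a hypothetical helper, and the device name `nmem0` in the usage lines is an assumption, since the thread does not say which DIMM reported 41 and 43.

```python
def unsafe_shutdowns_between(before, after):
    """Return the per-DIMM increase in shutdown_count between two snapshots.

    Both arguments map a DIMM name to its shutdown_count. Only DIMMs present
    in both snapshots whose counter grew are included in the result.
    """
    return {
        dev: after[dev] - before[dev]
        for dev in before
        if dev in after and after[dev] > before[dev]
    }


if __name__ == "__main__":
    before = {"nmem0": 41}  # value reported before the restart
    after = {"nmem0": 43}   # value reported after two reboots
    print(unsafe_shutdowns_between(before, after))
```

An increase equal to the number of reboots, as in the reported 41 to 43, suggests that every reboot is being counted as unsafe. That points at the platform rather than at the pool.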
1. Before the reboot, opening the pmemobj pool after pmempool check --repair works fine.
2. Why can't the reboot command shut the system down safely, even though power is maintained?
1. Before the reboot, opening the pmemobj pool after pmempool check --repair works fine.
This is great!
2. Why can't the reboot command shut the system down safely, even though power is maintained?
I am afraid this is a hardware issue. Please get in touch with tech support for further guidance.
When the process starts, opening the file through pmemobj_open reports "an ADR failure was detected, the pool might be corrupted". 1) What does this error mean? 2) Does it mean that data has been lost on the pmem? 3) Is there any way to ignore this error from pmem, and if so, is there a risk of data corruption?
Note: 1. The host was behaving abnormally, and its mainboard has been replaced.