pmem / pmdk

Persistent Memory Development Kit
https://pmem.io

an ADR failure was detected, the pool might be corrupted #6036

Open five-mao opened 6 months ago

five-mao commented 6 months ago

When the process starts and opens the pool file through pmemobj_open, it reports "an ADR failure was detected, the pool might be corrupted". 1) What does this error mean? 2) Does it mean that data has been lost on the pmem? 3) Is there any way to ignore this error from pmem, and if so, is there a risk of data corruption?

Note:

  1. The host was behaving abnormally, and the mainboard has been replaced.
  2. In the ndctl list -DH command output, the shutdown_count field is 1, but shutdown_state is clean.
  3. PMDK package version(s): 1.12.1-rc1
five-mao commented 6 months ago

Add related pictures:

pmemobj_open err

ndctl list

pbalcer commented 6 months ago

The pool stores the last shutdown_count it was opened with. If the server crashes while the pool was opened, and on the next boot shutdown_count is increased, pmdk will report that as a possible ADR failure. So yes, data loss is possible.

AFAIR, $ pmempool check --repair pool.obj will prompt you whether you want to zero the shutdown count in the pool. If data loss occurred as a result of the ADR failure, then this may not be safe.
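The rule described above can be sketched as a toy model. This is only an illustration of the explanation in this comment, not PMDK's actual code: the `PoolShutdownState` class, its field names, and `possible_adr_failure` are hypothetical names made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PoolShutdownState:
    """Hypothetical stand-in for the shutdown information kept in the pool."""
    stored_count: int  # shutdown_count recorded when the pool was last opened
    was_open: bool     # True if the pool was still open when the server went down


def possible_adr_failure(pool: PoolShutdownState, current_count: int) -> bool:
    """Return True when opening the pool would be flagged as a possible ADR failure.

    Assumption: the pool is flagged only if it was open at the time of the
    crash AND the DIMM's shutdown_count has grown since it was recorded.
    """
    return pool.was_open and current_count > pool.stored_count


if __name__ == "__main__":
    # Pool was open during the crash and the counter moved from 0 to 1.
    print(possible_adr_failure(PoolShutdownState(0, True), 1))
    # Pool was closed properly before the counter moved.
    print(possible_adr_failure(PoolShutdownState(0, False), 1))
```

In this model, zeroing the stored state (what the repair prompt offers) only silences the report; it cannot tell whether data written just before the crash actually reached the media.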

Sean58238 commented 6 months ago

Thanks Piotr. @five-mao, does this answer your question?

five-mao commented 6 months ago

  1. Is there a detailed guide or usage example for running the command pmempool check --repair pool.obj?
  2. What special precautions need to be taken while the repair is running?


five-mao commented 6 months ago

Hi. After I used the pmempool check --repair command to repair the shutdown_count and restarted the node, the ADR failure still occurs. Is there any way to persistently clear the ADR failure?

Sean58238 commented 6 months ago

@five-mao pmempool check --repair should zero-out related files and they should get reinitialized on next open of the pool. So it really should’ve helped. We did see issues related to shutdown state when the user had, e.g., an incorrect version of ndctl that didn’t match the kernel, or when PMDK didn’t have permissions to read the related sysfs files.
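For reference, a cautious way to run the repair could look like the sketch below, which only builds and prints command lines. The helper `pmempool_check_cmd` and the file names are hypothetical; the `--repair`, `--backup`, `--dry-run` and `--verbose` options are the ones listed in pmempool-check(1) as far as I recall, so please confirm them with `pmempool check --help` on your installed version before relying on this.

```python
import shlex
from typing import List, Optional


def pmempool_check_cmd(pool: str, repair: bool = False,
                       backup: Optional[str] = None,
                       dry_run: bool = False) -> List[str]:
    """Build an argv list for `pmempool check` (hypothetical helper)."""
    cmd = ["pmempool", "check", "--verbose"]
    if repair:
        cmd.append("--repair")
    if backup is not None:
        # Assumption from the man page: --backup requires --repair.
        if not repair:
            raise ValueError("--backup only makes sense together with --repair")
        cmd += ["--backup", backup]
    if dry_run:
        cmd.append("--dry-run")
    cmd.append(pool)
    return cmd


if __name__ == "__main__":
    steps = [
        pmempool_check_cmd("pool.obj"),                              # 1. inspect only
        pmempool_check_cmd("pool.obj", repair=True, dry_run=True),   # 2. preview the repair
        pmempool_check_cmd("pool.obj", repair=True,
                           backup="pool.obj.bak"),                   # 3. repair, keeping a backup
    ]
    for step in steps:
        print(shlex.join(step))
```

The order (inspect, preview, then repair with a backup) is a suggestion rather than an official procedure; the pool should not be open in any application while the check runs.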

five-mao commented 6 months ago

The versions of ndctl and the kernel are as follows:

PMDK package version(s): 1.12.1-rc1
kernel: 5.15.67-11.cl9.x86_64 #1 SMP Tue Jul 18 10:37:37 CST 2023 x86_64 x86_64 x86_64 GNU/Linux
ndctl version: 71.1

Do they match?

five-mao commented 6 months ago

BTW, is there a fault injection method that can simulate an unsafe shutdown, so that the repair function can be tested? Using "ndctl inject-smart nmem6 --unsafe-shutdown" does not work for me, as shown below. Why?

[root@node15 ~]# ndctl list -DH
[
  {
    "dev":"nmem1",
    "id":"8089-a2-2134-000042c6",
    "handle":4368,
    "phys_id":49,
    "security":"disabled",
    "health":{
      "health_state":"ok",
      "temperature_celsius":45.0,
      "controller_temperature_celsius":44.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":false,
      "alarm_enabled_ctrl_temperature":false,
      "alarm_enabled_spares":false,
      "shutdown_state":"clean",
      "shutdown_count":0
    }
  },
  {
    "dev":"nmem0",
    "id":"8089-a2-2134-0000477f",
    "handle":272,
    "phys_id":33,
    "security":"disabled",
    "health":{
      "health_state":"ok",
      "temperature_celsius":44.0,
      "controller_temperature_celsius":42.0,
      "spares_percentage":100,
      "alarm_temperature":false,
      "alarm_controller_temperature":false,
      "alarm_spares":false,
      "alarm_enabled_media_temperature":false,
      "alarm_enabled_ctrl_temperature":false,
      "alarm_enabled_spares":false,
      "shutdown_state":"clean",
      "shutdown_count":0
    }
  }
]
[root@node15 ~]# ndctl inject-smart nmem0 --unsafe-shutdown
Error: nmem0: smart inject unsafe_shutdown command failed: No such device or address (-6)
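To follow shutdown_state and shutdown_count across reboots without reading the raw JSON each time, the output of ndctl list -DH can be summarized with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and `SAMPLE` is a trimmed copy of the output above (only the fields used here are kept).

```python
import json
from typing import List, Optional, Tuple

# Trimmed copy of the `ndctl list -DH` output posted above.
SAMPLE = """
[
  {"dev": "nmem1", "health": {"health_state": "ok",
                              "shutdown_state": "clean", "shutdown_count": 0}},
  {"dev": "nmem0", "health": {"health_state": "ok",
                              "shutdown_state": "clean", "shutdown_count": 0}}
]
"""

Row = Tuple[str, Optional[str], Optional[int]]


def shutdown_summary(ndctl_json: str) -> List[Row]:
    """Return (dev, shutdown_state, shutdown_count) for every DIMM listed."""
    rows = []
    for dimm in json.loads(ndctl_json):
        health = dimm.get("health", {})  # health is absent if -H was not used
        rows.append((dimm["dev"],
                     health.get("shutdown_state"),
                     health.get("shutdown_count")))
    return rows


def dimms_not_clean(rows: List[Row]) -> List[str]:
    """Names of DIMMs whose last shutdown was not reported as clean."""
    return [dev for dev, state, _count in rows if state != "clean"]


if __name__ == "__main__":
    for dev, state, count in shutdown_summary(SAMPLE):
        print(f"{dev}: shutdown_state={state} shutdown_count={count}")
```

On a live system the JSON string would come from running ndctl list -DH (for example through `subprocess.run(..., capture_output=True)`), and saving the summary before and after each reboot shows whether the count keeps growing.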

five-mao commented 6 months ago


After the pmempool check --repair command is executed and the node is restarted (with the reboot command), the ADR failure still occurs. The value of shutdown_count before the restart was 41. After rebooting twice, shutdown_count grew to 43.

  1. Does this mean that the pmempool check --repair command takes effect persistently, but shutdown_count is incremented again on each reboot?
  2. Why does shutdown_count increase when I only run reboot, rather than after a server crash?
janekmi commented 6 months ago

> the ADR failure problem still occurs. I can see that the value of shutdown_count before the restart is 41. After the reboot is performed twice "shutdown_count" grows to 43,

If I understand correctly, "shutdown_count" is the number of "Unsafe Shutdowns", which means the system is still in bad shape since it can't shut down safely. You have to look into it.

@stellarhopper please correct me if I am wrong.

> Is there any way to persistently clear the ADR failure?

No. It would mean you effectively ignore the issues you have in your system.

Have you at least tried to open the pmemobj pool after pmempool check --repair? Is it ok before the reboot?

five-mao commented 6 months ago


1. Before the reboot, opening the pmemobj pool after pmempool check --repair works fine.

2. Why can't the system shut down safely when I run the reboot command, given that power is maintained the whole time?

janekmi commented 6 months ago

> 1. Before the reboot, opening the pmemobj pool after pmempool check --repair works fine.

This is great!

> 2. Why can't the system shut down safely when I run the reboot command, given that power is maintained the whole time?

I am afraid this is a hardware issue. Please get in touch with the tech support for further guidance.