rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0
3 stars 8 forks source link

[PATCH v2]Introduce vmcore creation notification to kdump #13

Closed liutgnu closed 2 weeks ago

liutgnu commented 1 month ago
Motivation
==========

People may forget to recheck to ensure kdump works, which as a result, a
possibility of no vmcores generated after a real system crash. It is
unexpected for kdump.

It is highly recommended people to recheck kdump after any system
modification, such as:

a. after kernel patching or whole yum update, as it might break something
   on which kdump is dependent, maybe due to introduction of any new bug etc.
b. after any change at hardware level, maybe storage, networking,
   firmware upgrading etc.
c. after implementing any new application, like which involves 3rd party modules
   etc.

Though these exceed the range of kdump, however a simple vmcore creation
status notification is good to have for now.

Design
======

Kdump currently will check any relating files/fs/drivers modified before
determine if initrd should rebuild when (re)start. A rebuild is an
indicator of such modification, and kdump need to be rechecked. This will
clear the vmcore creation status specified in $VMCORE_CREATION_STATUS.

Vmcore creation check will happen at "kdumpctl (re)start/status", and will
report the creation success/fail status to users. A "success" status indicates
previously there has been a vmcore successfully generated based on the current
env, so it is more likely a vmcore will be generated later when real crash
happens; A "fail" status indicates previously there was no vmcore
generated, or has been a vmcore creation failed based on current env. User
should check the 2nd kernel log or the kexec-dmesg.log for the failing reason.

$VMCORE_CREATION_STATUS is used for recording the vmcore creation status of
the current env. The format will be like:

   success 1718682002

Which means, there has been a vmcore generated successfully at this
timestamp for the current env.

Usage
=====

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Fail: No vmcore creation test performed!

[root@localhost ~]# echo c > /proc/sysrq-trigger

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Success: Last successful vmcore creation on Tue Jun 18 16:39:10 CST 2024

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Success: Last successful vmcore creation on Tue Jun 18 16:39:10 CST 2024

The notification for kdumpctl (re)start/status can be disabled by
setting VMCORE_CREATION_NOTIFICATION in /etc/sysconfig/kdump

Signed-off-by: Tao Liu <ltao@redhat.com>
liutgnu commented 3 weeks ago

Hi Tao,

Thanks for the effort for making this achievement. I have one tiny concern about this PR.

If vmcore was created successfully and the status is recorded. Then later kdump kernel didn't finish booting to the init process or run the kdump scripts. Will we still get a 'success' status for kdump?

Hi Baoquan,

Yes, it will get a 'success' status. Because there was a vmcore generated before, so there is a success status recorded. Since we don't make any change to the system, the status file will not be deleted, and we are expecting the same env will lead to a same result, aka a vmcore generated for a later crash. However when a later crash happens, the init doesn't bootup. This breaks our premise of "A 'success' status indicates previously there has been a vmcore successfully generated based on the current env, so it is more likely a vmcore will be generated later when real crash happens".

This is a problem and it cannot be solved by this patch, the patch is based on the premise, but the premise is failed.

Personally I suggest we give more time of testing the premise. If it cannot cover the majority proportion of our cases, there is no need to introduce this patch.

liutgnu commented 2 weeks ago

Hi Dave,

Thanks for the review.

[root@localhost ~]# kdumpctl restart kdump: kexec: unloaded kdump kernel kdump: Stopping kdump: [OK] kdump: kexec: loaded kdump kernel kdump: Starting kdump: [OK] kdump: Fail: No vmcore creation test performed!

Above "Fail" message causes misunderstandings, people may think that kdump service failed. As it is just a notification, not mandatory remove the "Fail:" and "Success:" looks reasonable. And also extra brief words to let people know how to do the test will be better. For example: kdump: it seems kdump not tested, you can trigger crash and test kdump via "echo c > /proc/sysrq-trigger" . Maybe see if Philipp has better suggestion about the wording.

I updated the notify message as:

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Vmcore creation notice: No vmcore creation test performed!

The Fail/Success message been changed to "Vmcore creation notice". Personally I think the "kdump: it seems kdump not tested, you can trigger crash and test kdump via "echo c > /proc/sysrq-trigger"" message is redundant. Because the message is only meaningful to the CEE people, so CEE people know how to handle the test IMHO. And we don't want to instruct ordinary people to follow the message, because crash & reboot the system is a dangerous operation if people have no context about this feature. So I guess one "No vmcore creation test performed!" message is enough. What do you think?

Thanks, Tao Liu

liutgnu commented 2 weeks ago

Please checkout the new version of v3 in 1. And I will close this one. Thanks!

baoquan-he commented 2 weeks ago

Hi Tao,

... Yes, it will get a 'success' status. Because there was a vmcore generated before, so there is a success status recorded. Since we don't make any change to the system, the status file will not be deleted, and we are expecting the same env will lead to a same result, aka a vmcore generated for a later crash. However when a later crash happens, the init doesn't bootup. This breaks our premise of "A 'success' status indicates previously there has been a vmcore successfully generated based on the current env, so it is more likely a vmcore will be generated later when real crash happens".

When I talked to Dave about this, an idea comes up that we can use a specific command to trigger the test, e.g kdumpctl test, then remove the file $VMCORE_CREATION_STATUS before triggering crash. Then we will only have 'success' or 'unknown'/'untested' two status. 'success' means vmcore is saved successfully last time, 'untested/unknown' means it's not tested or vmcore is not saved successfully. Not sure if this is clearer regarding to the status.

This is a problem and it cannot be solved by this patch, the patch is based on the premise, but the premise is failed.

Personally I suggest we give more time of testing the premise. If it cannot cover the majority proportion of our cases, there is no need to introduce this patch.

liutgnu commented 1 week ago

Hi Baoquan,

Hi Tao,

... Yes, it will get a 'success' status. Because there was a vmcore generated before, so there is a success status recorded. Since we don't make any change to the system, the status file will not be deleted, and we are expecting the same env will lead to a same result, aka a vmcore generated for a later crash. However when a later crash happens, the init doesn't bootup. This breaks our premise of "A 'success' status indicates previously there has been a vmcore successfully generated based on the current env, so it is more likely a vmcore will be generated later when real crash happens".

When I talked to Dave about this, an idea comes up that we can use a specific command to trigger the test, e.g kdumpctl test, then remove the file $VMCORE_CREATION_STATUS before triggering crash. Then we will only have 'success' or 'unknown'/'untested' two status. 'success' means vmcore is saved successfully last time, 'untested/unknown' means it's not tested or vmcore is not saved successfully. Not sure if this is clearer regarding to the status.

Thanks for the comments, now I have worked out the v4 which implemented the testing entrance as "kdumpctl test", which will remove $VMCORE_CREATION_STATUS before triggering crash. Please checkout the final force-pushed commits in 1.

This is a problem and it cannot be solved by this patch, the patch is based on the premise, but the premise is failed. Personally I suggest we give more time of testing the premise. If it cannot cover the majority proportion of our cases, there is no need to introduce this patch.