rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0

[PATCH v1]kdumpctl: add kdump tested status updating/reporting support #8

Closed liutgnu closed 1 month ago

liutgnu commented 2 months ago

Motivation

People often regard kdump as working without ever testing whether it can actually generate a vmcore. As a result, a real system crash may produce no vmcore at all, which is not what users expect from kdump.

It is highly recommended to test kdump after any system modification, such as:

a. after kernel patching or a full yum update, which might break something kdump depends on, for example by introducing a new bug;
b. after any hardware-level change, such as storage, networking, or a firmware upgrade;
c. after deploying any new application, especially one that involves 3rd-party modules.

Although these cases are beyond the scope of kdump itself, a simple test notification is good to have for now.

Design

Kdump currently checks whether any related files/fs/drivers have been modified to determine whether the initrd should be rebuilt on (re)start. A rebuild indicates a modification, so kdump needs to be tested again. The rebuild therefore clears the test status recorded in $KDUMP_STATUS.
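
The description does not say how the status is cleared. One possible reading, sketched below, is to reset the state field of the record to "untested" while keeping the other fields. The function name `mark_kdump_untested` and the one-line, four-field file layout are assumptions made for illustration; the actual patch may do this differently.

```shell
# Hypothetical sketch: mark the recorded status as untested after an initrd
# rebuild. Assumes a one-line file with four whitespace-separated fields
# whose last field is the test state; the real patch may clear it differently.
mark_kdump_untested() {
    status_file=$1
    [ -f "$status_file" ] || return 0
    tmp_file=$(mktemp) || return 1
    # Keep target, vmcore path and timestamp; reset only the state field.
    awk '{ print $1, $2, $3, "untested" }' "$status_file" > "$tmp_file" &&
        mv "$tmp_file" "$status_file"
}
```

Calling `mark_kdump_untested "$KDUMP_STATUS"` from the rebuild path would be one place to hook this in, assuming $KDUMP_STATUS holds the path of the status file.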

The kdump test check happens at "kdumpctl (re)start/status" and reports the tested/untested status to the user. A tested status indicates that a vmcore was previously generated successfully in the current environment, so a vmcore is more likely to be generated when a real crash happens.

$KDUMP_STATUS is used for recording the newest vmcore and the test status. The format will be like:

root@1.2.3.4:/var/crash 127.0.0.1-2024-05-01-15:54:29/vmcore 1714550071 untested
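
To make the four fields concrete, here is a minimal shell sketch that splits one record of this shape. The function name `parse_kdump_status_line` and the `ks_*` variable names are made up for illustration and are not taken from the patch. It also assumes that none of the fields contains whitespace.

```shell
# Hypothetical helper (not from the patch): split one status record into
# dump target, vmcore path, timestamp and test state.
# Assumes the four fields are whitespace-separated and contain no spaces.
parse_kdump_status_line() {
    read -r ks_target ks_vmcore ks_timestamp ks_state <<EOF
$1
EOF
}

parse_kdump_status_line "root@1.2.3.4:/var/crash 127.0.0.1-2024-05-01-15:54:29/vmcore 1714550071 untested"
printf 'target=%s\nvmcore=%s\ntimestamp=%s\nstate=%s\n' \
    "$ks_target" "$ks_vmcore" "$ks_timestamp" "$ks_state"
```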

This means that the vmcore saved at this path, with this timestamp, is the newest one, and kdump is not tested. If another vmcore with a larger (newer) timestamp is later found in the same path, the newer vmcore is recorded in $KDUMP_STATUS and the status is marked as tested. (Note: this assumes the newer vmcore was generated by the current machine. If not, the kdump test status is incorrect; see the following concurrent test case:

 machine1:                   machine2:

 start test checking         start test checking
         |                          |
         V                          V
 get the newest vmcore       get the newest vmcore
         |                          |
         V                          V
 notify user: untested       notify user: untested
         |                          |
         V                          V
 expect a kdump test by  <-- generate vmcore
 generating a vmcore                |
         |                          V
         V                         ...
 start test checking
         |
         V
 find a newer vmcore
         |
         V
 update $KDUMP_STATUS
 and notify user: tested <-- wrong test status

The IP address is not a reliable way to tell which machine a vmcore came from, for example when dumping over ssh through a NAT network. Distinguishing machines would need extra code, and I think concurrent kdump tests on multiple machines are rare, so only serial kdump testing is supported for now.)

The detailed updating/checking rules can be found in check_kdump_tested().
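
As a rough illustration of the update rule described above, and explicitly not the real check_kdump_tested(), the comparison could look like the sketch below. The function name `next_status_line`, its arguments, and the second vmcore path and timestamp in the usage lines are invented for the example.

```shell
# Hypothetical sketch of the rule described above; NOT the real
# check_kdump_tested(). Inputs: the recorded status line, plus the path and
# timestamp (seconds since the epoch) of the newest vmcore found on the target.
next_status_line() {
    recorded=$1
    found_vmcore=$2
    found_ts=$3
    read -r target old_vmcore old_ts old_state <<EOF
$recorded
EOF
    if [ "$found_ts" -gt "$old_ts" ]; then
        # A newer vmcore appeared since the last check: record it as tested.
        echo "$target $found_vmcore $found_ts tested"
    else
        # Nothing newer was found: keep the record unchanged.
        echo "$recorded"
    fi
}

old="root@1.2.3.4:/var/crash 127.0.0.1-2024-05-01-15:54:29/vmcore 1714550071 untested"
# Made-up newer vmcore; only the larger timestamp matters for the rule.
next_status_line "$old" "127.0.0.1-2024-05-02-10:00:00/vmcore" 1714615200
```

Note that this sketch has the limitation discussed in the note: it cannot tell whether the newer vmcore came from the current machine.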

liutgnu commented 1 month ago

I will close this MR because v2 has been posted in [1]. Thanks!

[1] https://github.com/rhkdump/kdump-utils/pull/13

daveyoung commented 1 month ago

Sounds good, thanks Tao.

[test the email reply to github]

On Tue, 18 Jun 2024 at 17:28, liutgnu @.***> wrote:

Closed #8 https://github.com/rhkdump/kdump-utils/pull/8.
