rhkdump / kdump-utils

Kernel crash dump collection utilities
GNU General Public License v2.0

Introduce vmcore creation notification to kdump #20

Closed: liutgnu closed this 1 week ago

liutgnu commented 3 months ago
Introduce vmcore creation notification to kdump

Motivation
==========

People may forget to recheck that kdump works, and as a result there is a
possibility that no vmcore is generated after a real system crash. This is
unexpected for kdump.

It is highly recommended that people recheck kdump after any system
modification, such as:

a. after kernel patching or a whole yum update, as it might break something
   on which kdump depends, e.g. by introducing a new bug;
b. after any change at the hardware level, e.g. storage, networking, or
   firmware upgrades;
c. after deploying any new application, e.g. one that involves 3rd-party
   modules.

Though these exceed the scope of kdump, a simple vmcore creation status
notification is good to have for now.

Design
======

When (re)started, kdump currently checks whether any related files/fs/drivers
have been modified in order to determine whether the initrd should be rebuilt.
A rebuild is an indicator of such a modification, and kdump needs to be
rechecked. The rebuild will clear the vmcore creation status recorded in
$VMCORE_CREATION_STATUS.

The vmcore creation check happens at "kdumpctl (re)start/status" and reports
the creation success/fail status to users. A "success" status indicates that a
vmcore has previously been generated successfully in the current env, so it is
likely that a vmcore will also be generated later when a real crash happens; a
"fail" status indicates that no vmcore has been generated yet, or that a vmcore
creation has failed in the current env. The user should check the 2nd kernel
log or kexec-dmesg.log for the reason of the failure.

$VMCORE_CREATION_STATUS is used to record the vmcore creation status of the
current env. The format looks like:

   success 1718682002

which means a vmcore was generated successfully at this timestamp in the
current env.
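
For illustration only, here is a minimal sketch of how such a status file could
be read and reported. This is not the patch code; the path and the success
notice wording follow the examples in this description, and the wording of the
fail notice is an assumption:

    #!/bin/bash
    # path discussed in this PR for $VMCORE_CREATION_STATUS
    VMCORE_CREATION_STATUS=/var/crash/vmcore-creation.status

    report_vmcore_creation_status() {
        if [[ ! -s "$VMCORE_CREATION_STATUS" ]]; then
            echo "kdump: Notice: No vmcore creation test performed!"
            return
        fi
        # expected line format: "<success|fail> <unix timestamp>"
        read -r status timestamp < "$VMCORE_CREATION_STATUS"
        case "$status" in
        success) echo "kdump: Notice: Last successful vmcore creation on $(date -d "@$timestamp")" ;;
        fail)    echo "kdump: Notice: Last vmcore creation failed on $(date -d "@$timestamp")" ;;
        esac
    }

    report_vmcore_creation_status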

Usage
=====

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: No vmcore creation test performed!

[root@localhost ~]# kdumpctl test

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Notice: Last successful vmcore creation on Tue Jun 18 16:39:10 CST 2024

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: Last successful vmcore creation on Tue Jun 18 16:39:10 CST 2024

The notification for kdumpctl (re)start/status can be disabled by setting
VMCORE_CREATION_NOTIFICATION in /etc/sysconfig/kdump.
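
As an assumed illustration of where that knob lives (the exact accepted values
are defined by the patch and are not quoted in this description, so the value
below is only a placeholder):

    # /etc/sysconfig/kdump (hypothetical snippet; consult the patched
    # kdump.sysconfig for the actual syntax and default)
    VMCORE_CREATION_NOTIFICATION="no"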

===

v3 -> v2:
Always mount the device of $VMCORE_CREATION_STATUS
(/var/crash/vmcore-creation.status) for the 2nd kernel, in case /var is a
separate device from the rootfs device.

v4 -> v3:
Add "kdumpctl test" as the entrance for performing the kdump test.

v5 -> v4:
Fix the mounting failure issue in fadump.

v6 -> v5:
Add a new argument for a customized mount point to add_mount/to_mount.

v7 -> v6:
a. Code refactoring based on Philipp's suggestion.
b. Only mount $VMCORE_CREATION_STATUS(/var/crash/vmcore-creation.status)'s
   device when needed.
c. Add a "--force" option for "kdumpctl test", to support the automated test
   scripts QE may run.
d. Add a check in "kdumpctl test" that $VMCORE_CREATION_STATUS can only be on a
   local drive.

v8 -> v7:
a. Rebased the patch on top of upstream commit e2b8463b5.
b. Code refactoring based on Philipp's suggestion.
c. Updated the "test" entry of kdumpctl.8.

Signed-off-by: Tao Liu <ltao@redhat.com>
daveyoung commented 3 months ago

Hi Tao, it is weird that the email notification still keeps using the old success/fail description sentences. Though it is v3, there is no way to reply inline, so let me put a comment on the web UI here: "kdump: Vmcore creation notice: No vmcore creation test performed!" The above notice looks redundant with "vmcore creation" mentioned twice. I suggest just "notice: no vmcore creation test performed!" or "Warning: no vmcore creation test performed".

And "notice: Last successful vmcore creation on Tue Jun ..." or "Last successful vmcore creation on Tue Jun ..."

Also how about change "vmcore creation" to "kdump test", ditto about the status file name. It sounds more straightforward and easier to understand.

daveyoung commented 3 months ago

Hi Tao, besides that I noticed that there is a special case for dump_to_rootfs in mkdumprd. So some testing and verification with fadump is also necessary.

liutgnu commented 3 months ago

Hi Tao, it is weird that the email notification still keeps using old description success/fail sentences. though it is V3 so no way to reply inline. Let me put a comment on web ui here: "kdump: Vmcore creation notice: No vmcore creation test performed!" For above notice, it looks redundant with two-time mentioning of "vmcore creation". I suggest just "notice: no vmcore creation test preformed!" Or "Warning: no vmcore creation test performed"

And "notice: Last successful vmcore creation on Tue Jun ..." or "Last successful vmcore creation on Tue Jun ..."

Also how about change "vmcore creation" to "kdump test", ditto about the status file name. It sounds more straightforward and easier to understand.

Hi Dave,

Agreed, just "notice: Last successful vmcore ..." is enough; will get it improved in v4.

liutgnu commented 3 months ago

Hi Tao, otherwise I noticed that there is a special case for dump_to_rootfs in mkdumprd. So doing some testing and verification with fadump should be also necessary.

Yes, will do it. Currently there is an OOM issue in kexec-tools which has blocked my fadump testing, see RHEL-35516.

coiby commented 3 months ago

Hi @liutgnu,

I'm curious to ask why you prefer creating a new PR for each new version? From the perspective of a maintainer, one benefit of pushing a new version to the same PR is that I can compare the difference between two versions with one click, i.e. by clicking the "Compare" hyperlink.

liutgnu commented 3 months ago

Hi @liutgnu,

I'm curious to ask why you prefer creating a new PR for a new version? For the perspective of maintainer, one benefit of pushing new version to the same PR is I can compare the difference between two versions with one-lick i.e. by clicking the "Compare" hyperlink,

Please tell me which way you prefer. On one hand, a "force push" is dangerous and should be avoided; I guess a separate MR for each patch version (pushed from a different branch) is the safer one. On the other hand, a "force push" against a single MR as you described may have the following issue:

1: patch v1 (MR created)
2: force push 1 (to improve v1)
3: force push 2 (to improve v1 again, now v1 is OK for review)
4: force push 3 (received comments, drafting v2)
5: force push 4 (to improve v2, now v2 is ready for review)
...

You see, there might be plenty of "force pushes". I guess what you want to view is the diff of 3 & 5. I don't know if you can select the commit hashes for the diff comparison; what I know is that it can only show the diff of adjacent pushes, please correct me if I'm wrong.

coiby commented 3 months ago

Thanks for raising the concerns! If you are worried that a force-push may overwrite the history of this repo, then you needn't worry about it, because this repo doesn't allow force-push, which is also the default setting.

As for the case of comparing versions 3 & 5, you can simply send the new version of the patches in one single force-push to avoid this case. Btw, you can also compare different commits directly, e.g. compare your v3 with v2.

So I prefer sending new versions of patches to the same PR. Another benefit is we can keep the same context in the same PR.
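
As a side note, for comparing two revisions of a patch series locally (assuming
both version branches are still available), something like git range-diff can
show per-patch differences; the branch names below are made up for
illustration:

    # diff the v2 series against the v3 series, both based on main
    git range-diff main v2-vmcore-notification v3-vmcore-notification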

liutgnu commented 2 months ago

The v4 has been force-pushed to this PR, which mainly introduces "kdumpctl test" as the entry point for performing the kdump test. @daveyoung @baoquan-he

In addition, in v3 Dave mentioned combining /kdumproot and /sysroot in kdump-lib.sh:get_kdump_mntpoint_from_target(), in order to simplify the dir mounting for saving $VMCORE_CREATION_STATUS. However, I ran an experiment showing negative results, see the following:

If we unify the two, and kdump dumps over nfs with dump_to_rootfs as the failure action, the mount arguments will be like:

Arguments: --add 'kdumpbase' --quiet --hostonly --hostonly-cmdline --hostonly-i18n --hostonly-mode 'strict' --hostonly-nics '' -o 'plymouth resume ifcfg earlykdump' --mount '192.168.1.65:/tmp/nfs /sysroot nfs defaults' --squash-compressor 'zstd' --mount '/dev/mapper/rhel_ibm--p8--kvm--03--guest--02-root /sysroot xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'

The original mounting status will be like:

Arguments: --add 'kdumpbase' --quiet --hostonly --hostonly-cmdline --hostonly-i18n --hostonly-mode 'strict' --hostonly-nics '' -o 'plymouth resume ifcfg earlykdump' --mount '192.168.1.65:/tmp/nfs /kdumproot nfs defaults' --squash-compressor 'zstd' --mount '/dev/mapper/rhel_ibm--p8--kvm--03--guest--02-root /sysroot xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'

When the nfs mount fails, it will try to mount sysroot via "systemctl start --no-block sysroot.mount", which will mount nfs again instead of mounting the block device. So the mount will fail again, and dump_to_rootfs fails, which is unexpected.

see the 2nd kernel log:

[ 130.266759] mount[563]: mount.nfs: Connection refused
[ 130.268887] systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for details.
[ 130.272324] systemd[1]: Failed to mount /sysroot.
...
[ 256.480662] kdump[700]: failed to dump to '/sysroot', it's not a mount point!
[FAILED] Failed to start Kdump Vmcore Save Service.
See 'systemctl status kdump-capture.service' for details.
...
[ 257.293836] systemd[1]: sysroot.mount: Directory /sysroot to mount over is not empty, mounting anyway.
[ 257.296748] systemd[1]: Mounting /sysroot...
[ 257.304670] mount[776]: mount.nfs: rpc.statd is not running but is required for remote locking.
...
[ 347.314093] kdump[981]: Failed to mount rootfs
[ 347.316864] kdump[983]: Executing final action systemctl reboot -f

daveyoung commented 2 months ago

The v4 has been force pushed to this PR, which mainly introduced the "kdumpctl test" as the entrance for performing kdump test. @daveyoung @baoquan-he

In addition, in v3 Dave mentioned the combination of /kdumproot and /sysroot in kdump-lib.sh:get_kdump_mntpoint_from_target(), in order to simplify the dir mounting for saving $VMCORE_CREATION_STATUS. However I made an experiment showing the negative results, see the following:

If we unify the 2, and kdump through nfs and failure dump to rootfs, the mounting status will be like:

Arguments: --add 'kdumpbase' --quiet --hostonly --hostonly-cmdline --hostonly-i18n --hostonly-mode 'strict' --hostonly-nics '' -o 'plymouth resume ifcfg earlykdump' --mount '192.168.1.65:/tmp/nfs /sysroot nfs defaults' --squash-compressor 'zstd' --mount '/dev/mapper/rhel_ibm--p8--kvm--03--guest--02-root /sysroot xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'

The original mounting status will be like:

Arguments: --add 'kdumpbase' --quiet --hostonly --hostonly-cmdline --hostonly-i18n --hostonly-mode 'strict' --hostonly-nics '' -o 'plymouth resume ifcfg earlykdump' --mount '192.168.1.65:/tmp/nfs /kdumproot nfs defaults' --squash-compressor 'zstd' --mount '/dev/mapper/rhel_ibm--p8--kvm--03--guest--02-root /sysroot xfs rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota'

When nfs mount failed, it will try to mount sysroot as "systemctl start --no-block sysroot.mount", it will mount nfs again, instead of mounting the block device. So the mount will fail again, and dump_to_rootfs fails, which is unexpected.

see the 2nd kernel log:

[ 130.266759] mount[563]: mount.nfs: Connection refused
[ 130.268887] systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
[FAILED] Failed to mount /sysroot.
See 'systemctl status sysroot.mount' for details.
[ 130.272324] systemd[1]: Failed to mount /sysroot.
...
[ 256.480662] kdump[700]: failed to dump to '/sysroot', it's not a mount point!
[FAILED] Failed to start Kdump Vmcore Save Service.
See 'systemctl status kdump-capture.service' for details.
...
[ 257.293836] systemd[1]: sysroot.mount: Directory /sysroot to mount over is not empty, mounting anyway.
[ 257.296748] systemd[1]: Mounting /sysroot...
[ 257.304670] mount[776]: mount.nfs: rpc.statd is not running but is required for remote locking.
...
[ 347.314093] kdump[981]: Failed to mount rootfs
[ 347.316864] kdump[983]: Executing final action systemctl reboot -f

Hi Tao, I finally found your reply here. Thanks for trying the idea. It is fair enough if you cannot use one mount point. Anyway, I still have some questions about the mounting; I added another comment in your new version.

liutgnu commented 2 months ago

Hi Dave, I just force-pushed the v5 a few hours ago. I tested the fadump case, and mainly fixed its erroneous mounting issues. Please check out the added code comments in mkdumprd.

liutgnu commented 2 months ago

Hi Philipp,

Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up for 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say we have an nfs/ssh server, and several machines dump vmcores/status files to it. Since we need to verify whether the kdump test was successful for one specific machine, we need to identify that the vmcore/status file was created by it, not by others. Frankly there is no stable way to do that; the ip+date string isn't a good method, e.g. in the following case: physical machine_a hosts vm1 and vm2 on its NAT network, physical machine_b hosts vm3 and vm4 on its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address, 10.0.2.15. How can we know whether one vmcore/status file belongs to vm1 or vm3? My previous solution was to create a special string (a timestamp) in the kernel before triggering the kernel crash, and check whether the string exists in the vmcore-dmesg.txt file, to identify which machine it belongs to.

2) Though [1] is doable, it needs to access the vmcore server frequently (on each check of whether the kdump test succeeded or not), and QE/customers don't want us to access the vmcore server too much unless we are dumping a vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds of vmcore-dmesg.txt files. It is slow, and the nfs server may not be mounted as well...

So saving the status file locally is fast and has no problem determining which machine it belongs to, unlike saving it remotely.

prudo1 commented 2 months ago

Hi Philipp,

Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up due to the 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say if we have a nfs/ssh server, several machines will dump vmcore/status file to it. Since we need to verify if kdump test is successful for one specific machine, we need to identify the vmcore/status file is created by it, no by others. Frankly there is no stable way to do that, the ip+date string isn't a good method, e.g the following case: physical machine_a hosted vm1 and vm2 by its NAT network, physical machine_b hosted vm3 and vm4 by its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address as 10.0.2.15. How can we know if one vmcore/status file belongs to vm1 or vm3? My previous solution is to create a special string(a timestamp) in kernel before trigger kernel crash, and check the vmcore-dmesg.txt file if the string exists, to identify its belonging.

Right, the ip+date is a pretty poor method to identify the system that dumped. My idea was to replace the ip with the machine-id. The machine-id is determined at the first boot after installation and then stays stable forever, at least for "normal" installations. I'm not 100% sure how it behaves for container images where the rootfs is ro. If I understand the man page correctly, the machine-id is then determined on every boot, so it won't be stable. But in that scenario the ip probably wouldn't be stable either. So we could switch from ip+date to machine-id+date for the dumps and machine-id.status for the status files. Then we also don't need to check all the vmcore-dmesg.txt files.

What do you think?
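
Just to make the naming proposal concrete, a rough sketch (not from the patch;
$TARGET_DIR is a placeholder for wherever the dump target is mounted):

    # derive the dump dir and status file names from the machine-id instead of the ip
    id=$(cat /etc/machine-id)
    dump_dir="$TARGET_DIR/${id}-$(date +%Y-%m-%d-%T)"
    status_file="$TARGET_DIR/${id}.status"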

2. Though [1] is doable, but it need to access the vmcore server frenqently(each checking of kdump test success or not), QE/customers don't want us to access the vmcore server too much, unless we are dumping vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds vmcore-dmesg.txt files. It is slow, and nfs-server may not mounted as well...

True, but at least for ssh targets we already connect to them every time we start to check whether the network is up and running. So in that case there won't be too many additional connections. But then there is kdumpctl status where we don't know whether we can reach the target. And of course nfs targets...

So saving the status file locally is fast and have no problem to determine its belonging than saving remotely.

True, but it also means that we need to mount a disk locally for cases where we didn't need it before. Which will require additional memory and thus increases the chance that the kdump kernel runs oom...

liutgnu commented 2 months ago

Hi Philipp, Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up due to the 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say if we have a nfs/ssh server, several machines will dump vmcore/status file to it. Since we need to verify if kdump test is successful for one specific machine, we need to identify the vmcore/status file is created by it, no by others. Frankly there is no stable way to do that, the ip+date string isn't a good method, e.g the following case: physical machine_a hosted vm1 and vm2 by its NAT network, physical machine_b hosted vm3 and vm4 by its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address as 10.0.2.15. How can we know if one vmcore/status file belongs to vm1 or vm3? My previous solution is to create a special string(a timestamp) in kernel before trigger kernel crash, and check the vmcore-dmesg.txt file if the string exists, to identify its belonging.

Right, the ip+date is a pretty poor method to identify the system that dumped. My idea was to replace the ip with the machine-id. The machine-id is determined at the first boot after installation and then stays stable for ever. At least for "normal" installations. I'm not a 100% sure how it behaves for container images where the rootfs is ro. If I understand the man page correctly the machine-id is then determined every time on boot. So it won't be stable. But in that scenario the ip probably wouldn't be stable as well. So we could switch from ip+date to machine-id+date for the dumps and machine-id.status for the status files. Then we also don't need to check all the vmcore-dmesg.txt.

What do you think?

I guess we cannot do that; the machine-id should be encrypted and shouldn't be exposed to the outside. See "man machine-id", as I quote some of it:

This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network...

If we do use some hash/encrypt approach before using the machine-id, I guess it will introduce extra complexity to the kdump function.

2. Though [1] is doable, but it need to access the vmcore server frenqently(each checking of kdump test success or not), QE/customers don't want us to access the vmcore server too much, unless we are dumping vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds vmcore-dmesg.txt files. It is slow, and nfs-server may not mounted as well...

True, but at least for ssh targets we already connect to them every time we start to check whether the network is up and running. So in that case there won't be too many additional connections. But then there is kdumpctl status where we don't know whether we can reach the target. And of course nfs targets...

Yeah, since CEE will use "kdumpctl test" often, and I guess nfs targets are their priority, we need to come up with a universal approach, not only for local/ssh, but also for nfs targets...

So saving the status file locally is fast and have no problem to determine its belonging than saving remotely.

True, but it also means that we need to mount a disk locally for cases where we didn't need it before. Which will require additional memory and thus increases the chance that the kdump kernel runs oom...

Yes, agreed, but I guess it is OK to increase the crashkernel size a little bit in order to support this.

prudo1 commented 2 months ago

Hi Philipp, Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up due to the 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say if we have a nfs/ssh server, several machines will dump vmcore/status file to it. Since we need to verify if kdump test is successful for one specific machine, we need to identify the vmcore/status file is created by it, no by others. Frankly there is no stable way to do that, the ip+date string isn't a good method, e.g the following case: physical machine_a hosted vm1 and vm2 by its NAT network, physical machine_b hosted vm3 and vm4 by its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address as 10.0.2.15. How can we know if one vmcore/status file belongs to vm1 or vm3? My previous solution is to create a special string(a timestamp) in kernel before trigger kernel crash, and check the vmcore-dmesg.txt file if the string exists, to identify its belonging.

Right, the ip+date is a pretty poor method to identify the system that dumped. My idea was to replace the ip with the machine-id. The machine-id is determined at the first boot after installation and then stays stable for ever. At least for "normal" installations. I'm not a 100% sure how it behaves for container images where the rootfs is ro. If I understand the man page correctly the machine-id is then determined every time on boot. So it won't be stable. But in that scenario the ip probably wouldn't be stable as well. So we could switch from ip+date to machine-id+date for the dumps and machine-id.status for the status files. Then we also don't need to check all the vmcore-dmesg.txt. What do you think?

I guess we cannot do that, the machine-id should be encrpted and shouldn't be exposed to outeside. See the "man machine-id", as I quote some of it:

This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network...

Yeah, I read that as well. But I believe that for our use case we can assume that the dump target can be trusted even when it's a remote target. Reason is that, if we cannot trust the dump target we cannot write the dump there as well. In the end a kernel dump will always contain a lot of confidential information. So if we write a dump to an untrusted target the machine-id is one of the smaller problems we have.

If we do using some hash/encrypt approach before using machine-id, I guess it will introduce extra complexity for kdump function.

Fully agree, but as said above IMHO that shouldn't be necessary.

2. Though [1] is doable, but it need to access the vmcore server frenqently(each checking of kdump test success or not), QE/customers don't want us to access the vmcore server too much, unless we are dumping vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds vmcore-dmesg.txt files. It is slow, and nfs-server may not mounted as well...

True, but at least for ssh targets we already connect to them every time we start to check whether the network is up and running. So in that case there won't be too many additional connections. But then there is kdumpctl status where we don't know whether we can reach the target. And of course nfs targets...

Yeah, since CEE will use "kdumpctl test" often, and I guess nfs targets are their priority. So we need to come up with a universal approach, not only for local/ssh, but also nfs target...

True...

So saving the status file locally is fast and have no problem to determine its belonging than saving remotely.

True, but it also means that we need to mount a disk locally for cases where we didn't need it before. Which will require additional memory and thus increases the chance that the kdump kernel runs oom...

Yes agreed, but I guess it is OK to increase the creashkernel size a little bit in order to support this.

Yes, I guess so...

liutgnu commented 2 months ago

Now the v7 is posted

liutgnu commented 2 months ago

Hi Philipp, Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up due to the 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say if we have a nfs/ssh server, several machines will dump vmcore/status file to it. Since we need to verify if kdump test is successful for one specific machine, we need to identify the vmcore/status file is created by it, no by others. Frankly there is no stable way to do that, the ip+date string isn't a good method, e.g the following case: physical machine_a hosted vm1 and vm2 by its NAT network, physical machine_b hosted vm3 and vm4 by its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address as 10.0.2.15. How can we know if one vmcore/status file belongs to vm1 or vm3? My previous solution is to create a special string(a timestamp) in kernel before trigger kernel crash, and check the vmcore-dmesg.txt file if the string exists, to identify its belonging.

Right, the ip+date is a pretty poor method to identify the system that dumped. My idea was to replace the ip with the machine-id. The machine-id is determined at the first boot after installation and then stays stable for ever. At least for "normal" installations. I'm not a 100% sure how it behaves for container images where the rootfs is ro. If I understand the man page correctly the machine-id is then determined every time on boot. So it won't be stable. But in that scenario the ip probably wouldn't be stable as well. So we could switch from ip+date to machine-id+date for the dumps and machine-id.status for the status files. Then we also don't need to check all the vmcore-dmesg.txt. What do you think?

I guess we cannot do that, the machine-id should be encrpted and shouldn't be exposed to outeside. See the "man machine-id", as I quote some of it: This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network...

Yeah, I read that as well. But I believe that for our use case we can assume that the dump target can be trusted even when it's a remote target. Reason is that, if we cannot trust the dump target we cannot write the dump there as well. In the end a kernel dump will always contain a lot of confidential information. So if we write a dump to an untrusted target the machine-id is one of the smaller problems we have.

OK, agreed. However, one thing I noticed from my fedora:37 container is that the systemd package is not installed:

[root@3c6570f17fba system]# rpm -qa|grep systemd
systemd-libs-251.10-588.fc37.x86_64

So there is no systemd-machine-id-commit.service to generate the machine-id; instead we got an empty file:

[root@3c6570f17fba system]# rpm -qf /etc/machine-id
file /etc/machine-id is not owned by any package
[root@3c6570f17fba system]# ls -ail /etc/machine-id
3046437 -rw-r--r--. 1 root root 0 Feb 5 2023 /etc/machine-id

I guess machine-id is not a more stable approach for containers than ip+date...

If we do using some hash/encrypt approach before using machine-id, I guess it will introduce extra complexity for kdump function.

Fully agree, but as said above IMHO that shouldn't be necessary.

2. Though [1] is doable, but it need to access the vmcore server frenqently(each checking of kdump test success or not), QE/customers don't want us to access the vmcore server too much, unless we are dumping vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds vmcore-dmesg.txt files. It is slow, and nfs-server may not mounted as well...

True, but at least for ssh targets we already connect to them every time we start to check whether the network is up and running. So in that case there won't be too many additional connections. But then there is kdumpctl status where we don't know whether we can reach the target. And of course nfs targets...

Yeah, since CEE will use "kdumpctl test" often, and I guess nfs targets are their priority. So we need to come up with a universal approach, not only for local/ssh, but also nfs target...

True...

So saving the status file locally is fast and have no problem to determine its belonging than saving remotely.

True, but it also means that we need to mount a disk locally for cases where we didn't need it before. Which will require additional memory and thus increases the chance that the kdump kernel runs oom...

Yes agreed, but I guess it is OK to increase the creashkernel size a little bit in order to support this.

Yes, I guess so...

liutgnu commented 2 months ago

Hi @prudo1, kindly ping. Could you please review the patch? If you have no time, then we can ask for others' help. Thanks!

liutgnu commented 2 months ago

Hi @prudo1 ,

Hi Philipp, Thanks a lot for your careful review!

Hi Tao, here are some comments from my first round. I still need to think about a few things, e.g. I don't know if it really makes sense to store the status file on the rootfs. My first thought was that it would be better to store it on the target, next to the dumps instead. IMHO, that would make tracking the status easier, when you have a big fleet dumping to a remote host, e.g. via ssh. But I'm not entirely sure if we can assume that the target is always mounted/accessible. Especially when dumping to nfs.

Yes, saving the status file onto the target was my first option, but I gave it up due to the 2 reasons: 1) It is difficult to identify 'who created the vmcore/status file'. Say if we have a nfs/ssh server, several machines will dump vmcore/status file to it. Since we need to verify if kdump test is successful for one specific machine, we need to identify the vmcore/status file is created by it, no by others. Frankly there is no stable way to do that, the ip+date string isn't a good method, e.g the following case: physical machine_a hosted vm1 and vm2 by its NAT network, physical machine_b hosted vm3 and vm4 by its NAT network, and physical machine_c is for vmcore hosting. So vm1 and vm3 can have the same ip address as 10.0.2.15. How can we know if one vmcore/status file belongs to vm1 or vm3? My previous solution is to create a special string(a timestamp) in kernel before trigger kernel crash, and check the vmcore-dmesg.txt file if the string exists, to identify its belonging.

Right, the ip+date is a pretty poor method to identify the system that dumped. My idea was to replace the ip with the machine-id. The machine-id is determined at the first boot after installation and then stays stable for ever. At least for "normal" installations. I'm not a 100% sure how it behaves for container images where the rootfs is ro. If I understand the man page correctly the machine-id is then determined every time on boot. So it won't be stable. But in that scenario the ip probably wouldn't be stable as well. So we could switch from ip+date to machine-id+date for the dumps and machine-id.status for the status files. Then we also don't need to check all the vmcore-dmesg.txt. What do you think?

I guess we cannot do that, the machine-id should be encrpted and shouldn't be exposed to outeside. See the "man machine-id", as I quote some of it: This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network...

Yeah, I read that as well. But I believe that for our use case we can assume that the dump target can be trusted even when it's a remote target. Reason is that, if we cannot trust the dump target we cannot write the dump there as well. In the end a kernel dump will always contain a lot of confidential information. So if we write a dump to an untrusted target the machine-id is one of the smaller problems we have.

OK, agreed. However, one thing I noticed from my fedora:37 container is that the systemd package is not installed:

[root@3c6570f17fba system]# rpm -qa|grep systemd
systemd-libs-251.10-588.fc37.x86_64

So there is no systemd-machine-id-commit.service to generate the machine-id; instead we got an empty file:

[root@3c6570f17fba system]# rpm -qf /etc/machine-id
file /etc/machine-id is not owned by any package
[root@3c6570f17fba system]# ls -ail /etc/machine-id
3046437 -rw-r--r--. 1 root root 0 Feb 5 2023 /etc/machine-id

I guess machine-id is not a stable approach for containers than ip+date...

I re-thought about the machine-id approach, and I think the above container case is not correct: kdump is not supposed to run within a container. However, there are cases such as CoreOS, which uses ostree layers for the fs. Though CoreOS mounts /etc as writable, i.e. the machine-id can be generated on the first boot and stays there, I'm not sure if there are cases which mount /etc as read-only, or whether users can make such a customization. I'm a newbie to ostree :(

The ip+date string, though very poor, is better for human reading. E.g. /var/crash/10.0.2.15-2024-07-11-07:54:41 is much more readable than /var/crash/machine-id-1d4c642df43846/. The former indicates the machine is more likely to be a VM, so if we are searching for a vmcore dumped by a bare-metal machine, this one can easily be skipped.

So IMHO I don't have a very strong motivation for replacing ip+date with a machine-id string for vmcore folder naming for now.

Other than machine-id for vmcore labeling, Dave gave me a different solution: 1) before triggering the kernel panic, we generate a unique string and put it into a file; 2) mkdumprd includes the file in initramfs-kdump.img; 3) in the 2nd kernel, after creating the vmcore, cp the file next to the vmcore; 4) kdumpctl status can check whether the string next to the vmcore is the expected one. The file can also tie in with kdump.status, if we draft the file content in some format. This method is easy and doable, if we do want to save the kdump.status onto the remote machine.
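
A rough sketch of that 4-step flow, with all file names and variables being
hypothetical placeholders (this is not code from the patch, only an
illustration of the idea):

    # 1) before triggering the test crash, record a unique token locally
    token="$(date +%s)-$RANDOM"
    echo "$token" > /etc/kdump-test-token          # hypothetical path
    # 2) mkdumprd would include /etc/kdump-test-token in initramfs-kdump.img
    # 3) in the 2nd kernel, the capture step would copy it next to the vmcore:
    #      cp /etc/kdump-test-token "$DUMP_DIR"/kdump-test-token
    # 4) later, kdumpctl status compares the token found next to the vmcore
    #    with the locally recorded one ($DUMP_DIR is a placeholder):
    if [[ "$(cat "$DUMP_DIR"/kdump-test-token)" == "$(cat /etc/kdump-test-token)" ]]; then
        echo "vmcore in $DUMP_DIR was created by this machine's kdump test"
    fi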

Personally I would prefer to save the kdump.status onto a local drive rather than a remote one, just because it makes "kdumpctl status" work faster and more reliably, since we don't need to care about the network status after "kdumpctl test".

Any comments?

Thanks, Tao Liu

If we do using some hash/encrypt approach before using machine-id, I guess it will introduce extra complexity for kdump function.

Fully agree, but as said above IMHO that shouldn't be necessary.

2. Though [1] is doable, but it need to access the vmcore server frenqently(each checking of kdump test success or not), QE/customers don't want us to access the vmcore server too much, unless we are dumping vmcore to it. In addition, accessing a remote server is slow, especially when there are hundreds of vmcores, which means we need to find the special string within hundreds vmcore-dmesg.txt files. It is slow, and nfs-server may not mounted as well...

True, but at least for ssh targets we already connect to them every time we start to check whether the network is up and running. So in that case there won't be too many additional connections. But then there is kdumpctl status where we don't know whether we can reach the target. And of course nfs targets...

Yeah, since CEE will use "kdumpctl test" often, and I guess nfs targets are their priority. So we need to come up with a universal approach, not only for local/ssh, but also nfs target...

True...

So saving the status file locally is fast and have no problem to determine its belonging than saving remotely.

True, but it also means that we need to mount a disk locally for cases where we didn't need it before. Which will require additional memory and thus increases the chance that the kdump kernel runs oom...

Yes agreed, but I guess it is OK to increase the creashkernel size a little bit in order to support this.

Yes, I guess so...

liutgnu commented 1 month ago

Hi @liutgnu, this PR is almost perfect. Just a few very small nits. The biggest "problems" I see is that with #33 merged you need to rebase to the latest main, as there will be some conflicts. In addition you need to add the new sub-command to kdumpctl.8 man page. With that fixed the PR can be merged. Thanks!

Hi @prudo1 , thanks a lot for your review. I will get those improved in v8

liutgnu commented 1 month ago

Hi @liutgnu, this PR is almost perfect. Just a few very small nits. The biggest "problems" I see is that with #33 merged you need to rebase to the latest main, as there will be some conflicts. In addition you need to add the new sub-command to kdumpctl.8 man page. With that fixed the PR can be merged. Thanks!

Hi @prudo1 , thanks a lot for your review. I will get those improved in v8

Hi @prudo1 , please see the v8 I just posted. Thanks in advance!