linux-system-roles / kdump

An ansible role which configures kdump
https://linux-system-roles.github.io/kdump/
MIT License

Add `KDUMP_BOOTDIR="/boot"` to `/etc/sysconfig/kdump` for RedHat OSes <7 #76

Closed spetrosi closed 3 years ago

spetrosi commented 3 years ago

Fix the kdump.j2 template to work on CentOS 6 too by changing ansible_distribution == 'RedHat' to ansible_os_family == 'RedHat'.

Enable kdump and perform a reboot on RHEL 6, because the RHEL 6 image in the CI does not have memory reserved for kdump by default. This issue only affects that image; a default installation of RHEL 6 has memory reserved, so it is fixed in the tests only.

For tests_ssh.yml, manually specify the netdev variable in /sbin/mkdumprd, because mkdumprd does not expect the host to connect to itself via SSH. This is an issue in the CI only, because the test is single-host.

spetrosi commented 3 years ago

@pcahyna, I specified KDUMP_BOOTDIR in /etc/sysconfig/kdump and it solved the failing tests_defaults.yml. Now tests_ssh.yml fails. Any idea how to fix this?

pcahyna commented 3 years ago

@spetrosi , that's a weird error. "The ifcfg-dev or ifcfg-xxx which contains DEVICE=dev field doesn't exist."

pcahyna commented 3 years ago

The code now looks good, and still solves the original problem. For the newly discovered one, I think you will have to do some local debugging.

spetrosi commented 3 years ago

Apparently, RHEL 6 does not work either. The tasks Generate /etc/kdump.conf and Generate /etc/sysconfig/kdump got skipped because the /sys/kernel/kexec_crash_size file has 0 in it, and the configs are generated only when the value in this file is greater than 0. This happens because by default kdump does not start on boot. This can be solved by running chkconfig kdump on and rebooting the system. Then the required amount of memory is reserved for kdump and tests_defaults runs normally.

This, however, does not fix the tests_ssh check; this test still fails on RHEL 6 with the same error as on CentOS 6. On CentOS 6, kdump starts on boot by default and hence tests_defaults runs normally.

I think that for RHEL 6 the role must configure everything regardless of the value in /sys/kernel/kexec_crash_size and then set the reboot_required: yes fact. Our CI tests can do the reboot to test RHEL 6 properly. @pcahyna how does this sound?

This also kind of goes beyond #74. What is the usual workflow in this case? I can continue working within this PR because it's small anyway. Or is it better to raise new issues and PRs for the tests_ssh.yml and RHEL 6 issues? Thank you
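The condition described above can be sketched in shell. This is only an illustration of the "greater than 0" check on /sys/kernel/kexec_crash_size, not the role's actual task logic; the function name `crash_memory_reserved` and its optional path argument are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given file holds a number greater than 0.
# Defaults to the real sysfs path; another path can be passed in for testing.
crash_memory_reserved() {
    size_file="${1:-/sys/kernel/kexec_crash_size}"
    [ -r "$size_file" ] || return 1
    size=$(cat "$size_file")
    # A non-numeric or empty value makes the test fail, which counts as "not reserved".
    [ "$size" -gt 0 ] 2>/dev/null
}

if crash_memory_reserved; then
    echo "crash memory reserved: the kdump configs would be generated"
else
    echo "no crash memory reserved: 'chkconfig kdump on' plus a reboot was needed here"
fi
```

On a host where the file contains 0, the second branch is taken, which matches the skipped tasks described in this comment.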

pcahyna commented 3 years ago

Apparently, RHEL 6 does not work either. The tasks Generate /etc/kdump.conf and Generate /etc/sysconfig/kdump got skipped because the /sys/kernel/kexec_crash_size file has 0 in it, and the configs are generated only when the value in this file is greater than 0. This happens because by default kdump does not start on boot. This can be solved by running chkconfig kdump on and rebooting the system. Then the required amount of memory is reserved for kdump and tests_defaults runs normally.

Do you know why it is needed for RHEL 6 and not CentOS 6? I would expect them to be identical.

spetrosi commented 3 years ago

Do you know why it is needed for RHEL 6 and not CentOS 6? I would expect them to be identical.

Because in RHEL 6 kdump is not enabled to start on boot, and in CentOS 6 it is. Default in RHEL 6:

# chkconfig --list kdump
kdump           0:off   1:off   2:off   3:off   4:off   5:off   6:off

Default in CentOS 6:

# chkconfig --list kdump
kdump           0:off   1:off   2:off   3:on    4:on    5:on    6:off

However, after enabling kdump to start on boot, it also gets 2:on:

[root@ibm-p8-kvm-03-guest-02 ~]# chkconfig kdump on 
[root@ibm-p8-kvm-03-guest-02 ~]# chkconfig --list kdump
kdump           0:off   1:off   2:on    3:on    4:on    5:on    6:off
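The runlevel comparison above can also be checked mechanically. The following is a minimal sketch, assuming the `chkconfig --list` line format shown in this comment; `enabled_in_runlevel` is a hypothetical helper name, and the sample lines are copied from the outputs above rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a `chkconfig --list` line shows the
# service switched on in the given runlevel.
enabled_in_runlevel() {
    echo "$1" | awk -v want="$2:on" '
        { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }'
}

# Sample lines taken from the outputs in this thread.
rhel6_ci='kdump           0:off   1:off   2:off   3:off   4:off   5:off   6:off'
centos6='kdump           0:off   1:off   2:off   3:on    4:on    5:on    6:off'

for name in rhel6_ci centos6; do
    eval "line=\$$name"
    if enabled_in_runlevel "$line" 3; then
        echo "$name: kdump is on in runlevel 3"
    else
        echo "$name: kdump is off in runlevel 3"
    fi
done
```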

richm commented 3 years ago

Because in RHEL 6 kdump is not enabled to start on boot, and in CentOS 6 it is.

Some difference in how the qcow images are built?

richm commented 3 years ago

I have a rhel6 latest and a centos6 latest system. Both have kexec-tools installed by default.

rhel6: chkconfig --list kdump:
kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off

centos6: chkconfig --list kdump:
kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off

Neither system has kdump running - ps -ef|grep kdump is empty.
Both systems have /sys/kernel/kexec_crash_size = 135266304

spetrosi commented 3 years ago

I have a rhel6 latest and a centos6 latest system. Both have kexec-tools installed by default.

rhel6: chkconfig --list kdump:
kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off

centos6: chkconfig --list kdump:
kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off

Neither system has kdump running - ps -ef|grep kdump is empty.
Both systems have /sys/kernel/kexec_crash_size = 135266304

That is good: if /sys/kernel/kexec_crash_size is not zero by default, then this role will work fine on RHEL 6. The image in the CI, however, does not have memory reserved by default, and I have no idea why. We can fix it from our side by enabling kdump and restarting the image. Is it possible to request this to be fixed within the image? RHEL 6 is EOL and it seems not realistic to me, but please correct me if I am wrong.

spetrosi commented 3 years ago

@pcahyna I think I have found why tests_ssh.yml does not work on EL 6. /sbin/mkdumprd on lines 1745 and 1756 does the following to find the device that it must use to connect to the remote server to store kdump logs:

netdev=`/sbin/ip route get to $remoteip 2>&1`
netdev=`echo $netdev|awk '{print $3}'|head -n 1`

In reality, the first command has this output:

[root@ibm-p8-kvm-03-guest-02 ~]# /sbin/ip route get to 10.0.2.15
local 10.0.2.15 dev lo  src 10.0.2.15 

And the second command cuts it like that:

[root@ibm-p8-kvm-03-guest-02 ~]# /sbin/ip route get to 10.0.2.15 | awk '{print $3}'|head -n 1
dev

tests_ssh.yml points to localhost, which mkdumprd does not expect at all: it does not expect the local string at the beginning of the ip route get output. I came to this conclusion because when running ip route get against a real IP address within the same network, the third field picked by awk is eth0:

# ip route get 10.0.2.2
10.0.2.2 dev eth0  src 10.0.2.15 

But setting netdev to lo here does not help either; kdump still fails to start in that case. What does help is setting netdev to an actual working interface, eth0. In this case, kdump starts normally. The easy way to solve this is to manually set netdev=eth0 in /sbin/mkdumprd. How does that sound?
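To make the parsing problem above easier to reproduce without a RHEL 6 host, here is a minimal sketch that feeds the two `ip route get` outputs from this comment through the same field-3 extraction. The function names are hypothetical, and `netdev_after_dev` is an illustrative alternative that is not part of mkdumprd.

```shell
#!/bin/sh
# Same extraction as the mkdumprd snippet quoted above: take the third field.
netdev_field3() {
    echo "$1" | awk '{print $3}' | head -n 1
}

# Hypothetical alternative (not in mkdumprd): print the word that follows "dev".
netdev_after_dev() {
    echo "$1" | awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Outputs copied from this thread.
local_route='local 10.0.2.15 dev lo  src 10.0.2.15'
remote_route='10.0.2.2 dev eth0  src 10.0.2.15'

netdev_field3 "$local_route"      # prints "dev"
netdev_field3 "$remote_route"     # prints "eth0"
netdev_after_dev "$local_route"   # prints "lo"
netdev_after_dev "$remote_route"  # prints "eth0"
```

As noted in this comment, getting `lo` instead of `dev` is still not enough for kdump to start, so this only illustrates why the third field is wrong for a local address; it is not a proposed fix.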

richm commented 3 years ago

We can fix it from our side by enabling kdump and restarting the image.

Ok - we can do that in the test.

Is it possible to request this to be fixed within the image? RHEL 6 is EOL and it seems not realistic to me, but please correct me if I am wrong.

Extremely difficult, if not impossible, to fix in the image.

spetrosi commented 3 years ago

@pcahyna I implemented my idea and the checks have succeeded. Please review.

richm commented 3 years ago

lgtm - but I'll defer to @pcahyna for final approval

pcahyna commented 3 years ago

I believe you should use only the first two commits. The commits that touch tests seem to be just papering over an actual problem in the role that should be fixed separately instead of being hidden by a test change. Also, the first two commits should be squashed together and the commit message updated before merging.

spetrosi commented 3 years ago

I believe you should use only the first two commits. The commits that touch tests seem to be just papering over an actual problem in the role that should be fixed separately instead of being hidden by a test change. Also, the first two commits should be squashed together and the commit message updated before merging.

Done

jharuda commented 3 years ago

[citest]

richm commented 3 years ago

[citest bad]

richm commented 3 years ago

[citest pending]

richm commented 3 years ago

[citest bad]

spetrosi commented 3 years ago

Thanks, @pcahyna, I am merging because the test failure is expected due to the above discussions.