Closed spetrosi closed 3 years ago
@pcahyna, I specified KDUMP_BOOTDIR in /etc/sysconfig/kdump and it solved the failing tests_defaults.yml. Now tests_ssh.yml fails. Any idea how to fix this?
@spetrosi, that's a weird error: "The ifcfg-dev or ifcfg-xxx which contains DEVICE=dev field doesn't exist."
The code now looks good, and still solves the original problem. For the newly discovered one, I think you will have to do some local debugging.
Apparently, RHEL 6 does not work either. The tasks Generate /etc/kdump.conf and Generate /etc/sysconfig/kdump were skipped because the /sys/kernel/kexec_crash_size file contains 0, and the configs are generated only when the value in this file is greater than 0.
This happens because by default kdump does not start on boot.
This can be solved by running chkconfig kdump on and rebooting the system. Then the required amount of memory is reserved for kdump and tests_defaults.yml runs normally.
This, however, does not fix the tests_ssh check; this test still fails on RHEL 6 with the same error as on CentOS 6.
On CentOS 6, kdump starts on boot by default, and hence tests_defaults runs normally.
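The precondition discussed above (a non-zero value in /sys/kernel/kexec_crash_size) can be checked by hand before running the role. The following is only a minimal sketch: the function name crash_memory_reserved is hypothetical and is not part of the role, and the optional path argument exists only so the check can be tried against an ordinary file.

```shell
#!/bin/sh
# Hypothetical helper, not part of the role: succeeds (exit status 0) when
# the given file holds a crash-kernel size greater than 0.
# The path defaults to the sysfs file named in this thread.
crash_memory_reserved() {
    size_file="${1:-/sys/kernel/kexec_crash_size}"
    [ -r "$size_file" ] || return 1
    size=$(cat "$size_file")
    # Reject empty or non-numeric content before the numeric comparison.
    case "$size" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$size" -gt 0 ]
}
```

On a system where the function fails, the thread's workaround applies: run chkconfig kdump on, reboot, and check again.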
I think that for RHEL 6 the role must configure everything regardless of the value in /sys/kernel/kexec_crash_size and then set the reboot_required: yes fact.
Our CI tests can do the reboot to test RHEL 6 properly.
@pcahyna how does this sound?
This also kinda goes beyond #74; what is the usual workflow in this case? I can continue working within this PR because it's small anyway. Or is it better to raise new issues and PRs for the tests_ssh.yml and RHEL 6 issues?
Thank you
Apparently, RHEL 6 does not work either. The tasks Generate /etc/kdump.conf and Generate /etc/sysconfig/kdump got skipped because the /sys/kernel/kexec_crash_size file has 0 in it, and configs generate only when the value in this file is greater than 0. This happens because by default kdump does not start on boot. This can be solved by running chkconfig kdump on and rebooting the system. Then, a required amount of memory is reserved for kdump and tests_default runs normally.
Do you know why it is needed for RHEL 6 and not CentOS 6? I would expect them to be identical.
Because in RHEL 6 kdump is not enabled to start on boot, and in CentOS 6 it is. Default in RHEL 6:
# chkconfig --list kdump
kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Default in CentOS 6:
# chkconfig --list kdump
kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off
However, after enabling kdump to start on boot, it also gets 2:on:
[root@ibm-p8-kvm-03-guest-02 ~]# chkconfig kdump on
[root@ibm-p8-kvm-03-guest-02 ~]# chkconfig --list kdump
kdump 0:off 1:off 2:on 3:on 4:on 5:on 6:off
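The runlevel comparison above can also be done mechanically. This is a rough sketch that works on one line of chkconfig --list output in the format shown in this thread; the function name runlevel_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "on" or "off" for one runlevel.
# $1: one line of `chkconfig --list <service>` output
# $2: runlevel digit (0-6)
runlevel_state() {
    echo "$1" | awk -v lvl="$2" '{
        # Field 1 is the service name; the rest are <runlevel>:<state> pairs.
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[1] == lvl) { print kv[2]; exit }
        }
    }'
}

# The two default lines quoted above:
runlevel_state 'kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off' 3   # off
runlevel_state 'kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off' 3      # on
```

On a live system the first argument would come from chkconfig --list kdump itself.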
Because in RHEL 6 kdump is not enabled to start on boot, and in CentOS 6 it is.
Some difference in how the qcow images are built?
I have a rhel6 latest and a centos6 latest system.
both have kexec-tools installed by default.
rhel6: chkconfig --list kdump: kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off
centos6: chkconfig --list kdump: kdump 0:off 1:off 2:off 3:on 4:on 5:on 6:off
Neither system has kdump running - ps -ef|grep kdump is empty
Both systems have /sys/kernel/kexec_crash_size = 135266304
That is good: if /sys/kernel/kexec_crash_size is not zero by default, then this role will work fine on RHEL 6. This is something, but the image in the CI does not have this by default; no idea why. We can fix it from our side by enabling kdump and restarting the image. Is it possible to request this to be fixed within the image? RHEL 6 is EOL and it seems not realistic to me, but please correct me if I am wrong.
@pcahyna I think I have found why tests_ssh.yml does not work on EL 6.
/sbin/mkdumprd, on lines 1745 and 1756, does the following to find the device that it must use to connect to the remote server to store kdump logs:
netdev=`/sbin/ip route get to $remoteip 2>&1`
netdev=`echo $netdev|awk '{print $3}'|head -n 1`
In reality, the first command has this output:
[root@ibm-p8-kvm-03-guest-02 ~]# /sbin/ip route get to 10.0.2.15
local 10.0.2.15 dev lo src 10.0.2.15
And the second command cuts it like that:
[root@ibm-p8-kvm-03-guest-02 ~]# /sbin/ip route get to 10.0.2.15 | awk '{print $3}'|head -n 1
dev
tests_ssh.yml points to localhost, which mkdumprd does not expect at all: it does not expect the local string at the beginning of the ip route get output. I came to this conclusion because when running ip route get against a real IP address within the same network, the third field picked by awk is eth0:
# ip route get 10.0.2.2
10.0.2.2 dev eth0 src 10.0.2.15
But setting netdev to lo here does not help either; kdump still fails to start in this case.
What helps, though, is setting netdev to an actual working interface, eth0. In this case, kdump starts normally.
The easy way to solve this is to manually set netdev=eth0 in /sbin/mkdumprd. How does that sound?
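The parsing problem can be reproduced without touching the network by feeding the two ip route get outputs quoted above into the same awk expression. The sketch below is illustrative only: third_field mirrors the positional parsing from the mkdumprd lines, while device_after_dev is a hypothetical alternative that is not in mkdumprd. As noted above, resolving the device to lo still does not let kdump start, so this explains the error message rather than fixing the test.

```shell
#!/bin/sh
# Positional parsing, as in the two mkdumprd lines quoted above.
third_field() {
    echo "$1" | awk '{print $3}' | head -n 1
}

# Hypothetical alternative: print the word that follows the "dev" keyword,
# wherever it appears in the line.
device_after_dev() {
    echo "$1" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Sample outputs copied from this thread.
local_route='local 10.0.2.15 dev lo src 10.0.2.15'
remote_route='10.0.2.2 dev eth0 src 10.0.2.15'

third_field "$local_route"        # dev
third_field "$remote_route"       # eth0
device_after_dev "$local_route"   # lo
device_after_dev "$remote_route"  # eth0
```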
We can fix it from our side by enabling kdump and restarting the image.
Ok - we can do that in the test.
Is it possible to request this to be fixed within the image? RHEL 6 is EOL and it seems not realistic to me, but please correct me if I am wrong.
Extremely difficult, if not impossible, to fix in the image.
@pcahyna I implemented my idea and the checks have succeeded. Please review.
lgtm - but I'll defer to @pcahyna for final approval
I believe you should use only the first two commits. The commits that touch tests seem to be just papering over an actual problem in the role that should be fixed separately instead of being hidden by a test change. Also, the first two commits should be squashed together and the commit message updated before merging.
Done
[citest]
[citest bad]
[citest pending]
[citest bad]
Thanks, @pcahyna, I am merging because the test failure is expected due to the above discussions.
Fix the kdump.j2 template to work on CentOS 6 too by changing ansible_distribution == 'RedHat' to ansible_os_family == 'RedHat'.
Enable kdump and perform a reboot on RHEL 6 because by default the RHEL 6 image in the CI does not have memory reserved for kdump. This issue only happens with the image; a default installation of RHEL 6 has memory reserved, so fixing it in tests only.
For tests_ssh.yml, manually specify the netdev variable in /sbin/mkdumprd because by default mkdumprd does not expect the host to connect to itself via SSH. This is an issue with the CI only because the test is single-host.