CentOS CI "get well" plan

mrc0mmand commented 5 years ago

The purpose of this issue is to keep track of things that need to be done to make systemd CentOS CI work again.

The following things still need to be done:

[x] Run all tests from the upstream testsuite (TEST-*) [agent/testsuite.sh] (PR #19, update: 4dda41ecb7f686e1b7e6cc7345a708667bb4a9c0)
- [x] Investigate TEST-24-UNIT-TESTS (e4cb16694e96fd8fe5abd4c5f86b92d5f3f1829e)
[x] Sync downstream tests [agent/tests] (PR #17)
[x] Support beakerlib libraries (PR #17)
[x] Run the remaining tests from the test/ directory (issue #14, 195510e8dbc16d752a4247c29fe8a970fb88d8a4)
[x] Enable make check (1e0e1930e0b1b68cea90d2a445231d191d0c0838)
[ ] Investigate fails in the beakerlib testsuite [agent/systemd]
[ ] Import missing downstream sanity tests [agent/systemd]
[x] Use both qemu-kvm and systemd-nspawn for tests/TEST-??-* [agent/testsuite.sh]
- [x] systemd-nspawn: add user_namespace.enable=1 to the kernel cmdline and set user.max_user_namespaces > 0
- [x] qemu-kvm: install qemu-kvm and create symlink to /usr/libexec/qemu-kvm
[x] Export artifacts from the test machine (see https://wiki.centos.org/QaWiki/CI/GettingStarted#head-a46ee49e8818ef9b50225c4e9d429f7a079758d2) (PR #27)
[ ] Comprehensive landing page with results
[ ] Probably automate dependency rebuild in copr (right now it's more like a quick hack to make the CI work)
[x] Verify if it works in the CentOS CI environment (right now I'm using a local VM for quick tests) [waiting on ticket for credentials]
[x] Enable CI for this repository
[x] Re-enable the CentOS CI when everything works as expected
[ ] Setup and enable dracut crypt SSH

Long term goals:

[ ] Upstream all downstream RHEL tests (= drop annoying test syncing)

Notes:

- reverted back to CentOS kernel (3.10) - see https://github.com/systemd/systemd-centos-ci/issues/14#issuecomment-431558583, https://github.com/systemd/systemd/issues/10474
- added busybox to the copr repo

mrc0mmand commented 5 years ago

* [ ]  (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

@keszybz any thoughts on this one?

mrc0mmand commented 5 years ago

Failing tests:

~~TEST-01-BASIC~~
~~TEST-15-DROPIN~~
~~TEST-22-TMPFILES~~
~~TEST-24-UNIT-TESTS~~

Common error:

+ env --unset=UNIFIED_CGROUP_HIERARCHY /root/systemd-centos-ci/systemd/build/systemd-nspawn -U --private-network --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root /usr/lib/systemd/systemd
Spawning container unprivileged-nspawn-root on /var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root.
Press ^] three times within 1s to kill container.
Selected user namespace base 970129408 and range 65536.
Failed to fork inner child: Invalid argument
E: nspawn failed with exit code 1
-rw-r-----+ 1 root systemd-journal 8388608 Oct 21 15:07 /var/tmp/systemd-test.UnjLx7/journal/32e86a2de4e543fe8c41793961c76987/system.journal
make: *** [run] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/
--x-- Result of TEST-01-BASIC: 2 --x--

This happens even with user_namespace.enable=1

[root@host-8-251-180 systemd]# tr ' ' '\n' </proc/cmdline | grep user_namespace 
user_namespace.enable=1
[root@host-8-251-180 systemd]# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=40735eda-bc43-4610-961f-bc5c0353239a ro console=tty0 console=ttyS0,115200 crashkernel=auto net.ifnames=0 rhgb quiet LANG=en_US.UTF-8 user_namespace.enable=1

Workaround/fix:

# echo 10000 > /proc/sys/user/max_user_namespaces
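
The two prerequisites mentioned above (the kernel command-line flag and a non-zero namespace limit) could be checked before running the nspawn tests with something like the sketch below. The function name `userns_ready` is made up for illustration and is not part of the CI scripts; it takes the command line and the limit as arguments so it can be tried without touching the real system.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the CI scripts.
# userns_ready CMDLINE MAX_USER_NAMESPACES
# Returns 0 when CMDLINE contains user_namespace.enable=1 as a separate word
# and MAX_USER_NAMESPACES is greater than zero.
userns_ready() {
    local cmdline=$1 max=$2
    case " $cmdline " in
        *" user_namespace.enable=1 "*) ;;
        *) return 1 ;;
    esac
    [ "$max" -gt 0 ]
}

# On a real machine the inputs would come from procfs:
#   userns_ready "$(cat /proc/cmdline)" "$(cat /proc/sys/user/max_user_namespaces)" \
#       || echo "user namespaces are not usable yet"
```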

~~Persisting issues:~~ Fixed by installing missing dependencies (quota, net-tools)

# make -C test/TEST-22-TMPFILES/ setup
...
+ for _x in inst_symlink inst_script inst_binary inst_simple
+ inst_simple ldconfig.real
+ [[ -f ldconfig.real ]]
+ return 1
+ return 1
+ [[ yes = yes ]]
+ dinfo 'Skipping program ldconfig.real as it cannot be found and is' 'flagged to be optional'
+ set +x
I: Skipping program ldconfig.real as it cannot be found and is flagged to be optional
make: *** [setup] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/TEST-22-TMPFILES'

...

keszybz commented 5 years ago

[ ] (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]

"Both" is worthwhile, because different things are tested in both environments. But reliability is more important than having both, so if just one can be made to work, that's better than having flaky tests.

mrc0mmand commented 5 years ago

Notes from the "make the QEMU testsuite work again" session:

- run each test with the correct INITRD and KERNEL_IMG env vars (/boot/initramfs-$(uname -r).img and /boot/vmlinuz-$(uname -r) respectively)
- dracut includes filesystem modules ONLY for filesystems currently present in /etc/fstab (imo), which is xfs for the default CentOS installation. However, the QEMU testsuite uses an ext4 filesystem, which results in a boot failure for the respective virtual machine (dracut -f --filesystems ext4 to the rescue)
- switching between nspawn/QEMU is currently done by removing/creating the /usr/bin/qemu-kvm symlink (to /usr/libexec/qemu-kvm). Maybe there's a nicer way
- https://github.com/systemd/systemd/issues/10544
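
A rough sketch of the env-var and symlink notes above, assuming the test directories are driven through make as in the logs earlier in this thread. The function names are hypothetical and not taken from agent/testsuite.sh; the symlink path is passed as an optional argument so the toggle can be tried outside /usr/bin.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from agent/testsuite.sh.

# Print the environment assignments the QEMU tests expect for a kernel version.
qemu_test_env() {
    local kver=$1
    printf 'INITRD=/boot/initramfs-%s.img\n' "$kver"
    printf 'KERNEL_IMG=/boot/vmlinuz-%s\n' "$kver"
}

# Switch between QEMU and nspawn by creating or removing the qemu-kvm symlink.
# use_qemu on|off [LINK] [TARGET]
use_qemu() {
    local mode=$1
    local link=${2:-/usr/bin/qemu-kvm}
    local target=${3:-/usr/libexec/qemu-kvm}
    case "$mode" in
        on)  ln -sfn "$target" "$link" ;;
        off) rm -f "$link" ;;
        *)   echo "usage: use_qemu on|off [LINK] [TARGET]" >&2; return 2 ;;
    esac
}

# Possible use on the test machine (needs root):
#   dracut -f --filesystems ext4
#   use_qemu on
#   env $(qemu_test_env "$(uname -r)") make -C test/TEST-01-BASIC setup run
```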

mrc0mmand commented 5 years ago

@evverx With the help of several other people I finally got something, which could get things moving again - I'm going to propose this ticket at CentOS CBS meeting (every Monday, 2 PM UTC in #centos-devel@Freenode) and hopefully it will get us somewhere.

mrc0mmand commented 5 years ago

Apparently there was some error in communication, so I didn't receive the previous email. However I finally got the credentials, so we can start breaking things!

(OT: Is there any chat to catch you in (e.g. IRC, Telegram, etc.)? @evverx)

evverx commented 5 years ago

However I finally got the credentials, so we can start breaking things!

That's great news! Congratulations!

Is there any chat to catch you in (e.g. IRC, Telegram, etc.)?

I'm afraid it isn't possible to catch me there, but, on the positive side, I usually reply to comments on GitHub relatively fast.

mrc0mmand commented 5 years ago

Notes from the "why it doesn't work in CentOS CI infrastructure" session:

- the target nodes apparently don't like the new initrd image generated by upstream dracut - the machine won't boot after reboot (dropping the dracut initrd re-generation solves the issue for now); will investigate further
- it's not the upstream dracut, the same thing happens after dracut -f --regenerate-all with the downstream package
- I can either re-generate the initrd or install upstream systemd; if I do both, the system won't boot (and debugging boot issues without a serial console is wonderful...)

evverx commented 5 years ago

Could it be that you ran into https://github.com/systemd/systemd/issues/10854? There're two PRs that are supposed to fix the issue. Could you try applying one of them to see if it works?

yuwata commented 5 years ago

If the failure is caused by https://github.com/systemd/systemd/issues/10854, then please provide any logs or something if possible. Thank you.

yuwata commented 5 years ago

Another possibility is https://github.com/systemd/systemd/issues/10754...

mrc0mmand commented 5 years ago

Unfortunately, neither of the mentioned issues seems to be relevant in this case. I did a quick bisect, but the issue occurs all the way down to https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0 - without this commit systemd won't compile; I'll try to work around it tomorrow.

Also, I'll try to ask for some possibility to get any useful logs from the machine after it dies.

Anyway, in my opinion, the issue is somewhere in the multipath which is used for the root filesystem...

evverx commented 5 years ago

Regarding https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0, I think meson -Dnetworkd=false might help to get around it.

yuwata commented 5 years ago

Could you try to boot with udev.children_max=1?

mrc0mmand commented 5 years ago

Thanks a lot for the suggestions, unfortunately neither of them helped. -Dnetworkd=false excludes systemd-networkd from the compilation, but the sd-netlink still causes issues, and the issue still occurs even with udev.children_max=1.

I raised the post-mortem debugging issue on the CentOS CI Users mailing list so let's see if someone will be able to help.

In the meantime I'll play around with bisect in hopes I'll stumble upon the root cause...

mrc0mmand commented 5 years ago

Notes from the "why it doesn't work in CentOS CI infrastructure" session, part 2:

- first compilable and bootable commit found by bisecting: systemd/systemd@7692fed98b784be92f900151d867f0be0975e062
- it works all the way up to systemd/systemd@53cb501a1314740fa777f145067cefccda954487 where the known issue with compilation occurs, so the naughty commit is somewhere between systemd/systemd@53cb501a1314740fa777f145067cefccda954487 and https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0
- using a simple workaround[0] I should be able to bisect the problematic part
- and the winner is apparently systemd/systemd@759d9f3f8d07af2940bb3783acc3985ee47adfa5 which makes systemd/systemd@5e1e4c247b5cacfa52eb44f76cfbeac16f30e54a the last working commit

[0] curl -q https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0.patch | git am
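
For reference, a bisect like the one described above can be partly automated with git bisect run, which treats exit code 125 as "skip this commit", 0 as good, and most other codes as bad. The sketch below only shows the verdict logic; the function name and the idea of feeding it a build status and a boot status are assumptions, since the actual bisect in this thread involved rebooting a CI node by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a `git bisect run` script.
# bisect_verdict BUILD_RC BOOT_RC
# Prints the exit code the bisect script should return for this commit.
bisect_verdict() {
    local build_rc=$1 boot_rc=$2
    if [ "$build_rc" -ne 0 ]; then
        echo 125    # does not compile: tell git bisect to skip it
    elif [ "$boot_rc" -ne 0 ]; then
        echo 1      # compiles but the machine does not boot: bad
    else
        echo 0      # compiles and boots: good
    fi
}

# Per-step outline (illustrative only; the patch path is a placeholder):
#   git apply /path/to/compile-workaround.patch   # workaround [0]
#   ... build, record the result in $build_rc ...
#   ... install, reboot, record the result in $boot_rc ...
#   git reset --hard                              # drop the workaround again
#   exit "$(bisect_verdict "$build_rc" "$boot_rc")"
```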

evverx commented 5 years ago

@mrc0mmand thank you a lot for finding the offending commit! By the way, apparently GitHub doesn't send notifications when comments are edited so probably major breakthroughs deserve to be written down separately :-)

The easiest way to unbreak CentOS CI would be to revert that commit. @keszybz @poettering @yuwata what do you think? As usual, I agree that it would be much better to figure out what's going on and fix it, but, in this case, it's not that easy and given how long it took to get access to the testing infrastructure I don't think the question @mrc0mmand asked in https://lists.centos.org/pipermail/ci-users/2018-November/000918.html will be answered anytime soon.

mrc0mmand commented 5 years ago

@evverx I was trying to get a remote shell in the initrd to get logs before pinging everyone, and believe it or not I managed to do it using https://github.com/dracut-crypt-ssh/dracut-crypt-ssh! Right now I have a working shell and access to the journal and kernel ring buffer, so I'll open an issue shortly with as many logs as I can get.

evverx commented 5 years ago

@mrc0mmand that's great! I'm wondering if it would be possible to use it in the script that reboots and connects to the machine so that in the future issues like this would be a little bit easier to debug. It could just dump all the logs somewhere, which is better than nothing I guess and more or less automatic.
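
A minimal sketch of that idea, assuming the initrd exposes an SSH server as in the dracut-crypt-ssh setup mentioned above and that journalctl and dmesg are available there. The function name, the port argument, and the output file names are placeholders; the SSH variable exists only so the command can be swapped out for a dry run.

```shell
#!/usr/bin/env bash
# Hypothetical log collector, not part of the CI scripts.
# dump_initrd_logs HOST PORT OUTDIR
dump_initrd_logs() {
    local host=$1 port=$2 outdir=$3
    local ssh_cmd=${SSH:-ssh}
    mkdir -p "$outdir" || return 1
    # Journal of the current boot as seen from the initrd.
    $ssh_cmd -p "$port" "root@$host" journalctl -b --no-pager > "$outdir/journal.log"
    # Kernel ring buffer.
    $ssh_cmd -p "$port" "root@$host" dmesg > "$outdir/dmesg.log"
}

# Dry run that only records the commands that would be sent:
#   SSH=echo; dump_initrd_logs node.example 2222 ./logs
```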

mrc0mmand commented 5 years ago

I guess we could incorporate it into the CI scripts, as the setup is fairly simple.

mrc0mmand commented 5 years ago

The testsuite almost passes, there's some issue with networking, hopefully it's not something major - https://ci.centos.org/job/systemd-pr-build/3673/console

Debug log from systemd-networkd-tests.py: https://paste.fedoraproject.org/paste/jnhwagD3-saGbeCNzYYk0w

@ssahani could you shed some light into what's happening here?

evverx commented 5 years ago

I suspect test-execute is failing because the regular expression doesn't cover all links that can pop up after a new network namespace is created as was discussed in https://github.com/systemd/systemd/pull/10331#discussion_r223745613.

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

mrc0mmand commented 5 years ago

I suspect test-execute is failing because the regular expression doesn't cover all links

That makes sense, thanks for the reference link.

Regarding systemd-networkd-tests.py, could you try stopping dnsmasq before running the test to see if it helps?

Unfortunately not. I even tried rebooting the machine before the test itself, but it still fails the same way.

evverx commented 5 years ago

@mrc0mmand could you create a new issue about systemd-networkd-tests.py so that it would be possible to track it properly? This issue is already hard to follow if you ask me :-)

mrc0mmand commented 5 years ago

Tracking issues for current CentOS CI blockers: test-execute - https://github.com/systemd/systemd/issues/10934 systemd-networkd-tests.py - https://github.com/systemd/systemd-centos-ci/issues/23

evverx commented 5 years ago

@mrc0mmand I'm wondering if you have figured out what @systemd-centos-ci is. I think it would make sense to turn CentOS CI on as soon as possible to at least make sure that systemd compiles and the rest of the tests still pass.

systemd-networkd-tests.py seems to be always broken (partly because nobody has ever run it automatically) and can be skipped for now and test-execute (or more precisely exec-privatenetwork-yes.service) isn't exactly useful and can be replaced with something that simply won't fail.

mrc0mmand commented 5 years ago

@evverx IMHO @systemd-centos-ci was created to simply provide an API key for the GitHub builder plugin in the CentOS CI jenkins - this allows jenkins to update commit/PR state according to the results of the test run. However, I don't know who has access to this account, so maybe it would be wise if I just used my API key (with limited permissions), so we have everything under our control.

I'll go ahead and temporarily disable mentioned tests so the results are finally usable.

mrc0mmand commented 5 years ago

Ah, I take that back, I can't use my API key as I don't have appropriate permissions in systemd/systemd. Either we could track down the owner of @systemd-centos-ci or just create a new account for such purpose.

evverx commented 5 years ago

I have no problem with a new account. If I understand correctly, it'll just have to be invited as a collaborator and I can do that. But, as far as I know, https://wiki.centos.org/QaWiki/CI/GithubIntegration will no longer be applicable there so it'd be great if you could let me know what the webhook is supposed to look like. Now it just points to https://ci.centos.org/ghprbhook/ with no secret.

evverx commented 5 years ago

@keszybz it would be great if you could help here. Judging by the presence of @systemd-centos-ci I assume there are some reasons unknown to me for it to be here (most likely related to secure access to the repository, but who knows).

evverx commented 5 years ago

In the light of the recent events that shall remain nameless, one can never be too cautious giving write access to the repository :-)

mrc0mmand commented 5 years ago

@evverx Sorry for the delay, wanted to make sure everything works before we start messing with webhooks. I temporarily disabled the problematic parts of the testsuite in 42340c275ab008e7acc4cf5a28a1e79c0e1c3b75 and it's finally passing https://ci.centos.org/job/systemd-pr-build/3676/console.

I guess now we just have to figure out which user to use for the CI, so I can configure it properly on the jenkins side.

evverx commented 5 years ago

@mrc0mmand given that I already bother contributors with LGTM alerts like https://github.com/systemd/systemd/pull/10249#issuecomment-426240185 I think we could use my account as a bearer of bad news (at least temporarily). What do you think?

evverx commented 5 years ago

Though, I'd prefer it if @poettering and @keszybz chimed in here because I'm still not sure whether anyone else is interested in getting it working.

evverx commented 5 years ago

On second thought, it also seems reasonable to me to invite @mrc0mmand as a collaborator to the systemd repository and point CentOS CI to @mrc0mmand's handle. I'm pretty sure it'll make everything much faster, simpler and even a little bit more secure.

evverx commented 5 years ago

And systemd is failing to compile on CentOS again: https://github.com/systemd/systemd/issues/11036.

mrc0mmand commented 5 years ago

I guess CentOS CI could have easily prevented that...

As for the ideas above - using your account, @evverx, is definitely possible, but I don't like the idea of being in charge of someone else's API key. Not that I have any ulterior motives, but it's still a responsibility.

evverx commented 5 years ago

@mrc0mmand I'm completely with you on this one, that's why I suggested inviting you as a collaborator to the systemd repository. I'd do that right now but I'm not sure I can make decisions like that without at least one ACK. Maybe you could ping someone to speed up the process.

evverx commented 5 years ago

So, manually launching a CentOS VM and running ./agent/bootstrap.sh to see whether https://github.com/systemd/systemd/issues/11036 is gone was the last straw. I'll take the liberty of inviting @mrc0mmand as a collaborator to the systemd repository. I think that making the scripts from this repository usable and resurrecting TravisCI is enough for me to be sure that @mrc0mmand can get things done. Plus apparently @mrc0mmand cares about my API key even more than I do :-)

evverx commented 5 years ago

@mrc0mmand let me know when (and probably how) I should turn the webhook on. 6 hours ago https://ci.centos.org/ghprbhook/ responded with 500 so I turned it off again.

mrc0mmand commented 5 years ago

@evverx will do! However, as usual, there is one small catch, because otherwise things would be too easy... In Jenkins, every user has its credentials store, to manage credentials for various plugins, but, for some reason, I can't manage credentials for the plugin we need (GitHub Pull Request Builder). I just asked about that on the #centos-devel channel, so let's hope for a (relatively) fast response.

evverx commented 5 years ago

@mrc0mmand in case the response won't be fast, I'm wondering if it would be possible as a last resort to trigger CentOS CI via Travis CI. I'm fantasizing here and assuming you have everything you need to run Jenkins jobs that can produce reports like https://ci.centos.org/job/systemd-pr-build/3676/console. In theory could we encrypt your credentials and use them to spawn VMs via agent-control.py and then put a link to the report at the end of the Travis build log?

mrc0mmand commented 5 years ago

@evverx I just gave up and wrote a simple wrapper which does the status reporting and it seems to be working. I'll definitely improve it as soon as possible (or ditch it completely if I figure out the jenkins plugin madness), but for now it should finally start delivering results to PRs.

Right now just set up a webhook according to the CentOS CI documentation, i.e.:

1. On your Github Project page, choose 'Settings'
2. Navigate to the 'Webhooks and Services' tab
3. Choose 'Add a webhook'
4. Select 'Let me select individual events' under 'Which events would you like to trigger this webhook?'
5. Unselect 'Push' and select 'Pull Request' and 'Issue Comment'
6. Paste 'https://ci.centos.org/ghprbhook/' (note the trailing slash) in the Payload URL
7. Open the new webhook and verify the ping in the Recent Deliveries section
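
The same webhook can in principle be created through the GitHub REST API (POST /repos/OWNER/REPO/hooks) instead of the settings page. This is only a sketch of that alternative, not what was done in this thread; the function name is made up, the token and repository in the commented call are placeholders, and the content type is left at the API default, which is an assumption about what the Jenkins side expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON body for a pull-request webhook.
# webhook_payload URL
webhook_payload() {
    local url=$1
    cat <<EOF
{
  "name": "web",
  "active": true,
  "events": ["pull_request", "issue_comment"],
  "config": {
    "url": "$url"
  }
}
EOF
}

# Example call (needs a token with admin access to the repository):
#   webhook_payload 'https://ci.centos.org/ghprbhook/' |
#       curl -fsS -X POST \
#            -H "Authorization: token $GITHUB_TOKEN" \
#            -H 'Accept: application/vnd.github+json' \
#            -d @- 'https://api.github.com/repos/OWNER/REPO/hooks'
```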

evverx commented 5 years ago

I didn't select "Issue Comment" because I'm not sure it'd be useful. To judge from https://github.com/systemd/systemd/pull/11045, the hook has started to deliver :-)

mrc0mmand commented 5 years ago

So, thanks to collaboration with Brian we now have a working CentOS CI without workarounds. As the next step I'll sort out artifact exporting, so the logs can be properly investigated in case of failure.

mrc0mmand commented 5 years ago

I was working on the artifact exporting, but stumbled upon an issue with permissions (which got fixed by Brian). In the meantime I set up a CI for this repository, so we don't have to manually test every change, see https://github.com/systemd/systemd-centos-ci/pull/24 and https://github.com/systemd/systemd-centos-ci/pull/25.

Hopefully the artifact exporting should be up tomorrow.

mrc0mmand commented 5 years ago

Quick update:

- artifacts are now stored directly in Jenkins using the internal artifact machinery (https://github.com/systemd/systemd-centos-ci/pull/27)
- the integration test suite under KVM is almost ready to be deployed, one test keeps failing (https://github.com/systemd/systemd/issues/11173)
- we got another two Jenkins executors, which should help during review periods (https://bugs.centos.org/view.php?id=15579)
- in some cases the machine won't boot after a reboot, needs further investigation with dracut crypt SSH

mrc0mmand commented 5 years ago

I'd say the main goal of this issue was successfully achieved - the CentOS CI is working and delivering results. I'm going to close this issue and move any outstanding issues to a new one, to keep things easier to follow.

systemd / systemd-centos-ci

CentOS CI "get well" plan #18