> * [ ] (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]
@keszybz any thoughts on this one?
Failing tests:
Common error:
```
+ env --unset=UNIFIED_CGROUP_HIERARCHY /root/systemd-centos-ci/systemd/build/systemd-nspawn -U --private-network --register=no --kill-signal=SIGKILL --directory=/var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root /usr/lib/systemd/systemd
Spawning container unprivileged-nspawn-root on /var/tmp/systemd-test.UnjLx7/unprivileged-nspawn-root.
Press ^] three times within 1s to kill container.
Selected user namespace base 970129408 and range 65536.
Failed to fork inner child: Invalid argument
E: nspawn failed with exit code 1
-rw-r-----+ 1 root systemd-journal 8388608 Oct 21 15:07 /var/tmp/systemd-test.UnjLx7/journal/32e86a2de4e543fe8c41793961c76987/system.journal
make: *** [run] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/
--x-- Result of TEST-01-BASIC: 2 --x--
```
This happens even with `user_namespace.enable=1`:
```
[root@host-8-251-180 systemd]# tr ' ' '\n' </proc/cmdline | grep user_namespace
user_namespace.enable=1
[root@host-8-251-180 systemd]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7.x86_64 root=UUID=40735eda-bc43-4610-961f-bc5c0353239a ro console=tty0 console=ttyS0,115200 crashkernel=auto net.ifnames=0 rhgb quiet LANG=en_US.UTF-8 user_namespace.enable=1
```
Workaround/fix:
```
# echo 10000 > /proc/sys/user/max_user_namespaces
```
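For what it's worth, a sysctl drop-in could make the workaround survive reboots; a minimal sketch, assuming the stock sysctl.d mechanism on CentOS 7 (the drop-in file name is arbitrary):

```sh
# Hypothetical drop-in; persists the user namespace limit across reboots
echo 'user.max_user_namespaces = 10000' > /etc/sysctl.d/99-userns.conf
# Apply it immediately without a reboot
sysctl -p /etc/sysctl.d/99-userns.conf
```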
Persisting issue (later fixed by installing the missing dependencies: quota and net-tools):
```
# make -C test/TEST-22-TMPFILES/ setup
...
+ for _x in inst_symlink inst_script inst_binary inst_simple
+ inst_simple ldconfig.real
+ [[ -f ldconfig.real ]]
+ return 1
+ return 1
+ [[ yes = yes ]]
+ dinfo 'Skipping program ldconfig.real as it cannot be found and is' 'flagged to be optional'
+ set +x
I: Skipping program ldconfig.real as it cannot be found and is flagged to be optional
make: *** [setup] Error 1
make: Leaving directory `/root/systemd-centos-ci/systemd/test/TEST-22-TMPFILES'
...
```
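For reference, the dependency fix mentioned above boils down to something like this on CentOS 7 (a sketch, not the exact command from the CI scripts):

```sh
# Install the missing packages the test setup relies on
yum install -y quota net-tools
```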
> - [ ] (?) Use qemu-kvm instead of systemd-nspawn (or both?) [slave/testsuite.sh]
"Both" is worthwhile, because different things are tested in both environments. But reliability is more important than having both, so if just one can be made to work, that's better than having flaky tests.
Notes from the "make the QEMU testsuite work again" session:
* Set the `INITRD` and `KERNEL_IMG` env vars (`/boot/initramfs-$(uname -r).img` and `/boot/vmlinuz-$(uname -r)` respectively)
* The root filesystem referenced in `/etc/fstab` (imo) is xfs for a default CentOS installation. However, the QEMU testsuite uses the ext4 filesystem, which results in a boot failure for the respective virtual machine (`dracut -f --filesystems ext4` to the rescue)
* Create a `/usr/bin/qemu-kvm` symlink (to `/usr/libexec/qemu-kvm`). Maybe there's a nicer way.
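Putting those notes together, the preparation might look roughly like this (a sketch based solely on the notes above; `TEST-01-BASIC` is just an example target):

```sh
# Point the testsuite at the distribution kernel and initrd
export INITRD="/boot/initramfs-$(uname -r).img"
export KERNEL_IMG="/boot/vmlinuz-$(uname -r)"

# Rebuild the initrd with ext4 support, since the testsuite image
# uses ext4 while CentOS defaults to xfs (command taken from the notes)
dracut -f --filesystems ext4

# Make qemu-kvm visible under the name the testsuite expects
ln -sf /usr/libexec/qemu-kvm /usr/bin/qemu-kvm

# Example invocation
make -C test/TEST-01-BASIC/ clean setup run
```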
@evverx With the help of several other people I finally got something which could get things moving again - I'm going to propose this ticket at the CentOS CBS meeting (every Monday, 2 PM UTC in #centos-devel@Freenode) and hopefully it will get us somewhere.
Apparently there was some error in communication, so I didn't receive the previous email. However I finally got the credentials, so we can start breaking things!
(OT: Is there any chat to catch you in (e.g. IRC, Telegram, etc.)? @evverx)
> However I finally got the credentials, so we can start breaking things!
That's great news! Congratulations!
> Is there any chat to catch you in (e.g. IRC, Telegram, etc.)?
I'm afraid it isn't possible to catch me there, but, on the positive side, I usually reply to comments on GitHub relatively fast.
Notes from the "why it doesn't work in CentOS CI infrastructure" session:
* `dracut -f --regenerate-all` with the downstream package

Could it be that you ran into https://github.com/systemd/systemd/issues/10854? There are two PRs that are supposed to fix the issue. Could you try applying one of them to see if it works?
If the failure is caused by https://github.com/systemd/systemd/issues/10854, then please provide any logs or something if possible. Thank you.
Another possibility is https://github.com/systemd/systemd/issues/10754...
Unfortunately, neither mentioned issue seems to be relevant for this case. I did a quick bisect, but the issue occurs all the way down to https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0 - without this commit systemd won't compile; I'll try to work around it tomorrow.
Also, I'll ask whether there's any way to get useful logs from the machine after it dies.
Anyway, in my opinion, the issue is somewhere in the multipath which is used for the root filesystem...
Regarding https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0, I think meson `-Dnetworkd=false` might help to get around it.
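In other words, something along these lines (the build directory name is arbitrary):

```sh
# Configure systemd without systemd-networkd and rebuild
meson build -Dnetworkd=false
ninja -C build
```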
Could you try to boot with `udev.children_max=1`?
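On CentOS 7 that parameter could be appended with grubby; a sketch, not from the original thread:

```sh
# Add udev.children_max=1 to the cmdline of every installed kernel
grubby --update-kernel=ALL --args="udev.children_max=1"
reboot
```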
Thanks a lot for the suggestions, unfortunately neither of them helped. `-Dnetworkd=false` excludes systemd-networkd from the compilation, but sd-netlink still causes issues, and the issue still occurs even with `udev.children_max=1`.
I raised the post-mortem debugging issue on the CentOS CI Users mailing list, so let's see if someone can help.
In the meantime I'll play around with bisect in hopes I'll stumble upon the root cause...
Notes from the "why it doesn't work in CentOS CI infrastructure" session, part 2:
[0]
```
curl -q https://github.com/systemd/systemd/commit/80df8f2518aa07ef3c328e1c634573347e130cf0.patch | git am
```
@mrc0mmand thank you a lot for finding the offending commit! By the way, apparently GitHub doesn't send notifications when comments are edited so probably major breakthroughs deserve to be written down separately :-)
The easiest way to unbreak CentOS CI would be to revert that commit. @keszybz @poettering @yuwata what do you think? As usual, I agree that it would be much better to figure out what's going on and fix it, but, in this case, it's not that easy and given how long it took to get access to the testing infrastructure I don't think the question @mrc0mmand asked in https://lists.centos.org/pipermail/ci-users/2018-November/000918.html will be answered anytime soon.
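For reference, carrying such a revert in the CI scripts would amount to a one-liner on top of a systemd checkout:

```sh
# Revert the offending commit without opening an editor
git revert --no-edit 80df8f2518aa07ef3c328e1c634573347e130cf0
```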
@evverx I was trying to get a remote shell in the initrd to get logs before pinging everyone, and believe it or not, I managed to do it using https://github.com/dracut-crypt-ssh/dracut-crypt-ssh! Right now I have a working shell and access to the journal and kernel ring buffer, so I'll open an issue shortly with as many logs as I can get.
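For anyone trying to reproduce the setup: the exact configuration used here isn't spelled out in the thread, but enabling an extra dracut module generally looks something like the following (the module name comes from the project above; everything else is an assumption to be checked against its documentation):

```sh
# Enable the crypt-ssh dracut module so the initrd runs an SSH server
cat > /etc/dracut.conf.d/crypt-ssh.conf <<'EOF'
add_dracutmodules+=" crypt-ssh "
EOF

# Regenerate the initrd with the module included
dracut -f

# The initrd also needs networking, e.g. rd.neednet=1 ip=dhcp on the kernel cmdline
```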
@mrc0mmand that's great! I'm wondering if it would be possible to use it in the script that reboots and connects to the machine so that in the future issues like this would be a little bit easier to debug. It could just dump all the logs somewhere, which is better than nothing I guess and more or less automatic.
I guess we could incorporate it into the CI scripts, as the setup is fairly simple.
The testsuite almost passes; there's some issue with networking, hopefully nothing major - https://ci.centos.org/job/systemd-pr-build/3673/console
Debug log from systemd-networkd-tests.py: https://paste.fedoraproject.org/paste/jnhwagD3-saGbeCNzYYk0w
@ssahani could you shed some light on what's happening here?
I suspect `test-execute` is failing because the regular expression doesn't cover all links that can pop up after a new network namespace is created, as was discussed in https://github.com/systemd/systemd/pull/10331#discussion_r223745613.
Regarding `systemd-networkd-tests.py`, could you try stopping `dnsmasq` before running the test to see if it helps?
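Presumably that means something like the following (assuming dnsmasq runs as a systemd service under its usual unit name):

```sh
# Stop a potentially interfering dnsmasq instance before running the tests
systemctl stop dnsmasq.service
```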
> I suspect `test-execute` is failing because the regular expression doesn't cover all links

That makes sense, thanks for the reference link.

> Regarding `systemd-networkd-tests.py`, could you try stopping `dnsmasq` before running the test to see if it helps?

Unfortunately not. I even tried rebooting the machine before the test itself, but it still fails the same way.
@mrc0mmand could you create a new issue about `systemd-networkd-tests.py` so that it would be possible to track it properly? This issue is already hard to follow if you ask me :-)
Tracking issues for current CentOS CI blockers:
* `test-execute` - https://github.com/systemd/systemd/issues/10934
* `systemd-networkd-tests.py` - https://github.com/systemd/systemd-centos-ci/issues/23
@mrc0mmand I'm wondering if you have figured out what @systemd-centos-ci is. I think it would make sense to turn CentOS CI on as soon as possible to at least make sure that `systemd` compiles and the rest of the tests still pass.
`systemd-networkd-tests.py` seems to be always broken (partly because nobody has ever run it automatically) and can be skipped for now, and `test-execute` (or more precisely `exec-privatenetwork-yes.service`) isn't exactly useful and can be replaced with something that simply won't fail.
@evverx IMHO @systemd-centos-ci was created to simply provide an API key for the GitHub builder plugin in the CentOS CI jenkins - this allows jenkins to update commit/PR state according to the results of the test run. However, I don't know who has access to this account, so maybe it would be wise if I just used my API key (with limited permissions), so we have everything under our control.
I'll go ahead and temporarily disable the mentioned tests so the results are finally usable.
Ah, I take that back, I can't use my API key as I don't have appropriate permissions in systemd/systemd. Either we could track down the owner of @systemd-centos-ci or just create a new account for that purpose.
I have no problem with a new account. If I understand correctly, it'll just have to be invited as a collaborator and I can do that. But, as far as I know, https://wiki.centos.org/QaWiki/CI/GithubIntegration will no longer be applicable there, so it'd be great if you could let me know what the webhook is supposed to look like. Now it just points to https://ci.centos.org/ghprbhook/ with no secret.
@keszybz it would be great if you could help here. Judging by the presence of @systemd-centos-ci, I assume there are some reasons, unknown to me, for it to be here (most likely related to secure access to the repository, but who knows).
In the light of the recent events that shall remain nameless, one can never be too cautious giving write access to the repository :-)
@evverx Sorry for the delay, wanted to make sure everything works before we start messing with webhooks. I temporarily disabled the problematic parts of the testsuite in 42340c275ab008e7acc4cf5a28a1e79c0e1c3b75 and it's finally passing https://ci.centos.org/job/systemd-pr-build/3676/console.
I guess now we just have to figure out which user to use for the CI, so I can configure it properly on the jenkins side.
@mrc0mmand given that I already bother contributors with LGTM alerts like https://github.com/systemd/systemd/pull/10249#issuecomment-426240185 I think we could use my account as a bearer of bad news (at least temporarily). What do you think?
Though, I'd prefer it if @poettering and @keszybz chimed in here because I'm still not sure whether anyone else is interested in getting it working.
On second thought, it also seems reasonable to me to invite @mrc0mmand as a collaborator to the systemd repository and point CentOS CI to @mrc0mmand's handle. I'm pretty sure it'll make everything much faster, simpler, and even a little bit more secure.
And `systemd` is failing to compile on CentOS again: https://github.com/systemd/systemd/issues/11036.
I guess CentOS CI could have easily prevented that...
As for the ideas above - using your account, @evverx, is definitely possible, but I don't like the idea of being in charge of someone else's API key. Not that I have any ulterior motives, but it's still a responsibility.
@mrc0mmand I'm completely with you on this one; that's why I suggested inviting you as a collaborator to the systemd repository. I'd do that right now, but I'm not sure I can make decisions like that without at least one ACK. Maybe you could ping someone to speed up the process.
So, manually launching a CentOS VM and running `./agent/bootstrap.sh` to see whether https://github.com/systemd/systemd/issues/11036 is gone was the last straw. I'll take the liberty of inviting @mrc0mmand as a collaborator to the systemd repository. I think that making the scripts from this repository usable and resurrecting TravisCI is enough for me to be sure that @mrc0mmand can get things done. Plus apparently @mrc0mmand cares about my API key even more than I do :-)
@mrc0mmand let me know when (and probably how) I should turn the webhook on. 6 hours ago https://ci.centos.org/ghprbhook/ responded with a 500, so I turned it off again.
@evverx will do! However, as usual, there is one small catch, because otherwise things would be too easy... In Jenkins, every user has their own credentials store to manage credentials for various plugins, but, for some reason, I can't manage credentials for the plugin we need (GitHub Pull Request Builder). I just asked about that on the #centos-devel channel, so let's hope for a (relatively) fast response.
@mrc0mmand in case the response isn't fast, I'm wondering if it would be possible, as a last resort, to trigger CentOS CI via Travis CI. I'm fantasizing here and assuming you have everything you need to run Jenkins jobs that can produce reports like https://ci.centos.org/job/systemd-pr-build/3676/console. In theory, could we encrypt your credentials and use them to spawn VMs via `agent-control.py` and then put a link to the report at the end of the Travis build log?
@evverx I just gave up and wrote a simple wrapper which does the status reporting and it seems to be working. I'll definitely improve it as soon as possible (or ditch it completely if I figure out the jenkins plugin madness), but for now it should finally start delivering results to PRs.
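A wrapper like that presumably boils down to a call to GitHub's commit status API; a minimal sketch (the token, SHA, and target URL are placeholders, not details from this thread):

```sh
# Report a build result back to GitHub (hypothetical placeholders)
curl -X POST \
     -H "Authorization: token $GITHUB_TOKEN" \
     -d '{"state": "success", "context": "CentOS CI", "target_url": "https://ci.centos.org/job/systemd-pr-build/lastBuild/console"}' \
     "https://api.github.com/repos/systemd/systemd/statuses/$COMMIT_SHA"
```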
For now, just set up a webhook according to the CentOS CI documentation, i.e.:
1. On your Github Project page, choose 'Settings'
2. Navigate to the 'Webhooks and Services' tab
3. Choose 'Add a webhook'
4. Select 'Let me select individual events' under 'Which events would you like to trigger this webhook?'
5. Unselect 'Push' and select 'Pull Request' and 'Issue Comment'
6. Paste 'https://ci.centos.org/ghprbhook/' (note the trailing slash) in the Payload URL
7. Open the new webhook and verify the ping in the 'Recent Deliveries' section
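For the record, the same webhook could also be created through the GitHub API instead of the web UI; a sketch (the token is a placeholder):

```sh
# Create a webhook that fires on pull request and issue comment events
curl -X POST \
     -H "Authorization: token $GITHUB_TOKEN" \
     -d '{"name": "web", "active": true, "events": ["pull_request", "issue_comment"], "config": {"url": "https://ci.centos.org/ghprbhook/", "content_type": "json"}}' \
     https://api.github.com/repos/systemd/systemd/hooks
```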
I didn't select "Issue Comment" because I'm not sure it'd be useful. To judge from https://github.com/systemd/systemd/pull/11045, the hook has started to deliver :-)
So, thanks to the collaboration with Brian, we now have a working CentOS CI without workarounds. As the next step, I'll sort out artifact exporting, so the logs can be properly investigated in case of failure.
I was working on the artifact exporting but stumbled upon an issue with permissions (which got fixed by Brian). In the meantime, I set up a CI for this repository, so we don't have to manually test every change; see https://github.com/systemd/systemd-centos-ci/pull/24 and https://github.com/systemd/systemd-centos-ci/pull/25.
Hopefully, artifact exporting will be up tomorrow.
Quick update:
I'd say the main goal of this issue was successfully achieved - the CentOS CI is working and delivering results. I'm going to close this issue and move any outstanding issues to a new one, to keep things easier to follow.
The purpose of this issue is to keep track of things which need to be done to make systemd CentOS CI work again.
The following things still need to be done:
* [ ] Add `user_namespace.enable=1` to the kernel cmdline and set `user.max_user_namespaces` > 0
Long term goals:
Notes: