grml / grml-debootstrap

wrapper around debootstrap

Github Actions - Workflow test-build : 10 errors and 39 warnings #278

Closed jkirk closed 3 weeks ago

jkirk commented 3 months ago

We get 10 errors and 39 warnings in the Github Workflow test-build:

The 10 errors come from:

test-debian (trixie, bullseye, mmdebstrap)
Process completed with exit code 1.
test-debian (trixie, stretch)
Process completed with exit code 1.
test-debian (trixie, buster)
Process completed with exit code 1.
test-debian (unstable, stretch)
Process completed with exit code 1.
test-debian (trixie, bullseye)
Process completed with exit code 1.
test-debian (unstable, bullseye, mmdebstrap)
Process completed with exit code 1.
test-debian (unstable, bullseye)
Process completed with exit code 1.
test-debian (trixie, buster, mmdebstrap)
Process completed with exit code 1.
test-debian (unstable, buster, mmdebstrap)
Process completed with exit code 1.
test-debian (unstable, buster)
Process completed with exit code 1.

I took a closer look at the logs of test-debian (unstable, buster) and saw the following:

Run ./tests/build-vm-and-test.sh test
+ '[' '!' -d ./tests ']'
+ '[' test == setup ']'
+ RELEASE=buster
+ TARGET=qemu.img
+ DEBOOTSTRAP=
[...]
+ qemu-system-x86_64 -hda /home/runner/work/grml-debootstrap/grml-debootstrap/qemu.img -m 2048 -display none -vnc :0 -virtfs local,path=/tmp/tmp.viVk190zWP,mount_tag=host0,security_model=none,id=host0 -serial pty
+ echo 'No serial console from Qemu found yet [29 retries left]'
No serial console from Qemu found yet [29 retries left]
[...]
+ /home/runner/work/grml-debootstrap/grml-debootstrap/tests/serial-console-connection --tries 180 --port /dev/pts/0 --hostname buster --poweroff 'mount -t 9p -o trans=virtio,version=9p2000.L,rw host0 /mnt && cd /mnt && ./testrunner'
Login failure (try 0): Timeout exceeded.
[...]
Login failure (try 101): Timeout exceeded.
<pexpect.fdpexpect.fdspawn object at 0x7f7222758e50>
searcher: searcher_re:
    0: re.compile(b'buster login:')
Logging into /dev/pts/0 via serial console [try 0]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 1]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 2]
[...]
Logging into /dev/pts/0 via serial console [try 101]
Waiting for login prompt...
Login failure (try 102): Timeout exceeded.
<pexpect.fdpexpect.fdspawn object at 0x7f72227591b0>
searcher: searcher_re:
    0: re.compile(b'buster login:')
[...]
Logging into /dev/pts/0 via serial console [try 178]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 179]
Waiting for login prompt...
Error: Process completed with exit code 1.

I checked the logs of the successful run of test-debian (unstable, bookworm) for comparison:

Run ./tests/build-vm-and-test.sh test
+ '[' '!' -d ./tests ']'
+ '[' test == setup ']'
+ RELEASE=bookworm
+ TARGET=qemu.img
+ DEBOOTSTRAP=
[...]
+ qemu-system-x86_64 -hda /home/runner/work/grml-debootstrap/grml-debootstrap/qemu.img -m 2048 -display none -vnc :0 -virtfs local,path=/tmp/tmp.mnz6K2BXpF,mount_tag=host0,security_model=none,id=host0 -serial pty
+ '[' 30 -gt 0 ']'
+ (( timeout-- ))
+ grep -q 'char device redirected to ' qemu.log
+ echo 'No serial console from Qemu found yet [29 retries left]'
No serial console from Qemu found yet [29 retries left]
[...]
+ /home/runner/work/grml-debootstrap/grml-debootstrap/tests/serial-console-connection --tries 180 --port /dev/pts/0 --hostname bookworm --poweroff 'mount -t 9p -o trans=virtio,version=9p2000.L,rw host0 /mnt && cd /mnt && ./testrunner'
Login failure (try 0): Timeout exceeded.
<pexpect.fdpexpect.fdspawn object at 0x7fe70fab0520>
searcher: searcher_re:
    0: re.compile(b'bookworm login:')
Login failure (try 1): Timeout exceeded.
<pexpect.fdpexpect.fdspawn object at 0x7fe70fab0130>
searcher: searcher_re:
    0: re.compile(b'bookworm login:')
Login failure (try 2): Timeout exceeded.
<pexpect.fdpexpect.fdspawn object at 0x7fe70fab08b0>
searcher: searcher_re:
    0: re.compile(b'bookworm login:')
Logging into /dev/pts/0 via serial console [try 0]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 1]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 2]
Waiting for login prompt...
Logging into /dev/pts/0 via serial console [try 3]
Waiting for login prompt...
Logging in...
>> root
>> grml
Waiting for shell prompt...
>> 
<< b'\r\n'
<< b'\x1b[?2004l\r\x1b[?2004hroot@bookworm:~# '
Running commands...
[...]
+ EXIT_CODE=0
+ exit 0

It seems that there is a problem with the QEMU image (with the given HOST / RELEASE / DEBOOTSTRAP combination). I am not familiar with the internals, but we might also be hitting some kind of (GitHub) QEMU build limit (we spawn 34 workflows and QEMU instances). @zeha Do you have an idea what the problem might be?
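For readers following the logs above: the test script polls qemu.log for the "char device redirected to " line that QEMU prints when `-serial pty` allocates a pty, and only then starts serial-console-connection. A minimal Python sketch of that polling step follows. The function names `find_serial_pty` and `wait_for_serial_pty` are hypothetical and are not taken from the repository; the retry count of 30 mirrors the "[29 retries left]" message in the logs, and the one-second delay is an assumption.

```python
import re
import time
from pathlib import Path

# QEMU announces the allocated pty with a line such as
# "char device redirected to /dev/pts/0 (label serial0)".
PTY_RE = re.compile(r"char device redirected to (\S+)")


def find_serial_pty(log_text):
    """Return the pty path announced in the QEMU log text, or None."""
    match = PTY_RE.search(log_text)
    return match.group(1) if match else None


def wait_for_serial_pty(log_path, retries=30, delay=1.0):
    """Poll log_path until QEMU announces its pty or the retries run out.

    Hypothetical helper: retries and delay are assumptions, not values
    confirmed by the test script.
    """
    path = Path(log_path)
    for remaining in range(retries - 1, -1, -1):
        if path.exists():
            pty = find_serial_pty(path.read_text(errors="replace"))
            if pty:
                return pty
        print(f"No serial console from Qemu found yet [{remaining} retries left]")
        time.sleep(delay)
    return None
```

Note that in both the failing and the successful run the pty (/dev/pts/0) was found, so this step is not where the failing jobs diverge; they time out later, while waiting for the login prompt.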

The 39 warnings basically come from actions/download-artifact@v3 + actions/upload-artifact@v3:

Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/download-artifact@v3, actions/upload-artifact@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.

So, we just need to migrate to actions/download-artifact@v4 + actions/upload-artifact@v4.

I can try to test that migration and can come up with a PR if it works.
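As a rough aid for that migration, the workflow files can be scanned for the deprecated pins before and after the change. This is only a sketch: the helper name `find_v3_artifact_actions` is hypothetical, and it assumes the actions are referenced on plain `uses:` lines.

```python
import re

# Matches e.g. "uses: actions/upload-artifact@v3" or "...@v3.1.2".
V3_ARTIFACT_RE = re.compile(
    r"uses:\s*(actions/(?:upload|download)-artifact)@v3\b"
)


def find_v3_artifact_actions(workflow_text):
    """Return (line_number, action_name) for every v3 artifact action pin."""
    hits = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = V3_ARTIFACT_RE.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

It could be run over the text of each file under .github/workflows. One thing to check while migrating: with upload-artifact@v4, artifact names must be unique within a workflow run, so matrix jobs that upload under a shared name may need distinct names.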

mika commented 4 weeks ago

> We get 10 errors and 39 warnings in the Github Workflow test-build:
>
> [...]
>
> I took a closer look at the logs of test-debian (unstable, buster) and saw the following:
>
> [...]
>
> It seems that there is a problem with the QEMU image (with the given HOST / RELEASE / DEBOOTSTRAP combination). I am not familiar with the internals, but we might also be hitting some kind of (GitHub) QEMU build limit (we spawn 34 workflows and QEMU instances). @zeha Do you have an idea what the problem might be?

Exactly, it seems to be failing for Debian testing AKA trixie and unstable, where we should have qemu v1:9.0.2+ds-2.

Would be interesting to run this inside a plain Debian unstable system and try to reproduce it locally.

> The 39 warnings basically come from actions/download-artifact@v3 + actions/upload-artifact@v3:
>
> Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/download-artifact@v3, actions/upload-artifact@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
>
> So, we just need to migrate to actions/download-artifact@v4 + actions/upload-artifact@v4.
>
> I can try to test that migration and can come up with a PR if it works.

I took care of this in https://github.com/grml/grml-debootstrap/pull/281

mika commented 3 weeks ago

I tried to reproduce this issue locally, but didn't manage to do so yet. :-/

I set up a plain Debian/bookworm VM (using debian-installer), and executed:

sudo apt install git docker.io
git clone https://github.com/grml/grml-debootstrap
cd grml-debootstrap
sudo ./tests/docker-build-deb.sh --autobuild 01
sudo ./tests/build-vm-and-test.sh setup
sudo ./tests/build-vm-and-test.sh run
sudo ./tests/build-vm-and-test.sh test

This worked fine and reported exit code 0. I then upgraded the system to Debian/unstable and re-used the qemu.img, which still worked fine. I also recreated the qemu.img once more to make sure I wasn't overlooking anything; it still worked fine.

FTR, qemu-utils versions 1:7.2+dfsg-7+deb12u6 and 1:9.0.2+ds-4 both work fine, as do kernel versions 6.1.0-23-amd64 (bookworm) and 6.10.4-amd64 (unstable). qemu-system-x86 v1:9.0.2+ds-4 is also working for me.

What I had to apply locally to get it working though was:

--- tests/build-vm-and-test.sh
+++ tests/build-vm-and-test.sh
@@ -32,9 +32,9 @@ if [ ! -d ./tests ]; then
 fi

 if [ "$1" == "setup" ]; then
-  [ -x ./tests/goss ] || curl -fsSL https://goss.rocks/install | GOSS_DST="$(pwd)/tests" sh
   sudo apt-get update
-  sudo apt-get -qq -y install qemu-system-x86 kpartx python3-pexpect python3-serial
+  sudo apt-get -qq -y install curl qemu-system-x86 kpartx python3-pexpect python3-serial
+  [ -x ./tests/goss ] || curl -fsSL https://goss.rocks/install | GOSS_DST="$(pwd)/tests" sh
   # TODO: docker.io
   exit 0
 fi

--- tests/serial-console-connection
+++ tests/serial-console-connection
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 import argparse
 import pexpect
[...]

I will provide those changes together with a README.md for the tests usage through another PR, but don't yet see how those could be relevant here.

AFAICS we need to run the CI builds in more verbose mode (and maybe brute force the connection after 10 tries or so). We should also reduce the serial-console-connection --tries 180 invocation to something smaller (like 20 or so), so that failing builds fail sooner; this would also save our GitHub runner minutes, which are limited on the free account.

zeha commented 3 weeks ago

> and also reduce the serial-console-connection --tries 180 invocation to something smaller (like 20 or so),

I guess in serial-console-connection we could implement a simple time limit. If it can't connect after 5 minutes, it can just abort.
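A minimal sketch of such a time limit, assuming the login attempt can be wrapped in a callable that returns True on success. The name `retry_until_deadline` and its parameters are hypothetical and do not reflect what serial-console-connection actually does; the 300-second default simply mirrors the 5 minutes suggested here.

```python
import time


def retry_until_deadline(attempt, time_limit=300.0, pause=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call attempt(try_number) until it succeeds or time_limit seconds pass.

    Returns True as soon as an attempt returns True, and False once the
    deadline has been reached. clock and sleep are injectable for testing.
    """
    deadline = clock() + time_limit
    tries = 0
    while True:
        if attempt(tries):
            return True
        tries += 1
        if clock() >= deadline:
            return False
        sleep(pause)
```

A monotonic clock is used so that wall-clock adjustments on the runner cannot shorten or extend the limit. Compared with a fixed --tries count, the total runtime no longer depends on how long each individual login attempt takes to time out.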

mika commented 3 weeks ago

JFTR (so no one else looks into this): @zeha and I looked into this and we know what the actual issue is; we'll provide improved tests as well as a fix for the actual underlying issue. Thanks @zeha! \o/

zeha commented 3 weeks ago

> and also reduce the serial-console-connection --tries 180 invocation to something smaller (like 20 or so),
>
> I guess in serial-console-connection we could implement a simple time limit. If it can't connect after 5 minutes, it can just abort.

This was implemented in #238.