Closed XieJiSS closed 2 years ago
I can confirm that this error does not occur on x86_64 platforms.
Does this indicate that the test case requires valgrind to success?
No, to the contrary. It's a test which is known to not work under valgrind, so my CI calls the valgrind tests with --no-suite fails-valgrind
to skip these.
Indeed t_detects_not_running_in_testbed
is a bit problematic, it fails in emulators. That's why in commit 0cffcea655 I already disabled it for the RPM architectures which are emulated (s390x and arm).
Debian's and Ubuntu's riscv64 builds have been happy for a long time -- I figure they will use QEMU emulation as well instead of actual hardware. But maybe they are just "fast enough" to make that test pass?
There's most definitively a race condition somewhere, I just didn't track it down yet (as it's hard to reproduce locally). I'm happy to do something similar to commit 0cffcea655, is there an arch counterpart for $RPM_ARCH
? (pun not intended..)
Otherwise I could just generally skip the test on RISCv64, as these days pretty much every builder is emulated still.
@martinpitt
Hi, the Arch Linux build system specifies arch
environment variable while building. For riscv64, its value is riscv64
.
However, I think I'd prefer the second solution, i.e. to skip this test on RISC-V platforms, because I've actually attempted to build umockdev on real riscv64gc
boards (SiFive Unmatched board, with 4*CPU, but it's performance is not so good), and this problem still exists.
It is annoyingly difficult to determine the current CPU architecture in Vala (something like #ifdef __riscv
is not passed through).
I tried to run this test case with really little CPU:
sudo systemd-run -p CPUQuota=1% -dt --wait sh -ec 'LD_LIBRARY_PATH=. LD_PRELOAD=libumockdev-preload.so ./test-umockdev-vala -p /umockdev-testbed-vala/detects_running_outside_testbed'
and even tried 0.1%, but it always passes.
So I'll just mark it as brittle test, so that it gets skipped in package builds.
Thanks for your quick response!
@martinpitt By the way, here at PLCT Lab, we are capable for providing SSH access to SiFive Unmatched boards for debug & CI use. If you are willing to, I can apply for a test account, which should be fulfilled in 5 work days.
@XieJiSS : That would be nice actually, to get to the bottom of this! I don't need root privs, but I do need umockdev's build dependencies installed (in Debian/Ubuntu/Fedora etc. one would usually use schroot/mock or a podman container). My public SSH key.
Thanks!
Released as 0.17.6
@martinpitt Hi, I've send you an email with ssh configs needed to access the RISC-V board.
Reopening -- while this does not block package builds any more, I'd still like to investigate what makes the test actually fail.
So far I built umockdev on @XieJiSS 's riscv64 machine, and the test works just fine. Both in default parallel as well as in --num-processes 1
mode. I'll try a bit harder to run it in a loop, with some background activity and/or lowering CPU quota, etc.
I ran the test 1000 times in a row:
for i in `seq 1000`; do LD_LIBRARY_PATH=b LD_PRELOAD=b/libumockdev-preload.so b/test-umockdev-vala -p /umockdev-testbed-vala/detects_running_outside_testbed || break; done
I then ran the same loop 5 times in parallel as well. I stared at the code, and I really can't see a race condition there -- the two processes are properly synchronized with writing the byte through the pipe. It's just simply delivering the wrong result.
I am reasonably sure that this is not a race condition, but a failure that happens under some particular build conditions which are not met by my plain meson build or the Debian/Ubuntu packages. I guess I'll have to litter this thing with printfs and throw it at the Fedora koji builders, which seem to exhibit the same behaviour (at least when I checked last)
The failure is gone from Koji's s390x build, presumably because that architecture gets built on real hardware now. But armhfp is still emulated, and it still fails there.
However, the failure is literally 0 == 1
, but the actual result of "in testbed" gets returned as char '0'
or '1'
, so this is actually the failure of this:
assert_cmpint ((int) Posix.read(pipefds[0], &buf, 1), CompareOperator.EQ, 1);
that smells like a non-blocking read, or the pipe getting closed prematurely? But the manpage clearly says
The O_NONBLOCK and FD_CLOEXEC flags shall be clear on both file descriptors.
I added lots of debugging logs now. Locally, when it works:
# DEBUG: test-umockdev-vala.vala:810: XXX t_detects_not_running_in_testbed: parent pid 9518, waiting for pipe read; flags 0
# DEBUG: test-umockdev-vala.vala:806: XXX t_detects_not_running_in_testbed: child pid 9519, execing myself with --test-outside-testbed 5
** (b/test-umockdev-vala:9518): DEBUG: 10:48:36.540: test-umockdev-vala.vala:810: XXX t_detects_not_running_in_testbed: parent pid 9518, waiting for pipe read; flags 0
** (b/test-umockdev-vala:9519): DEBUG: 10:48:36.540: test-umockdev-vala.vala:806: XXX t_detects_not_running_in_testbed: child pid 9519, execing myself with --test-outside-testbed 5
** (process:9519): DEBUG: 10:48:36.543: test-umockdev-vala.vala:1076: XXX main pid 9519, called with --test-outside-testbed: calling is_test_inside_testbed on fd 5
** (process:9519): DEBUG: 10:48:36.543: test-umockdev-vala.vala:830: XXX is_test_inside_testbed: reply 0, writing to pipe
** (process:9519): DEBUG: 10:48:36.544: test-umockdev-vala.vala:833: XXX is_test_inside_testbed: reply 0, written to pipe successfully
# DEBUG: test-umockdev-vala.vala:816: XXX t_detects_not_running_in_testbed: parent pid 9518, read byte from pipe 48
** (b/test-umockdev-vala:9518): DEBUG: 10:48:36.544: test-umockdev-vala.vala:816: XXX t_detects_not_running_in_testbed: parent pid 9518, read byte from pipe 48
ok 1 /umockdev-testbed-vala/detects_running_outside_testbed
This confirms that the pipe fds have flags "0", i.e. neither CLOEXEC nor NONBLOCK.
On failed koji:
# DEBUG: test-umockdev-vala.vala:810: XXX t_detects_not_running_in_testbed: parent pid 3445, waiting for pipe read; flags 0
# GLib-DEBUG: unsetenv() is not thread-safe and should not be used after threads are created
# DEBUG: test-umockdev-vala.vala:806: XXX t_detects_not_running_in_testbed: child pid 3517, execing myself with --test-outside-testbed 56
Bail out! ERROR:test-umockdev-vala.p/tests/test-umockdev-vala.c:2603:t_detects_not_running_in_testbed: assertion failed ((gint) read (_tmp8_, &buf, (gsize) 1) == 1): (0 == 1)
stderr:
**
ERROR:test-umockdev-vala.p/tests/test-umockdev-vala.c:2603:t_detects_not_running_in_testbed: assertion failed ((gint) read (_tmp8_, &buf, (gsize) 1) == 1): (0 == 1)
So the exec'ing of itself just fails. Again, this is not a race condition -- if I add a sleep, it still works fine:
--- tests/test-umockdev-vala.vala
+++ tests/test-umockdev-vala.vala
@@ -804,6 +804,7 @@ t_detects_not_running_in_testbed ()
GLib.Environment.unset_variable("LD_PRELOAD");
string[] argv = { "--test-outside-testbed", pipefds[1].to_string() };
debug ("XXX t_detects_not_running_in_testbed: child pid %u, execing myself with --test-outside-testbed %i", Posix.getpid(), pipefds[1]);
+ Posix.sleep (1);
Posix.execv("/proc/self/exe", argv);
error ("execv /proc/self/exe failed: %m");
}
So let's validate what /proc/self/exec actually is -- possibly it's QEMU in the emulated case?
Locally:
# DEBUG: test-umockdev-vala.vala:807: XXX t_detects_not_running_in_testbed: child pid 10531, execing self /var/home/martin/upstream/umockdev/b/test-umockdev-vala with --test-outside-testbed 5
and on koji:
# DEBUG: test-umockdev-vala.vala:807: XXX t_detects_not_running_in_testbed: child pid 3517, execing self /builddir/build/BUILD/umockdev-0.17.6/redhat-linux-build/test-umockdev-vala with --test-outside-testbed 55
So this is okay -- it's not QEMU as I suspected.
@XieJiSS : How and where exactly are you building umockdev to trigger the failure? So far there is overwhelming evidence that this only happens in a QEMU emulated build. So far I've been unable to reproduce this on your board with the standard meson b && meson test -C b
.
I added a 30 times/seconds retry loop, and it
does not help -- the execve() does not fail, but it never actually gets to the --test-outside-testbed
evaulation. I added another debug right at the top of main(), and suspiciously the log contains neither the top-level nor the fork/exec'ed line.
But now I finally found a bug, and it seems to fix the problem :tada: See PR #172.
pkgver=0.17.5
I notice that the failing test case is named
umockdev:fails-valgrind / umockdev-vala
. Does this indicate that the test case requires valgrind to success? Currently, valgrind is not usable on RISC-V architecture.Logs attached:
Related full log: