ostreedev / ostree

Operating system and container binary deployment and upgrades
https://ostreedev.github.io/ostree/

tests/test-sysroot.js intermittently failing on s390x #2527

Open smcv opened 2 years ago

smcv commented 2 years ago

The unit test tests/test-sysroot.js seems to be intermittently failing on Debian's s390x port since October (2021.5). It doesn't always fail, and after failing, it consistently succeeds when the build is retried.

This is happening in a transient chroot environment on autobuilders that are not accessible to ordinary Debian developers, so I am unable to get any information about the failed builds beyond what's in the logs.

The failing assertion is this one:

/// TEST: We can delete the deployment, going back to empty
sysroot.write_deployments([], null);

print("OK empty deployments");

assertEquals(deploymentPath.query_exists(null), false);

I have never had any success with taking s390x-specific issues to Debian's s390x architecture porting team (which might in fact not contain any people), but I hear several ostree developers now work for an IBM subsidiary, so perhaps someone there is better-placed than me to know about s390x-specific issues or see whether this is reproducible in a development environment?

We've seen this with gjs 1.68.4 and 1.70.0. Full logs for some recent versions: 2022.1, 2021.6

dbnicholson commented 2 years ago

That's interesting. So, the call to ostree_sysroot_write_deployments succeeds, but it's either not cleaning up the old deployments or g_file_query_exists is lying. Or maybe deploymentPath isn't what's expected?

Pursuing the "g_file_query_exists is lying" angle, it appears that GIO uses either statx or lstat, preferring statx if it was available at build time. Perhaps 2021.5 is when statx started being used in GIO and it's flaky on the s390x builder? One way to cross-check is to use g_file_test, which uses access to test existence. You could try adding this to the test:

diff --git a/tests/test-sysroot.js b/tests/test-sysroot.js
index d4f67ef4..d9a78dc3 100755
--- a/tests/test-sysroot.js
+++ b/tests/test-sysroot.js
@@ -93,6 +93,8 @@ sysroot.write_deployments([], null);

 print("OK empty deployments");

+print("Deployment path: " + deploymentPath.get_path());
+assertEquals(GLib.file_test(deploymentPath.get_path(), GLib.FileTest.EXISTS), false);
 assertEquals(deploymentPath.query_exists(null), false);

 //// Ok, redeploy, then add a new revision upstream and pull it
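The same cross-check idea, asking more than one API whether a path exists and comparing the answers, can be sketched outside the test suite. This is only an illustration under stated assumptions: it uses Node.js's `fs` module rather than gjs/GIO, and the helper names `probeExistence` and `checksAgree` are hypothetical, not part of ostree or its tests.

```javascript
// Sketch only: Node.js fs calls, NOT the gjs/GIO APIs used by tests/test-sysroot.js.
const fs = require("fs");
const path = require("path");

// Return true if fn() succeeds, false if the path is reported missing.
// Any other error (e.g. EACCES) is rethrown so it is not mistaken for "absent".
function succeeds(fn) {
  try {
    fn();
    return true;
  } catch (err) {
    if (err.code === "ENOENT" || err.code === "ENOTDIR") return false;
    throw err;
  }
}

// Check whether the parent directory's listing contains the entry.
function listedInParent(p) {
  try {
    return fs.readdirSync(path.dirname(p)).includes(path.basename(p));
  } catch (err) {
    if (err.code === "ENOENT" || err.code === "ENOTDIR") return false;
    throw err;
  }
}

// Ask three independent questions about the same path.
// Note: lstat does not follow symlinks but access does, so a dangling
// symlink legitimately produces different answers.
function probeExistence(p) {
  return {
    lstat: succeeds(() => fs.lstatSync(p)),
    access: succeeds(() => fs.accessSync(p, fs.constants.F_OK)),
    readdir: listedInParent(p),
  };
}

// True when every probe gave the same answer.
function checksAgree(results) {
  const values = Object.values(results);
  return values.every((v) => v === values[0]);
}
```

If the probes disagree for a path that is not a symlink, that would point at the existence check rather than at the cleanup code; if they all report that the directory is still present, the deletion itself is the more likely suspect.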

And here's a hack to get a little more info about cleaning up deployments:

diff --git a/src/libostree/ostree-sysroot-cleanup.c b/src/libostree/ostree-sysroot-cleanup.c
index 3471cac7..9ca7fcc6 100644
--- a/src/libostree/ostree-sysroot-cleanup.c
+++ b/src/libostree/ostree-sysroot-cleanup.c
@@ -325,8 +325,12 @@ cleanup_old_deployments (OstreeSysroot       *self,
       g_autofree char *deployment_path = ostree_sysroot_get_deployment_dirpath (self, deployment);

       if (g_hash_table_lookup (active_deployment_dirs, deployment_path))
-        continue;
+        {
+          g_print ("Skipping cleanup of active deployment %s\n", deployment_path);
+          continue;
+        }

+      g_print ("Cleaning up deployment %s\n", deployment_path);
       if (!_ostree_sysroot_rmrf_deployment (self, deployment, cancellable, error))
         return FALSE;
     }

smcv commented 2 years ago

Debian is currently rebuilding half the archive to recover from a binutils regression, so I am probably not going to be able to test this until the autobuilders recover, sorry.

Because this is intermittent, I can't know that a successful build is really a success, and because the autobuilders are production infrastructure, I can't just keep hitting rebuild. I'll try doing manual builds on an s390x "porter box" when I get a chance, but there's no guarantee that that will match the autobuilder's behaviour.

> Perhaps 2021.5 is when statx started being used in GIO and it's flaky on the s390x builder?

Use of statx seems to have been new in 2.66.x, and we had several consecutive successful builds of ostree on s390x after 2.66.x was introduced, so I think it's probably not that... but because it's intermittent, I can't be sure.
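To put rough numbers on why a run of successful builds does not rule out an intermittent failure, here is a small sketch. It assumes each build fails independently with the same probability, which may well not hold on real autobuilders, and the failure rates used are hypothetical: the thread does not state one. The function names are illustrative only.

```javascript
// Sketch under an independence assumption; failure rates are hypothetical.

// Probability that `runs` consecutive builds all pass when each build
// fails with probability `failureRate`.
function probAllPass(failureRate, runs) {
  return Math.pow(1 - failureRate, runs);
}

// Smallest number of consecutive passing builds needed before the chance
// of having seen at least one failure (if the bug is present) reaches
// `confidence`.
function runsNeeded(failureRate, confidence) {
  if (!(failureRate > 0 && failureRate < 1)) {
    throw new RangeError("failureRate must be strictly between 0 and 1");
  }
  if (!(confidence > 0 && confidence < 1)) {
    throw new RangeError("confidence must be strictly between 0 and 1");
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}
```

For instance, with a hypothetical failure rate of 1 in 10, `probAllPass(0.1, 5)` is about 0.59, so five clean builds in a row would still be the more likely outcome even with the bug present.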

> That's interesting. So, the call to ostree_sysroot_write_deployments succeeds, but it's either not cleaning up the old deployments or g_file_query_exists is lying. Or maybe deploymentPath isn't what's expected?

In some older builds, like 2021.1-1, we seem to have had other tests failing when they asserted that a directory should not exist, but it did - and those assertions were in shell scripts using test -d, so probably not statx? (But I don't know, maybe bash genuinely does use statx for builtins.)

dbnicholson commented 2 years ago

Oh, I didn't mean to try to debug it right now. I can imagine that s390x debugging is nowhere near the top of your queue. Just that if you do get around to it, it would be helpful to try to narrow down the issue.

nikita-dubrovskii commented 2 years ago

Hi all, I've tried make && make check many times on Fedora 35:

============================================================================
Testsuite summary for libostree 2022.2
TOTAL: 1005
PASS: 962
SKIP: 43
XFAIL: 0
FAIL: 0
XPASS: 0
ERROR: 0
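Repeating a run many times, as described above, can be scripted so that the pass and fail counts are recorded rather than remembered. The following is a minimal sketch using Node.js's `child_process` module; the helper name `repeatRun` is hypothetical, and the command to run is whatever reproduces the build locally.

```javascript
// Sketch only: a generic "run it N times and count" helper for Node.js.
const { spawnSync } = require("child_process");

// Run `command` with `args` sequentially `attempts` times.
// A run counts as passed only if the process exits with status 0;
// a non-zero status, a signal, or a spawn error all count as failed.
function repeatRun(command, args, attempts) {
  let passed = 0;
  let failed = 0;
  for (let i = 0; i < attempts; i++) {
    const result = spawnSync(command, args, { stdio: "ignore" });
    if (result.status === 0) {
      passed++;
    } else {
      failed++;
    }
  }
  return { passed, failed };
}
```

A call such as `repeatRun("make", ["check"], 20)` would run the whole suite twenty times. Note that output is discarded here (`stdio: "ignore"`), so for real debugging the logs of failing runs would need to be kept instead.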

smcv commented 1 year ago

I was unable to reproduce this on the Debian-developer-accessible s390x machine that is meant to be the closest thing there is to being able to access an autobuilder interactively (build + tests succeeded in 2/2 attempts). However, 2022.7 failed in this way on 3/3 attempts on Debian's official s390x autobuilders, so there might be something about Debian's official autobuilder infrastructure that makes this test more likely to fail.

My ability to debug that is extremely limited, because only sysadmins have any sort of interactive access to the autobuilder machines, so this is unlikely to go further without someone else picking this up.