quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.8k stars 2.68k forks source link

Possible race condition in build process/bootstrap package #41767

Open natechols opened 4 months ago

natechols commented 4 months ago

Describe the bug

Originally reported as a Keycloak bug: https://github.com/keycloak/keycloak/issues/29758 Keycloak 24.0.5 comes with pre-packaged with Quarkus 3.8.4. On our systems, in the initial build step when Keycloak starts, it chokes on a bit of NFS detritus and fails: ERROR: io.quarkus.builder.BuildException: Build failure: Build failed due to errors [error]: Build step io.quarkus.deployment.pkg.steps.JarResultBuildStep#buildRunnerJar threw an exception: java.lang.IllegalStateException: java.nio.file.FileSystemException: /nfs/home/nat/keycloak/keycloak-24.0.4/lib/quarkus/.nfs804a7a64040362c3000c81f2: Device or resource busy

The detritus files do not remain around after the JVM exits - they're definitely created by the Keycloak app. The apparent failure point is here: https://github.com/quarkusio/quarkus/blob/3.8.4/independent-projects/bootstrap/app-model/src/main/java/io/quarkus/bootstrap/util/IoUtils.java#L109-L117

I tried debugging the problem with a modified build of quarkus-bootstrap-app-model-3.8.4.jar, and determined that the detritus files do not appear until after Files.delete(p) is called. My suspicion is that these calls inside a DirectoryStream iteration loop result in a race condition on some NFS volumes (i.e. the directory listing is being updated at the same time as it's being read). I was able to fix it by changing a single line to avoid these detritus files:

diff --git a/independent-projects/bootstrap/app-model/src/main/java/io/quarkus/bootstrap/util/IoUtils.java b/independent-projects/bootstrap/app-model/src/main/java/io/quarkus/bootstrap/util/IoUtils.java
index ba28f56..7a77f2c 100644
--- a/independent-projects/bootstrap/app-model/src/main/java/io/quarkus/bootstrap/util/IoUtils.java
+++ b/independent-projects/bootstrap/app-model/src/main/java/io/quarkus/bootstrap/util/IoUtils.java
@@ -110,7 +110,7 @@ public static void createOrEmptyDir(Path dir) throws IOException {
             for (Path p : stream) {
                 if (Files.isDirectory(p)) {
                     recursiveDelete(p);
-                } else {
+                } else if (!p.getFileName().toString().startsWith(".nfs")) {
                     Files.delete(p);
                 }
             }

However I am not a Java developer under normal circumstances so I don't have strong feelings about what the correct approach is here. (I'm happy to submit a pull request with my change however.)

Expected behavior

Build is able to proceed without NFS conflicts.

Actual behavior

ERROR: Failed to run 'build' command.
ERROR: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
        [error]: Build step io.quarkus.deployment.pkg.steps.JarResultBuildStep#buildRunnerJar threw an exception: java.lang.IllegalStateException: java.nio.file.FileSystemException: /nfs/home/nat/keycloak-24.0.5/lib/quarkus/.nfs804a8212004d86bb000af9da: Device or resource busy
        at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:858)
        at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
        at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
        at java.base/java.lang.Thread.run(Thread.java:842)
        at org.jboss.threads.JBossThread.run(JBossThread.java:501)
Caused by: java.nio.file.FileSystemException: /nfs/home/nat/keycloak-24.0.5/lib/quarkus/.nfs804a8212004d86bb000af9da: Device or resource busy
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:248)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at io.quarkus.bootstrap.util.IoUtils.createOrEmptyDir(IoUtils.java:114)
        at io.quarkus.deployment.pkg.steps.JarResultBuildStep.buildThinJar(JarResultBuildStep.java:601)
        at io.quarkus.deployment.pkg.steps.JarResultBuildStep.buildRunnerJar(JarResultBuildStep.java:224)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:849)
        ... 6 more

ERROR: Build failure: Build failed due to errors
        [error]: Build step io.quarkus.deployment.pkg.steps.JarResultBuildStep#buildRunnerJar threw an exception: java.lang.IllegalStateException: java.nio.file.FileSystemException: /nfs/home/nat/keycloak-24.0.5/lib/quarkus/.nfs804a8212004d86bb000af9da: Device or resource busy
        at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:858)
        at io.quarkus.builder.BuildContext.run(BuildContext.java:256)
        at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
        at java.base/java.lang.Thread.run(Thread.java:842)
        at org.jboss.threads.JBossThread.run(JBossThread.java:501)
Caused by: java.nio.file.FileSystemException: /nfs/home/nat/keycloak-24.0.5/lib/quarkus/.nfs804a8212004d86bb000af9da: Device or resource busy
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:248)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at io.quarkus.bootstrap.util.IoUtils.createOrEmptyDir(IoUtils.java:114)
        at io.quarkus.deployment.pkg.steps.JarResultBuildStep.buildThinJar(JarResultBuildStep.java:601)
        at io.quarkus.deployment.pkg.steps.JarResultBuildStep.buildRunnerJar(JarResultBuildStep.java:224)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at io.quarkus.deployment.ExtensionLoader$3.execute(ExtensionLoader.java:849)
        ... 6 more

How to Reproduce?

This was trivially reproducible for me by downloading https://github.com/keycloak/keycloak/releases/download/24.0.5/keycloak-24.0.5.tar.gz and running ./bin/kc.sh start-dev with the Oracle JDK v17 in my environment. Considering that nobody else has reported this error it seems very specific to my NFS environment, but we don't usually encounter problems like this.

Output of uname -a or ver

Linux mp1505-sge66 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Output of java -version

java version "17.0.10" 2024-01-16 LTS Java(TM) SE Runtime Environment (build 17.0.10+11-LTS-240) Java HotSpot(TM) 64-Bit Server VM (build 17.0.10+11-LTS-240, mixed mode, sharing)

Quarkus version or git rev

3.8.4

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

natechols commented 4 months ago

Possibly related to https://github.com/quarkusio/quarkus/issues/39210, or at least a similar failure mode?

quarkus-bot[bot] commented 4 months ago

/cc @Sanne (core), @aloubyansky (core), @gsmet (core), @radcortez (core), @stuartwdouglas (core)

dmlloyd commented 4 months ago

I don't have an NFS environment, but could you possibly see if #41787 fixes the issue?

natechols commented 4 months ago

I tried building both the latest main and dmlloyd:ioutils the same way I built 3.8.4 and immediately got stuck here:

[ERROR] Failed to execute goal net.revelc.code.formatter:formatter-maven-plugin:2.24.1:format (default) on project quarkus-bootstrap-parent: Execution default of goal net.revelc.code.formatter:formatter-maven-plugin:2.24.1:format failed: Plugin net.revelc.code.formatter:formatter-maven-plugin:2.24.1 or one of its dependencies could not be resolved:
[ERROR]
[ERROR] : The following artifacts could not be resolved: io.quarkus:quarkus-ide-config:jar:999-SNAPSHOT (absent): io.quarkus:quarkus-ide-config:jar:999-SNAPSHOT was not found in https://oss.sonatype.org/content/repositories/snapshots during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced

Is Java 17 getting too old for this?

gastaldi commented 4 months ago

@natechols Are you building from the root directory? I use mvn -Dquickly -T2C and it works. I'm building with Eclipse Adoptium 17.0.11

natechols commented 4 months ago

@gastaldi Thanks, adding -T2C got it working - what does this flag mean?

@dmlloyd I can start Keycloak with quarkus-bootstrap-app-model-3.8.4.jar replaced by the equivalent jarfile built from your branch, but I'm now having trouble reproducing the original failure mode (although it was first encountered many weeks ago)... stay tuned.

Sanne commented 4 months ago

I've worked with Java applications needing precise concurrency control on the filesystem in the past, and after fighting NFS peculiarities for months it was decided it was safer to not support it for such operations. It could be done of course, but it quickly escalates as an anti-economical choice, especially as it would require extensive ongoing testing.

If one really wants it fixed, never reuse a filename; especially don't try to create a file with the same name as samething you have deleted previously or otherwise rely on it maintaining order of operations. A lot depends on how exactly it's configured though and I can't say I remember the details, it's been a long time.