quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

Quarkus boot fails when vertx-cache temporary directory exists #7678

Closed rsvoboda closed 3 years ago

rsvoboda commented 4 years ago

Quarkus boot fails when the vertx-cache temporary directory exists. Running on Windows 10, using a GraalVM 20 (JDK 11 based) native image.

2020-03-08 16:15:06,354 ERROR [io.qua.application] (main) Failed to start application: java.lang.IllegalStateException: 
 Failed to create cache dir: C:\Users\ROSTIS~1\AppData\Local\Temp\\vertx-cache/file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859

First I thought the problem was the mixed Windows and Linux path separators in the file path, but that is not the problem. The problem is that the file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859 directory already exists and the Quarkus native binary fails to boot because of that. This directory was generated during the mvn clean install -Dnative execution but wasn't cleaned up at the end.

The workaround is to remove that temporary directory; after this step the Quarkus .exe app boots without problem and the temporary directory gets cleaned up during shutdown, so subsequent executions run fine. The same file-cache directory gets generated on each run.

A few points:

The third point is the biggest issue from my perspective, as it blocks portability (would C:\Users\ROSTIS~1\AppData\Local\Temp\ be expected on every system the app gets started on?). I tried on another Windows machine and the path is not fully hard-coded: the user's tmp directory is reflected, but the final directory is always file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859. Always using the same path can be risky from a security perspective, since knowledgeable users know where to look beforehand.

Steps to reproduce on sample bootstrap app:

mvn clean install -Dnative
target\getting-started-1.0-SNAPSHOT-runner.exe

(Screenshot: 2020-03-08 16:15:40)

jaikiran commented 4 years ago

but the final directory is always file-cache-9cf306ad-0c1b-4d8f-8468-f6d148da5859

Looking at the code in vert.x[1], it always uses a call to UUID.randomUUID() for the directory name. Do you mean the same UUID is being generated across multiple runs of the build process? What Windows OS is this?

[1] https://github.com/eclipse-vertx/vert.x/blob/master/src/main/java/io/vertx/core/file/impl/FileResolver.java#L325

cescoffier commented 4 years ago

I was wondering the same. While the base dir is fixed, the actual name is generated at runtime.

rsvoboda commented 4 years ago

I'm talking about native image on Windows 10.

In JVM mode a different directory gets used with every run; with the native image binary the directory is always the same (for that one binary).

rsvoboda commented 4 years ago

While the base dir is fixed, the actual name is generated at runtime.

True for JVM mode, not in native mode.

Native mode will make our heads hairless:)

cescoffier commented 4 years ago

Just to be sure, does it only happen on Windows?

gsmet commented 4 years ago

Maybe the boot failure happens only on Windows, but I would bet the hardcoded temp directory is a global issue if the directory name is generated at static init.

cescoffier commented 4 years ago

The directory name is computed at runtime and is not stored in a static field (the FileResolver is created in the VertxImpl constructor). Also, the directory should be cleared on shutdown (at least that's what the code is trying to do...).

rsvoboda commented 4 years ago

I've seen it on Windows only; maybe it's because Windows is stricter when it comes to the filesystem and locks on files.

the directory should be cleared on shutdown

Sure, but that didn't happen with mvn clean install -Dnative: the vertx-cache directory was left there, and target\getting-started-1.0-SNAPSHOT-runner.exe (run after the Maven command) failed because the directory existed.

rsvoboda commented 4 years ago

So I tried on macOS and the tmp folder name is unique on every run of the native app. On macOS I'm using 19.3.1; on the Windows machine I used GraalVM 20, as suggested in the PR for Windows native support by Tomaz.

Three executions of the native binary with Installed features: [cdi, resteasy]:

/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-2554416b-2b60-4a9e-bf62-4de1b02029e4
/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-def0845c-f09f-4888-8869-b7555b488519
/var/folders/nn/w1bsdgrx69n9gch4w4vkrkv40000gn/T/vertx-cache/file-cache-8eae8c90-220d-417b-81a3-b82c8f0de9b4

cescoffier commented 4 years ago

I will have a look tomorrow. I need to set up a Windows 10 VM. We need to test it with GraalVM 19.3 and 20.x.

cescoffier commented 4 years ago

Unfortunately, I'm not able to build a native-image on Windows...

Error C2059: syntax error: '~'

rsvoboda commented 4 years ago

@cescoffier do you have the right tools installed on your Windows box? See https://github.com/quarkusio/quarkus/pull/7663#issue-385067018

jaikiran commented 4 years ago

@rsvoboda I forgot to ask, does this only happen with GraalVM 20.x, or also with 19.3.x? Either way, I still can't see how this exception can happen unless either:

rsvoboda commented 4 years ago

It's my home laptop; Windows is the host OS, no VM. I'm using GraalVM 20; I didn't try 19.3 as Windows support for native-image improved with GraalVM 20.

rsvoboda commented 4 years ago

I tried to use 19.3.1 on the Windows machine, but it can't be used with Quarkus; I'm getting this error:

[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildStep] C:\Program Files\GraalVM\graalvm-ce-java11-19.3.1\bin\native-image.cmd -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=1 -J-Duser.language=cs -J-Dfile.encoding=Cp1250 --initialize-at-build-time= -H:InitialCollectionPolicy=com.oracle.svm.core.genscavenge.CollectionPolicy$BySpaceAndTime -H:+JNI -jar getting-started-1.0-SNAPSHOT-runner.jar -H:FallbackThreshold=0 -H:+ReportExceptionStackTraces -H:-AddAllCharsets -H:-IncludeAllTimeZones -H:EnableURLProtocols=http -H:-UseServiceLoaderFeature -H:+StackTrace getting-started-1.0-SNAPSHOT-runner
Warning: the '=' character in program arguments is not fully supported.
Make sure that command line arguments using it are wrapped in double quotes.
Example:
"--vm.Dfoo=bar"

Error: Unknown arguments: org.jboss.logmanager.LogManager, io.quarkus.vertx.core.runtime.VertxLogDelegateFactory, true, DISABLED, 1, cs, Cp1250, com.oracle.svm.core.genscavenge.CollectionPolicy$BySpaceAndTime, 0, http, getting-started-1.0-SNAPSHOT-runner

rsvoboda commented 4 years ago

The Windows native-image based binary always generates the same UUID.

I filed an issue in GraalVM tracker - https://github.com/oracle/graal/issues/2265
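The pitfall can be illustrated with a plain-Java sketch (class and field names here are mine, not Vert.x's): when native-image initializes a class at build time, its static fields are computed once during the image build and frozen into the image heap, so a "random" UUID stored in such a field is identical on every run of the binary.

```java
import java.util.UUID;

public class BakedUuidDemo {
    // In a GraalVM native image, if this class is initialized at build
    // time, this field is computed once during `native-image` and stored
    // in the image heap -- every run of the binary then sees the same
    // value. On a regular JVM it is fresh on each start.
    static final String BUILD_TIME_STYLE = UUID.randomUUID().toString();

    public static void main(String[] args) {
        // Computed inside main, i.e. at runtime: always fresh,
        // in both JVM and native mode.
        String runtimeStyle = UUID.randomUUID().toString();
        System.out.println("field:   " + BUILD_TIME_STYLE);
        System.out.println("runtime: " + runtimeStyle);
    }
}
```

This is consistent with jaikiran's observation: the Vert.x code does call UUID.randomUUID(), but if that call ends up executing at image build time, the result is captured in the binary.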

karesti commented 3 years ago

@rsvoboda @cescoffier I'm experiencing this issue in both JVM and native mode on OpenShift. Not sure what I'm doing wrong.

karesti commented 3 years ago

Not sure if this is exactly the same issue; I get this:

2021-02-16 22:27:32,212 ERROR [io.qua.run.Application] (main) Failed to start application (with profile prod): java.lang.IllegalStateException: Failed to create cache dir: /tmp/vertx-cache/file-cache-840facac-d89f-4b7b-9980-0f48350ea364

cescoffier commented 3 years ago

@karesti in your case I believe it's because you don't have permission to write to /tmp. Can you check that you have the permissions?

Any way to get a reproducer if that's not the case?
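For anyone debugging this, the failing step can be reproduced in isolation with a small standalone check (names here are illustrative; the real logic lives in Vert.x's FileResolver): try to create the cache directory under java.io.tmpdir and report whether it is writable.

```java
import java.io.File;

public class TmpCheck {
    // Mirrors the failing call: create <tmpdir>/vertx-cache if it does
    // not exist, then report whether the current user can write to it.
    public static boolean cacheDirWritable() {
        File dir = new File(System.getProperty("java.io.tmpdir"), "vertx-cache");
        boolean exists = dir.isDirectory() || dir.mkdirs();
        return exists && dir.canWrite();
    }

    public static void main(String[] args) {
        System.out.println("cache dir writable: " + cacheDirWritable());
    }
}
```

Running this inside the failing pod (e.g. via `oc rsh`) would quickly distinguish a permissions problem from something else.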

karesti commented 3 years ago

@cescoffier oh, OK. I will check this with the team. How can I give those permissions in OpenShift?

karesti commented 3 years ago

@cescoffier I will check with the team so I get it solved if it's a matter of permissions

cescoffier commented 3 years ago

Normally on OpenShift /tmp is writable. Did it change?

karesti commented 3 years ago

So, the problem is that the Dockerfile uses RUN chown 1001 /work \ and in OpenShift the user changes, so that user has no rights on /tmp. I fixed the issue by using this Dockerfile:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.3
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && mkdir /tmp/vertx-cache \
    && chown 1001:root /work \
    && chmod 775 /tmp/vertx-cache

COPY --chown=1001:root target/*-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

karesti commented 3 years ago

@cescoffier I forgot to ping you in my comment above

rsvoboda commented 3 years ago

The default Dockerfile.native is:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.3
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && chown 1001:root /work
COPY --chown=1001:root target/*-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

So it needs to be adjusted?

In other words: is /tmp/vertx-cache something specific just for your use-case, or is it common for users to define it on their own?

karesti commented 3 years ago

@rsvoboda I did not use anything specific or custom; the Quarkus app simply did not start. I used 4 extensions: websockets, infinispan, rest and scheduler.

cescoffier commented 3 years ago

I just tried (with a different set of extensions, but ones that still use Vert.x) on OpenShift on Azure, and no problem.

D'oh, I didn't see the ubi-minimal version; I may have an older version.

rsvoboda commented 3 years ago

Which version of OpenShift was used? Especially interested in the version used by @karesti.

cescoffier commented 3 years ago

Just checked with the very latest ubi:8.3 - no problem. So it might be an OpenShift permission issue.

karesti commented 3 years ago

@rsvoboda I'm using OpenShift Dedicated, version 4.6.8.

cescoffier commented 3 years ago

@rsvoboda is that a setup you can reproduce? I don't have "dedicated", I only have cloud instances.

rsvoboda commented 3 years ago

We were not able to reproduce it; it's as if there were some specific permissions set in that OpenShift cluster. @pjgg was trying to reproduce this issue.

pjgg commented 3 years ago

@karesti I am trying to reproduce your issue without success. Could you try to deploy the following image on your cluster:

quay.io/pjgg/webclient:0.0.1

For example, a valid YAML could be something like this:

apiVersion: v1
kind: Pod
metadata:
  name: security-demo
spec:
  volumes:
  - name: demo
    emptyDir: {}
  containers:
  - name: demo
    image: quay.io/pjgg/webclient:0.0.1
    livenessProbe:
      httpGet:
          path: /q/health/live
          port: 8080
      initialDelaySeconds: 59
      timeoutSeconds: 5
    readinessProbe:
      httpGet:
          path: /q/health/ready
          port: 8080
      initialDelaySeconds: 30
      timeoutSeconds: 5

karesti commented 3 years ago

@pjgg this pod deployment worked! I used the scheduler and websockets extensions (in case those might be the issue).

karesti commented 3 years ago

I will test the deployment in another way

karesti commented 3 years ago

@pjgg So I tested using a deployment config and 2 pods, and it works as well. Might the scheduler or websockets extensions be related?

pjgg commented 3 years ago

I am not sure yet. Could you show me your dependencies (Maven/Gradle)? Just to set up a scenario with the same versions as you.

When you said:

I tested using deployment config and 2 pods and works as well.

Do you mean that you used your application with the default Dockerfile, and it works?

I have tested with a scheduler (Quartz), and it looks OK. Could you test in your environment with this Docker image: quay.io/pjgg/webclient:0.0.1-a

thomas-mc-work commented 3 years ago

Another aspect that might be relevant: I'm running two instances of the same native runner on one machine under separate user accounts. The first one launches normally; the second one throws the same exception as above. The cause is that the first user has created, and thus owns, /tmp/vertx-cache, which makes the second process fail to write into the same path.

Both users have defined the paths to separate temp dirs in the environment, e.g. for one of them:

# env | grep -i tmp
TEMPDIR=/tmp/user/107329
TMPDIR=/tmp/user/107329
TEMP=/tmp/user/107329
TMP=/tmp/user/107329

I guess the /tmp base path is either hard-coded or captured from my development machine (or even from within the GraalVM build container), so reading this value at runtime would be a benefit.

For me, the workaround of passing -Djava.io.tmpdir=$TMPDIR helped.
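The lookup order behind that workaround can be sketched as follows. vertx.cacheDirBase is the real Vert.x system property for the cache base; the class and method names below are illustrative, and this is a simplified sketch of the resolution, not the actual FileResolver code:

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheBase {
    // Simplified resolution order: an explicit "vertx.cacheDirBase"
    // wins; otherwise the base defaults to <java.io.tmpdir>/vertx-cache.
    // This is why -Djava.io.tmpdir=$TMPDIR redirects the cache, and why
    // the default is per-JVM-process config rather than a baked-in path.
    public static Path resolve() {
        String base = System.getProperty("vertx.cacheDirBase",
                System.getProperty("java.io.tmpdir", ".")
                        + File.separator + "vertx-cache");
        return Paths.get(base);
    }

    public static void main(String[] args) {
        System.out.println(resolve());
    }
}
```

Setting `-Dvertx.cacheDirBase=/some/other/dir` at launch is therefore another way to move the cache without touching java.io.tmpdir.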

cescoffier commented 3 years ago

The vertx-cache directory should be world-writable, and it might not be the case. I need to investigate this.

cescoffier commented 3 years ago

So, the vertx-cache directory is configured by Quarkus but created by Vert.x. A workaround could be to generate a unique cache directory at each startup.

@rsvoboda any concern with such an approach?

rsvoboda commented 3 years ago

I spoke with @pjgg about this. In general we are OK with such a change.

The only concern we see is around @karesti's use-case, where Katia worked with the known path /tmp/vertx-cache. On the other hand, we were not able to reproduce her problem. If the unique cache is based on a timestamp, we could create a vertx-cache/4163465465465 temporary folder instead of vertx-cache-4163465465465, so karesti's workaround could still work.

About the unique cache generated at each startup - also in NATIVE mode?

cescoffier commented 3 years ago

Yes, it's going to be generated on every application start, so even in native mode, multiple runs will generate multiple directories.
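A sketch of what such a per-start unique directory could look like, following the vertx-cache/&lt;unique&gt; layout suggested above (illustrative only, not the actual Quarkus/Vert.x implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class UniqueCacheDir {
    // Keep the shared vertx-cache base (so setups that pre-create and
    // chmod /tmp/vertx-cache, like karesti's Dockerfile, still work),
    // but put a unique leaf directory under it on every process start,
    // so a leftover directory from a previous run or another user can
    // never collide.
    public static Path create() throws IOException {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "vertx-cache");
        Files.createDirectories(base);                    // shared, survives runs
        return Files.createTempDirectory(base, "cache-"); // unique per start
    }

    public static void main(String[] args) throws IOException {
        System.out.println(create());
    }
}
```

Files.createTempDirectory generates a fresh suffix on every call, so even a native binary (where a build-time UUID would be frozen) gets a new directory per start.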

juangon commented 2 years ago

FYI @rsvoboda I am able to reproduce this error on Quarkus 2.7.X in JVM mode (didn't try native); it's not reproducible in Quarkus 2.6.3.Final.

juangon commented 2 years ago

Oh, BTW, I'm not testing on Windows. It happens in a Kubernetes cluster, using a new PVC mounted at the /tmp directory. None of this fails in Quarkus 2.6.3.Final.

rsvoboda commented 2 years ago

Please file a new issue with steps to reproduce; this sounds like a new variant of the problem.

juangon commented 2 years ago

Please file a new issue with steps to reproduce; this sounds like a new variant of the problem.

OK, so it seems I had to create an init-container in order to set the permissions properly. Thanks very much!

rsvoboda commented 2 years ago

@juangon Can you provide a snippet of the changes you made so that other people can learn from your fix? Assuming you have a custom YAML descriptor and are not relying on the Kubernetes YAML generated by the Quarkus extension.

juangon commented 2 years ago

Yes, sure. I don't have a custom YAML descriptor, but I'll write it here anyway.

Here is the PVC for a microservice:

quarkus.kubernetes.pvc-volumes.backend-persistent-storage.claim-name=pv-claim
quarkus.kubernetes.mounts.backend-persistent-storage.path=/tmp

And here the init-containers

quarkus.kubernetes.init-containers.set-tmp-permissions.image=registry.access.redhat.com/ubi8/ubi:latest
quarkus.kubernetes.init-containers.set-tmp-permissions.mounts.backend-persistent-storage.path=/tmp
quarkus.kubernetes.init-containers.set-tmp-permissions.command=chown,-R,185,/tmp

The user is 185, as used in the latest Dockerfile.

rsvoboda commented 2 years ago

Thanks!