Closed: ryanemerson closed this issue 7 months ago.
@ryanemerson is this happening with `quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21` as well? Also, what Quarkus version are you using?
cc @Karm
We're using Quarkus 3.6.0.
I tried the Mandrel-based images and the build time is now down to ~2.5 hours, so there's some improvement.
Hi @ryanemerson, could you please also provide a bit more information about your setup?
E.g. what are the hardware specs of the amd64 machine you are using for the builds, and what are those of the arm64 machine you are comparing with?
Upon further investigation, it seems this is specifically caused by the arm64 builder image, as only building for amd64 results in the build time coming back down to ~ 16 mins.
Did you also try building only for arm64? What are the results?
Hardware is an m4.xlarge AWS instance with 4 vCPUs (2.4 GHz Intel Xeon E5-2676 v3 processor). We're using QEMU emulation so that we can build the image with `docker buildx`.
Exact workflow:

```shell
docker run --rm --privileged quay.io/infinispan-test/binfmt:qemu-v8.0.4-33 --install arm64
docker buildx create --name multiarch --use
docker buildx build --platform linux/arm64 -t ${imageFQN} target-${name}/image
```
> Did you also try building only for arm64? What are the results?
Yes. An arm64-only build takes 1h 46min with `quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21`, whereas our old build with `quay.io/quarkus/ubi-quarkus-native-image:22.3-java17` takes 10min 6s.
Thank you for the extra information @ryanemerson. We will try to replicate and investigate the issue.
As a first step, I tried reproducing the issue on a local AMD64 machine (using `podman run --platform linux/arm64`) and I get the following results:
| Mandrel Version | Total build time on QEMU arm64 | Total build time natively on amd64 |
|---|---|---|
| 22.3.5.0 (Java 17) | 6m 44s | 26.2s |
| 23.0.3.0 (Java 17) | 8m 56s | 21.5s |
| 23.1.2.0 (Java 21) | 8m 9s | 20.1s |
Some interesting differences between the arm64 runs:
I will investigate further...
After some more experimentation it looks like the slowdown is related to the initial heap size. Setting `Xms` to 7g (i.e., passing `-Dquarkus.native.additional-build-args=-J-Xms7g`) I was able to get 23.1 to build in a time similar to 22.3 (6m 15s).
The issue seems related to https://github.com/oracle/graal/pull/6432
@ryanemerson could you please give this a try, while I try to better understand why that's happening? Please use as the initial heap size something a bit higher than the Peak RSS you get when building with Mandrel 22.3.
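If it helps, the same flag can also be set persistently in the project's `application.properties` rather than on the command line (a sketch; the property is the standard Quarkus one, and the 7g value is just the figure suggested above):

```properties
# Raise the initial heap of the JVM that runs native-image. Pick a value a bit
# higher than the Peak RSS reported by a Mandrel 22.3 build of the same app.
quarkus.native.additional-build-args=-J-Xms7g
```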
Thanks for looking into this.
The most recent builder images have significantly reduced the latencies we were experiencing when I first created this issue, however the total build time is still almost double what we experienced with `quay.io/quarkus/ubi-quarkus-native-image:22.3-java17`.
Adding `-Dquarkus.native.additional-build-args=-J-Xms7g` only reduced the build time slightly: 49min 43s vs 52min 46s.
> Adding `-Dquarkus.native.additional-build-args=-J-Xms7g` only reduced the build time slightly: 49min 43s vs 52min 46s.
What is the Peak RSS reported when building with 22.3-java17 without using this option?
> while I try to better understand why that's happening
So the actual issue is that https://github.com/oracle/graal/pull/6432 sets `GCTimeRatio` to 9 instead of the default 99, which seems to lead the JVM to perform more heap-size adaptations, which in turn appears to cause the slowdown you are observing.
@ryanemerson may I ask you to give this a go with `-Dquarkus.native.additional-build-args=-J-XX:GCTimeRatio=99` as well and report back? Since the difference I observe is only ~1m on my machine, I would really like to know if this resolves the issue in your case, which seems to have a much bigger impact.
cc @fniephaus
> What is the Peak RSS reported when building with 22.3-java17 without using this option?
Peak RSS: 7.21GB
> @ryanemerson may I ask you to give this a go with `-Dquarkus.native.additional-build-args=-J-XX:GCTimeRatio=99` as well and report back?
Sure np.
> So the actual issue is that https://github.com/oracle/graal/pull/6432 sets `GCTimeRatio` to 9 instead of the default 99, which seems to lead the JVM to perform more heap-size adaptations, which in turn appears to cause the slowdown you are observing.
`GCTimeRatio=99` is mostly for latency and leads the build process to quickly use as much memory as it is allowed to (bigger peak RSS). With `GCTimeRatio=9`, we tweak the GC more towards throughput, allowing it to spend more time cleaning up while not allocating more memory (lower peak RSS).
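For intuition, HotSpot's `GCTimeRatio=N` asks the GC to spend at most roughly 1/(1+N) of total time collecting, so the two values in question translate as follows (a small illustrative sketch, not part of the actual build):

```shell
# Illustrate the HotSpot GCTimeRatio goal: at most 1/(1+N) of time in GC.
# N=99 targets ~1% GC time (grow the heap eagerly), N=9 targets ~10%
# (collect more, keep the heap smaller).
for n in 9 99; do
  awk -v n="$n" 'BEGIN { printf "GCTimeRatio=%d -> GC time target <= %.1f%%\n", n, 100 / (1 + n) }'
done
```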
I haven't seen any actual build output in this issue, but "very slow" sounds to me like the app simply requires more memory to be built with GraalVM. How much memory/CPU is the build process allowed to use? A build time of ~50min could mean that 7GB is simply not enough.
Setting `GCTimeRatio=99` made no noticeable difference to the build time.
> I haven't seen any actual build output in this issue, but "very slow" sounds to me like the app simply requires more memory to be built with GraalVM. How much memory/CPU is the build process allowed to use? A build time of ~50min could mean that 7GB is simply not enough.
We had `Xmx` set to 8g both before and after the increased latency was observed between the two different builder images. I increased this to 16g and the build time remains the same.
@ryanemerson can you share the output of native-image when building with 23.1 (`jdk-21` tag) and when building with 22.3 (`22.3-java17` tag)?
And just to make sure: are you using Quarkus 3.6.0 in both cases? If not, you might be hitting https://github.com/quarkusio/quarkus/issues/38683 (although if that were the case it shouldn't show up only on aarch64).
A reproducer could also be handy if you can share.
> I increased this to 16g and the build time remains the same.
Are you sure that you also bumped the `Xmx` value? If you did, it seems memory is not the bottleneck; maybe it's CPU. You could try increasing the number of cores available to your container.
I've created a standalone reproducer to simplify things: https://github.com/ryanemerson/quarkus-arm64-slow-reproducer
Here's the output for building with the two different builder images, on the same machine using the same args with Quarkus 3.7.3:
You can see that the total build time for `22.3-java17` was only 04:49 min, whereas `jdk-21` was 36:36 min.
Thanks for the reproducer and the extra info @ryanemerson
At first sight it still looks like a GC-related issue to me:
- The `jdk-21` image is trying to do more work.
- The `jdk-21` image reports that it's allowed to use up to 23.5GB of memory, yet the peak RSS is only 1.88GB, in contrast to 5.93GB when using `22.3-java17`.

I will try the reproducer and have another look next week.
@ryanemerson thanks again for the reproducer and output results. I was finally able to see what's wrong.
After a closer inspection of the logs I noticed that the build is actually running on `x86_64` instead of `aarch64` when using `22.3-java17`.
This led me to have a second look at your Dockerfiles and the images they use.
`Dockerfile.22.3-java17` uses `quay.io/quarkus/ubi-quarkus-native-image:22.3-java17`, which is not a multi-arch image (and is also no longer supported). Multi-arch images were introduced in https://github.com/quarkusio/quarkus-images/pull/200 and the image naming changed. As a result, if you want to use 22.3-java17 you should use one of the following images:

- `quay.io/quarkus/ubi-quarkus-mandrel-builder-image:22.3-java17` for Mandrel
- `quay.io/quarkus/ubi-quarkus-graalvmce-builder-image:22.3-java17` for GraalVM CE

At this point it might be worth mentioning that the latest builder image for Java 17 is tagged with `jdk-17`, so please use that instead of `22.3-java17` unless you have a good reason not to.
Applying the following patch to the reproducer I am getting more consistent results (the `jdk-17` image is still slower by ~1m due to the reason explained in https://github.com/quarkusio/quarkus-images/issues/260#issuecomment-1944060767). The bad news is that the correct results are the slow ones (and emulation should be the one to blame here).
```diff
diff --git a/src/main/docker/Dockerfile.22.3-java17 b/src/main/docker/Dockerfile.22.3-java17
index d108905..b6c718f 100644
--- a/src/main/docker/Dockerfile.22.3-java17
+++ b/src/main/docker/Dockerfile.22.3-java17
@@ -1,4 +1,4 @@
-FROM quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 as build
+FROM quay.io/quarkus/ubi-quarkus-mandrel-builder-image:22.3-java17 as build
 COPY --chown=quarkus:quarkus mvnw /code/mvnw
 COPY --chown=quarkus:quarkus .mvn /code/.mvn
 COPY --chown=quarkus:quarkus pom.xml /code/
diff --git a/src/main/docker/Dockerfile.jdk-21 b/src/main/docker/Dockerfile.jdk-21
index b5ab2a8..b7c63a9 100644
--- a/src/main/docker/Dockerfile.jdk-21
+++ b/src/main/docker/Dockerfile.jdk-21
@@ -1,4 +1,4 @@
-FROM quay.io/quarkus/ubi-quarkus-graalvmce-builder-image:jdk-21 as build
+FROM quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21 as build
 COPY --chown=quarkus:quarkus mvnw /code/mvnw
 COPY --chown=quarkus:quarkus .mvn /code/.mvn
 COPY --chown=quarkus:quarkus pom.xml /code/
```
As a follow-up question, I am curious whether you actually test the images you build with `22.3-java17` on aarch64 architectures? If you do, how come the tests don't fail? Are they silently running on x86 as well?
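One quick way to catch this class of problem is to ask the container which architecture it actually executes as (a sketch; the image name is illustrative and it assumes the image's entrypoint can be overridden, as with the Quarkus builder images):

```shell
# A multi-arch image pulled with --platform linux/arm64 should report
# "aarch64" here; a single-arch amd64-only image will silently report
# "x86_64" instead, meaning the build is not running on the arch you asked for.
docker run --rm --platform linux/arm64 --entrypoint uname \
  quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21 -m
```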
I am closing this issue as it's actually not an issue with the images themselves. For the record I am adding the build outputs I get from the correct images below:
> After a closer inspection of the logs I noticed that the build is actually running on `x86_64` instead of `aarch64` when using `22.3-java17`.
>
> This led me to have a second look at your Dockerfiles and the images they use. `Dockerfile.22.3-java17` uses `quay.io/quarkus/ubi-quarkus-native-image:22.3-java17`, which is not a multi-arch image (and is also no longer supported). Multi-arch images were introduced in #200 and the image naming changed
Well I feel dumb :sweat_smile: We don't have any automated testing for our arm images; they're provided on a best-effort basis for community users, which is why this wasn't detected. It seems nobody is actually using these images.
Thanks for looking into this @zakkak, much appreciated.
Is there still an issue on the Native Image side? https://github.com/quarkusio/quarkus-images/issues/260#issuecomment-1944060767 is somewhat expected and https://github.com/quarkusio/quarkus-images/issues/260#issuecomment-1960107015 shows the result: while the build takes ~1min longer on JDK 17, it only needs 2.38GB as opposed to 6.60GB of memory, even on a machine with 75.6% of 30.60GB of memory available.
> Is there still an issue on the Native Image side?
I think not.
> #260 (comment) is somewhat expected and #260 (comment) shows the result: while the build takes ~1min longer on JDK 17, it only needs 2.38GB as opposed to 6.60GB of memory, even on a machine with 75.6% of 30.60GB of memory available.
True, but whether that's good or bad really depends on the use case. I have opened https://github.com/quarkusio/quarkus/issues/38968 to give some options to Quarkus users, perhaps it would make sense to implement something similar directly on GraalVM.
The Infinispan project previously used `quay.io/quarkus/ubi-quarkus-native-image:22.3-java17` as a builder image to create various native components for both `arm64` and `amd64` architectures. The total time taken for all of our images was ~30 mins.

In order to use the latest GraalVM JDK 21 distribution, I have updated the builder image to be based upon `quay.io/quarkus/ubi-quarkus-graalvmce-builder-image:jdk-21`. However this has dramatically slowed down our image build time, with all our images now taking ~4 hours. Upon further investigation, it seems this is specifically caused by the `arm64` builder image, as only building for `amd64` results in the build time coming back down to ~16 mins.

Has anything changed between `ubi-quarkus-native-image` and `ubi-quarkus-graalvmce-builder-image` that could explain this increased build time, or is the culprit more likely to be GraalVM itself?