quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io

`ImageMetricsITCase.verifyImageMetrics` failures in `main` #38919

Closed. zakkak closed this issue 9 months ago.

zakkak commented 9 months ago

Describe the bug

jpa-postgresql-withxml fails in the ImageMetricsITCase.verifyImageMetrics test

See https://github.com/quarkusio/quarkus/actions/runs/7985178432/job/21803563937#step:16:776

The regression was introduced in https://github.com/quarkusio/quarkus/pull/38820

Note: As the changes are in vertx-http, this probably affects other tests as well, but the impact does not appear to be big enough to trigger a test failure for them.

Expected behavior

The test should pass.

Actual behavior

The test fails with:

org.opentest4j.AssertionFailedError: Expected analysis_results.fields.reflection to be within range [163 +- 3%] but was 168 ==> expected: <true> but was: <false>
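
For context, the check boils down to a relative-tolerance assertion of roughly this shape (a minimal sketch for illustration, not the actual ImageMetricsITCase code):

```java
// Sketch of a "within X%" check mirroring the failure message above
// (hypothetical helper, not the real test implementation).
static void assertWithinRange(String key, long threshold, double tolerancePercent, long actual) {
    double delta = threshold * tolerancePercent / 100.0;
    boolean inRange = actual >= threshold - delta && actual <= threshold + delta;
    if (!inRange) {
        throw new AssertionError("Expected " + key + " to be within range ["
                + threshold + " +- " + (int) tolerancePercent + "%] but was " + actual);
    }
}
```

With a threshold of 163 and a 3% tolerance, the accepted range is roughly [158.1, 167.9], so the observed value of 168 falls just outside it.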

How to Reproduce?

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml -Dnative.surefire.skip -Dformat.skip -Dno-descriptor-tests clean verify -Dtest-containers -Dstart-containers

Output of uname -a or ver

No response

Output of java -version

No response

Mandrel or GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

After https://github.com/quarkusio/quarkus/pull/38820 I see the following differences:

diff --git a/old.json b/new.json
index c60df879106..b1c92fa16f2 100644
--- a/old.json
+++ b/new.json
@@ -2,30 +2,30 @@
   "resource_usage": {
     "memory": {
       "system_total": 66506358784,
-      "peak_rss_bytes": 6114426880
+      "peak_rss_bytes": 8380002304
     },
     "garbage_collection": {
-      "count": 119,
-      "total_secs": 3.413,
-      "max_heap": 8176271360
+      "count": 66,
+      "total_secs": 2.625,
+      "max_heap": 28445245440
     },
     "cpu": {
-      "load": 12.15073613200237,
+      "load": 12.75164113785558,
       "parallelism": 32,
       "total_cores": 32
     },
-    "total_secs": 43.448367164000004
+    "total_secs": 43.512096230000004
   },
   "image_details": {
     "code_area": {
-      "bytes": 44577920,
-      "compilation_units": 66680
+      "bytes": 44601456,
+      "compilation_units": 66734
     },
-    "total_bytes": 93915968,
+    "total_bytes": 93952832,
     "image_heap": {
-      "bytes": 48922624,
+      "bytes": 48934912,
       "objects": {
-        "count": 485976
+        "count": 486273
       },
       "resources": {
         "bytes": 1616088,
@@ -47,29 +47,29 @@
   },
   "analysis_results": {
     "types": {
-      "total": 23615,
+      "total": 23629,
       "reflection": 6697,
       "jni": 61,
-      "reachable": 20816
+      "reachable": 20830
     },
     "methods": {
       "foreign_downcalls": -1,
-      "total": 188214,
-      "reflection": 4899,
+      "total": 188269,
+      "reflection": 4900,
       "jni": 55,
-      "reachable": 102637
+      "reachable": 102708
     },
     "classes": {
-      "total": 23615,
+      "total": 23629,
       "reflection": 6697,
       "jni": 61,
-      "reachable": 20816
+      "reachable": 20830
     },
     "fields": {
-      "total": 49625,
-      "reflection": 166,
+      "total": 49643,
+      "reflection": 168,
       "jni": 61,
-      "reachable": 30008
+      "reachable": 30032
     }
   }
 }

@cescoffier do the above diffs look reasonable given your changes?

quarkus-bot[bot] commented 9 months ago

/cc @Karm (mandrel), @ebullient (metrics), @galderz (mandrel), @jmartisk (metrics)

gsmet commented 9 months ago

This doesn't look pretty:

-      "peak_rss_bytes": 6114426880
+      "peak_rss_bytes": 8380002304

but are we sure they are solely due to Clément's PR?

zakkak commented 9 months ago

but are we sure they are solely due to Clément's PR?

That was the only difference between the two runs (I tested with 31fcfc1088363cabb618a5af7f90038a160961bf and fd46a61cc74f10fbadd01737c9e05422f672a727), so I would say yes. Note however that the peak RSS is going up and down between different builds and it's not related to the PR. I am running on a machine with 62GB of RAM with no constraints on the max heap size, so the GC is free to use as much RAM as it wants according to its heuristics which are not deterministic.

For instance running with 31fcfc1088363cabb618a5af7f90038a160961bf 3 times I get:

  1. 6837665792
  2. 7617609728
  3. 12387741696

(Notice that the last run has almost double the peak RSS of the first run without any changes in the code.)
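
If peak RSS needed to be more comparable across runs, one option would be to cap the heap used by the native-image build, e.g. via the quarkus.native.native-image-xmx property (the 8g value below is only an example, not a recommendation):

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml -Dquarkus.native.native-image-xmx=8g -Dnative.surefire.skip -Dformat.skip clean verify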

gsmet commented 9 months ago

@zakkak could we either raise the threshold or disable the test for now? It's very annoying to have all our builds failing because of this.

zakkak commented 9 months ago

I guess the lesser evil would be to adjust the thresholds and keep this issue open until we verify that the new thresholds are OK. This way we won't risk missing another increase due to other changes.

Note that the best option would have been to not merge https://github.com/quarkusio/quarkus/pull/38820 in the first place without handling this, since the test was already failing in the PR.

cescoffier commented 9 months ago

The PR in question does not seem related. I will have another look, but it is basically just tests.

My guess is that the true cause is the hot reload which must keep a few objects in memory, including the content of the certificates. Unfortunately, there is no solution around that.

That being said, this only happens if certificate hot reload is enabled, which is not the case in this test. But there are a few more static methods and fields that could explain it (even if they should be very small when hot reload is not enabled).

cescoffier commented 9 months ago

Ok, so it's a build-time RSS increase, not a runtime one. That makes more sense.

How can I produce the reports from above? The PR introduces an early usage of the Mutiny API and Vert.x scheduled tasks (which then use Netty underneath). That may explain it.
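
For illustration only (a hypothetical sketch, not the actual vertx-http change), "early usage of Mutiny and Vert.x scheduled tasks" could look something like this, which is enough to pull Mutiny and Netty scheduling classes into the image's reachable set:

```java
import io.smallrye.mutiny.Uni;
import io.vertx.core.Vertx;

// Hypothetical example: eagerly scheduling a periodic task through Mutiny
// on top of Vert.x (which relies on Netty's event loop underneath).
public class PeriodicTaskSketch {

    public static void schedule(Vertx vertx, long periodMs, Runnable action) {
        vertx.setPeriodic(periodMs, id ->
                Uni.createFrom().voidItem()
                        .invoke(action::run)
                        .subscribe().with(ignored -> { }, Throwable::printStackTrace));
    }
}
```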

cescoffier commented 9 months ago

I'm confused - but I may not understand what that test is actually doing.

I get: ImageMetricsITCase.verifyImageMetrics:15 Expected analysis_results.types.jni to be within range [63 +- 1%] but was 61 ==> expected: <true> but was: <false>

Seems related to JNI? It seems I'm under the previous value (isn't that a good thing?)

cescoffier commented 9 months ago

So... with a more recent version of GraalVM, I got something closer to what I expected.

cescoffier commented 9 months ago

So, I'm unable to reproduce it locally with:

native-image 21 2023-09-19
GraalVM Runtime Environment GraalVM CE 21+35.1 (build 21+35-jvmci-23.1-b15)
Substrate VM GraalVM CE 21+35.1 (build 21+35, serial gc)

I can try to remove the usage of Mutiny, but it's somewhat weird, as I would need to use CS/CF (CompletionStage/CompletableFuture), which are WAY more error-prone.

zakkak commented 9 months ago

How can I produce the reports from above?

They are generated when building the native image, in a file called quarkus-integration-test-jpa-postgresql-withxml-999-SNAPSHOT-runner-build-output-stats.json

I used:

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml clean package -Dnative.surefire.skip -Dquarkus.native.container-build=true -Dquarkus.native.container-runtime=podman

So.... with a more recent version of Graalvm, I got something more expected.

Please use the jdk-21 Mandrel builder image, as the thresholds are version-specific.
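
For reference, pointing the container build at that builder image looks roughly like this (the exact image tag is an assumption; adjust it to the current jdk-21 Mandrel builder image):

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml clean package -Dnative.surefire.skip -Dquarkus.native.container-build=true -Dquarkus.native.container-runtime=podman -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21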

cescoffier commented 9 months ago

Ok, I now get the same exception as you do. So, it's not a problem with GraalVM CE, but only Mandrel (or the in-container build using Mandrel). That will not be easy to understand.

The removal of Mutiny does not help. So, it's something else.

cescoffier commented 9 months ago

@zakkak Ok, got the test to pass:

I removed the Mutiny usage (sad), a record (useless anyway), and the Java stream usage.

cescoffier commented 9 months ago

@zakkak See https://github.com/quarkusio/quarkus/pull/39026

zakkak commented 9 months ago

So, it's not a problem with GraalVM CE, but only Mandrel (or the in-container build using Mandrel). That will not be easy to understand.

That's because Mandrel uses a different JDK as its base (OpenJDK vs LabsJDK), so there are some more differences in the JSON. The thresholds, however, are meant to test Quarkus' default builder image (i.e. Mandrel at the moment).

@zakkak See https://github.com/quarkusio/quarkus/pull/39026