graalvm / mandrel

Mandrel is a downstream distribution of the GraalVM community edition. Mandrel's main goal is to provide a native-image release specifically to support Quarkus.

Investigate aarch64 app startup/time to serve first HTTP request #476

Open Karm opened 1 year ago

Karm commented 1 year ago

Weird perf results on our fast baremetal boxes:

https://ci.modcluster.io/view/Mandrel/job/mandrel-linux-integration-tests/955/JDK_RELEASE=ga,JDK_VERSION=11,LABEL=el8_aarch64,MANDREL_BUILD=mandrel-21-3-linux-build-matrix,QUARKUS_VERSION=2.7.6.Final/

https://ci.modcluster.io/view/Mandrel/job/mandrel-linux-integration-tests/955/JDK_RELEASE=ga,JDK_VERSION=17,LABEL=el8_aarch64,MANDREL_BUILD=mandrel-22-3-linux-build-matrix,QUARKUS_VERSION=2.13.6.Final/

It looks like it takes both Quarkus and Helidon a long time to start?

This is profoundly unexpected, as the dummy startup-time threshold was calibrated on a slow VM. It should pass by a huge margin on a fast baremetal system that has nothing else to do...

jerboaa commented 1 year ago

For posterity, fails with (for 22.3):

08:35:25 Finished generating 'target/debug-symbols-smoke' in 19.8s.
08:35:36 [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 189.234 s - in org.graalvm.tests.integration.DebugSymbolsTest
08:35:37 [INFO] 
08:35:37 [INFO] Results:
08:35:37 [INFO] 
08:35:37 [ERROR] Failures: 
08:35:37 [ERROR]   RuntimesSmokeTest.helidonQuickStart:224->testRuntime:172 Application HELIDON_QUICKSTART_SE took 176 ms to get the first OK request, which is over 100 ms threshold by 76%. ==> expected: <true> but was: <false>
08:35:37 [INFO] 
08:35:37 [ERROR] Tests run: 19, Failures: 1, Errors: 0, Skipped: 7
08:35:37 [INFO] 
08:35:37 [INFO] ------------------------------------------------------------------------
08:35:37 [INFO] Reactor Summary for Native image integration TS 1.0.0-SNAPSHOT:

The 21.3 failure is:

08:49:47 [INFO] Results:
08:49:47 [INFO] 
08:49:47 [ERROR] Failures: 
08:49:47 [ERROR]   RuntimesSmokeTest.quarkusFullMicroProfile:201->testRuntime:172 Application QUARKUS_FULL_MICROPROFILE took 319 ms to get the first OK request, which is over 300 ms threshold by 6%. ==> expected: <true> but was: <false>
08:49:47 [INFO] 
08:49:47 [ERROR] Tests run: 19, Failures: 1, Errors: 0, Skipped: 7
08:49:47 [INFO] 
08:49:47 [INFO] ------------------------------------------------------------------------
08:49:47 [INFO] Reactor Summary for Native image integration TS 1.0.0-SNAPSHOT:
08:49:47 [INFO] 
08:49:47 [INFO] Native image integration TS ........................ SUCCESS [  0.102 s]
08:49:47 [INFO] testsuite .......................................... FAILURE [13:53 min]
08:49:47 [INFO] ------------------------------------------------------------------------
08:49:47 [INFO] BUILD FAILURE
08:49:47 [INFO] ------------------------------------------------------------------------

Both run on RHEL 8, which AFAIK uses a 64k page size by default. We ought to run it on RHEL 9 as well to see if there is a difference; RHEL 9 defaults to a 4k page size on aarch64. This might explain some of the startup differences.

Karm commented 1 year ago

@jerboaa It is definitely the case. RHEL 9 (getconf PAGE_SIZE: 4096) exceeds the RSS threshold by about 93% (not a typo), while RHEL 8 (65536 page size) is only ~10% slower than the startup threshold.

I'm reading https://www.kernel.org/doc/html/latest/arm64/memory.html ... it seems counter-intuitive to me that smaller pages would make for more fragmentation, more RSS?

It's this situation though:

When the guest uses 4k pages and the host uses 64k pages, it only works if the guest 
reports multiple contiguous 4k pages that form a 64k page -- which is often the case, 
but not always. The host will discard a whole 64k page only once it has collected all of its 4k pages.

It is obviously not specific to Quarkus Native. I'd like to narrow it down to a clear recommendation we could put in writing in https://quarkus.io/guides/native-reference.

The host is CentOS 8 and the guest is CentOS 9; I wonder whether I should move to CentOS 9 altogether...

github-actions[bot] commented 1 year ago

This issue appears to be stale because it has been open for 30 days with no activity. It will be closed in 7 days unless the Stale label is removed, a new comment is made, or the not-Stale label is added.