ibmruntimes / Semeru-Runtimes

Issue repo for all things IBM Semeru Runtimes
14 stars 4 forks source link

Performance impact on Liberty InstantOn server startup time when enable semeruCloudCompiler #73

Open tam512 opened 9 months ago

tam512 commented 9 months ago

Test InstantOn on Power9 OCP 4.14, when semeruCloudCompiler is not enabled, server startup time is 0.290 seconds, when semeruCloudCompiler is enabled, server startup time is 1.312 seconds

spec:
  semeruCloudCompiler:
    enable: true
    replicas: 1

CHECKPOINT RESTORE: Start and elapsed time: startTime=2650976150, elapsedTime= 5689

CHECKPOINT RESTORE: Reset start and elapsed time: startTime=2992263392, elapsedTime= 0

[2/16/24, 14:37:59:906 GMT] 00000043 com.ibm.ws.kernel.launch.internal.FrameworkManager A Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240211-1900) on Eclipse OpenJ9 VM, version 21.0.1+12-LTS (en_US) [2/16/24, 14:37:59:919 GMT] 00000043 com.ibm.ws.config.xml.internal.ConfigRefresher A CWWKG0016I: Starting server configuration update. [2/16/24, 14:37:59:920 GMT] 00000043 com.ibm.ws.config.xml.internal.ServerXMLConfiguration A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml [2/16/24, 14:37:59:921 GMT] 00000043 com.ibm.ws.config.xml.internal.ServerXMLConfiguration A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml [2/16/24, 14:37:59:923 GMT] 00000043 com.ibm.ws.config.xml.internal.ServerXMLConfiguration A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml [2/16/24, 14:37:59:937 GMT] 00000043 com.ibm.ws.config.xml.internal.ConfigValidator A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration. Property trustDefaultCerts has conflicting values: Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml. Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml. Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.

JITServer: t= 1078 Connected to a server (serverUID=11924859999837345852)

[2/16/24, 14:38:00:995 GMT] 00000045 com.ibm.ws.config.xml.internal.ConfigRefresher A CWWKG0017I: The server configuration was successfully updated in 1.078 seconds. [2/16/24, 14:38:00:997 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/jwt/ [2/16/24, 14:38:00:998 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/badapp/ [2/16/24, 14:38:00:998 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/openapi/ui/ [2/16/24, 14:38:00:999 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/eBuy-ext/ [2/16/24, 14:38:00:999 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/metrics/ [2/16/24, 14:38:00:999 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/health/ [2/16/24, 14:38:00:999 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/svtMessageApp/ [2/16/24, 14:38:00:999 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/microwebapp/ [2/16/24, 14:38:01:000 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/eBuy/ [2/16/24, 14:38:01:000 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/ibm/api/ [2/16/24, 14:38:01:000 GMT] 00000043 com.ibm.ws.http.internal.VirtualHostImpl A CWWKT0016I: Web application available (default_host): http://ebuy-olk24002-j21-0.ebuy-olk24002-j21-headless.ebuy-olo.svc.cluster.local:9080/openapi/ [2/16/24, 14:38:01:024 GMT] 00000048 com.ibm.ws.security.token.ltpa.internal.LTPAKeyCreateTask I CWWKS4105I: LTPA configuration is ready after 0.035 seconds. [2/16/24, 14:38:01:029 GMT] 00000043 io.openliberty.checkpoint.internal.CheckpointImpl A CWWKC0452I: The Liberty server process resumed operation from a checkpoint in 1.232 seconds. [2/16/24, 14:38:01:089 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper A CWWKZ0001I: Application svtMessageApp-2.0.2 started in 1.292 seconds. [2/16/24, 14:38:01:089 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper A CWWKZ0001I: Application badapp-2.0.1 started in 1.292 seconds. [2/16/24, 14:38:01:090 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper A CWWKZ0001I: Application microwebapp-2.0.1 started in 1.293 seconds. [2/16/24, 14:38:01:090 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper A CWWKZ0001I: Application ebuy started in 1.293 seconds. [2/16/24, 14:38:01:092 GMT] 00000043 com.ibm.ws.tcpchannel.internal.TCPPort I CWWKO0219I: TCP Channel defaultHttpEndpoint has been started and is now listening for requests on host (IPv6) port 9080. [2/16/24, 14:38:01:108 GMT] 00000043 com.ibm.ws.kernel.feature.internal.FeatureManager A CWWKF0012I: The server installed the following features: [appSecurity-4.0, cdi-3.0, concurrent-2.0, distributedMap-1.0, enterpriseBeansLite-4.0, expressionLanguage-4.0, jdbc-4.2, jndi-1.0, json-1.0, jsonb-2.0, jsonp-2.0, jwt-1.0, microProfile-5.0, monitor-1.0, mpConfig-3.0, mpFaultTolerance-4.0, mpHealth-4.0, mpJwt-2.0, mpMetrics-4.0, mpOpenAPI-3.0, mpOpenTracing-3.0, mpRestClient-3.0, pages-3.0, persistence-3.0, persistenceContainer-3.0, restfulWS-3.0, restfulWSClient-3.0, servlet-5.0, ssl-1.0, transportSecurity-1.0, xmlBinding-3.0]. [2/16/24, 14:38:01:109 GMT] 00000043 com.ibm.ws.kernel.feature.internal.FeatureManager I CWWKF0008I: Feature update completed in 1.312 seconds. [2/16/24, 14:38:01:109 GMT] 00000043 com.ibm.ws.kernel.feature.internal.FeatureManager A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 1.312 seconds. [2/16/24, 14:38:01:261 GMT] 0000003f com.ibm.ws.ssl.config.WSKeyStore I Successfully loaded default keystore: /opt/ol/wlp/output/defaultServer/resources/security/key.p12 of type: PKCS12 [2/16/24, 14:38:01:292 GMT] 0000003f com.ibm.ws.security.mp.jwt.impl.MicroProfileJwtServiceImpl I CWWKS5500I: The MicroProfile JWT configuration [MicroProfileJwtService] was successfully processed. [2/16/24, 14:38:01:293 GMT] 0000003f com.ibm.ws.security.mp.jwt.impl.MicroProfileJwtConfigImpl I CWWKS5500I: The MicroProfile JWT configuration [defaultMpJwt] was successfully processed. [2/16/24, 14:38:01:308 GMT] 00000066 com.ibm.ws.tcpchannel.internal.TCPPort I CWWKO0219I: TCP Channel defaultHttpEndpoint-ssl has been started and is now listening for requests on host (IPv6) port 9443.

- lscpu info on the Power9 OCP  

sh-4.4# lscpu Architecture: ppc64le Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 8 NUMA node(s): 1 Model: 2.3 (pvr 004e 1203) Model name: POWER9 (architected), altivec supported Hypervisor vendor: KVM Virtualization type: para L1d cache: 32K L1i cache: 32K NUMA node0 CPU(s): 0-7



- I also tested InstantOn + semeruCloudCompiler on Azure Kubernetes Service (AKS), and see less performance impact.  On AKS, server startup time without semeruCloudCompiler was 0.299 seconds and 0.341 seconds with semeruCloudCompiler
mpirvu commented 9 months ago

@tam512 Could you please post a similar for x86 (Azure)?

From the Power log we see that more than 1 second is spent between these events:

[2/16/24, 14:37:59:937 GMT] 00000043 com.ibm.ws.config.xml.internal.ConfigValidator               A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration.
  Property trustDefaultCerts has conflicting values:
    Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml.
    Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml.
  Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.

#JITServer: t=  1078 Connected to a server (serverUID=11924859999837345852)
[2/16/24, 14:38:00:995 GMT] 00000045 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0017I: The server configuration was successfully updated in 1.078 seconds.

I wonder if Found conflicting settings for defaultSSLConfig instance of ssl configuration appears on x86 as well.

In between those two Liberty events there is a (successful) attempt to connect to the server. It's not clear to me whether that's creating a problem.

tam512 commented 9 months ago

Log from AKS

[2/16/24, 15:35:54:184 GMT] 00000037 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[2/16/24, 15:35:54:185 GMT] 00000037 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
#JITServer: t=    30 Connected to a server (serverUID=11837530244278472918)
[2/16/24, 15:35:54:195 GMT] 00000037 com.ibm.ws.config.xml.internal.ConfigValidator               A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration.
  Property trustDefaultCerts has conflicting values:
    Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml.
    Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml.
  Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.

[2/16/24, 15:35:54:213 GMT] 0000003a com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0017I: The server configuration was successfully updated in 0.032 seconds.
mpirvu commented 9 months ago

So, the server configuration update takes 1078 ms on Power and only 32 ms on x86.

tam512 commented 9 months ago

On x86 OCP 4.14 cluster, I also see this performance issue. Here is partial log

#CHECKPOINT RESTORE: Start and elapsed time: startTime=3273250126, elapsedTime=  7959
#CHECKPOINT RESTORE: Reset start and elapsed time: startTime=3537997125, elapsedTime=     0
[2/22/24, 22:13:34:709 UTC] 00000036 com.ibm.ws.kernel.launch.internal.FrameworkManager           A Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240212-1928) on Eclipse OpenJ9 VM, version 17.0.10+7 (en_US)
#JITServer: t=  1069 Connected to a server (serverUID=1909185955827864273)
[2/22/24, 22:13:34:716 UTC] 00000036 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0016I: Starting server configuration update.
[2/22/24, 22:13:34:716 UTC] 00000036 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[2/22/24, 22:13:34:717 UTC] 00000036 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[2/22/24, 22:13:34:717 UTC] 00000036 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
[2/22/24, 22:13:34:724 UTC] 00000036 com.ibm.ws.config.xml.internal.ConfigValidator               A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration.
  Property trustDefaultCerts has conflicting values:
    Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml.
    Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml.
  Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.
.............
[2/22/24, 22:13:34:779 UTC] 00000036 com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 1.262 seconds.
mpirvu commented 8 months ago

Running in a Microk8s Kubernetes cluster with LibertyKernel 24.0.0.2 and AcmeAirEE8 and InstantON and OpenJ9 Java21 private build and OLO I get

#CHECKPOINT RESTORE: Start and elapsed time: startTime=801781340, elapsedTime=  3018
#CHECKPOINT RESTORE: Reset start and elapsed time: startTime=806529622, elapsedTime=  3018
#CHECKPOINT RESTORE: Resuming IProfiler Thread from Checkpoint
#CHECKPOINT RESTORE: Resuming Sampler Thread from Checkpoint
#CHECKPOINT RESTORE: Resetting Sampling Thread Lifetime State
#CHECKPOINT RESTORE: Resetting IProfier Thread Lifetime State
#JITServer: t=  3038 Connected to a server (serverUID=14221984176625741787)
[AUDIT   ] Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240212-1928) on Eclipse OpenJ9 VM, version 21.0.3-internal-adhoc.root.openj9-openjdk-jdk21-ssl (en_US)
[AUDIT   ] CWWKG0016I: Starting server configuration update.
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
[AUDIT   ] CWWKT0016I: Web application available (default_host): http://acmeair-olo-78d8b6658-zxl9x:9080/
[AUDIT   ] CWWKG0017I: The server configuration was successfully updated in 0.032 seconds.
[AUDIT   ] CWWKC0452I: The Liberty server process resumed operation from a checkpoint in 0.322 seconds.
[AUDIT   ] CWWKZ0001I: Application acmeair-monolithic started in 0.326 seconds.
[INFO    ] CWWKO0219I: TCP Channel defaultHttpEndpoint has been started and is now listening for requests on host *  (IPv4) port 9080.
[AUDIT   ] CWWKF0012I: The server installed the following features: [cdi-2.0, jaxrs-2.1, jaxrsClient-2.1, jndi-1.0, jsonp-1.1, servlet-4.0].
[INFO    ] CWWKF0008I: Feature update completed in 0.341 seconds.
[AUDIT   ] CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 0.341 seconds.
[AUDIT   ] CWWKG0016I: Starting server configuration update.
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[AUDIT   ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
[AUDIT   ] CWWKG0018I: The server configuration was not updated. No functional changes were detected.
[AUDIT   ] CWWKT0017I: Web application removed (default_host): http://acmeair-olo-78d8b6658-zxl9x:9080/
[AUDIT   ] CWWKZ0009I: The application acmeair-monolithic has stopped successfully.
[INFO    ] CWWKZ0018I: Starting application acmeair-monolithic.
[INFO    ] CWWKZ0136I: The acmeair-monolithic application is using the archive file at the /opt/ol/wlp/usr/servers/defaultServer/apps/acmeair-java-2.0.0-SNAPSHOT.war location.
[WARNING ] CWOWB1001W: Web Application Archive acmeair-java-2.0.0-SNAPSHOT#acmeair-java-2.0.0-SNAPSHOT.war contains more than one beans.xml file. Using WEB-INF/beans.xml. Ignoring WEB-INF/classes/META-INF/beans.xml.
[INFO    ] SRVE0169I: Loading Web Module: acmeair-webapp.
[INFO    ] SRVE0250I: Web Module acmeair-webapp has been bound to default_host.
[AUDIT   ] CWWKT0016I: Web application available (default_host): http://acmeair-olo-78d8b6658-zxl9x:9080/
[INFO    ] SESN0176I: A new session context will be created for application key default_host/
[INFO    ] SESN0172I: The session manager is using the Java default SecureRandom implementation for session ID generation.
[AUDIT   ] CWWKZ0003I: The application acmeair-monolithic updated in 0.364 seconds.

so I cannot reproduce the issue.
@tam512 could you please send me the yaml file that you deploy in this test? Thanks

ymanton commented 8 months ago

@mpirvu noticed a discrepancy in the logs; the timestamps line up with the reported times on Power, but not on x86.

On Power:

Found mounted TLS certificates, generating keystore

#CHECKPOINT RESTORE: Start and elapsed time: startTime=2650976150, elapsedTime=  5689
#CHECKPOINT RESTORE: Reset start and elapsed time: startTime=2992263392, elapsedTime=     0
[2/16/24, 14:37:59:906 GMT] 00000043 com.ibm.ws.kernel.launch.internal.FrameworkManager           A Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240211-1900) on Eclipse OpenJ9 VM, version 21.0.1+12-LTS (en_US)

...

[2/16/24, 14:38:00:995 GMT] 00000045 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0017I: The server configuration was successfully updated in 1.078 seconds.

...

[2/16/24, 14:38:01:029 GMT] 00000043 io.openliberty.checkpoint.internal.CheckpointImpl            A CWWKC0452I: The Liberty server process resumed operation from a checkpoint in 1.232 seconds.
[2/16/24, 14:38:01:089 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application svtMessageApp-2.0.2 started in 1.292 seconds.
[2/16/24, 14:38:01:089 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application badapp-2.0.1 started in 1.292 seconds.
[2/16/24, 14:38:01:090 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application microwebapp-2.0.1 started in 1.293 seconds.
[2/16/24, 14:38:01:090 GMT] 00000043 com.ibm.ws.app.manager.AppMessageHelper                      A CWWKZ0001I: Application ebuy started in 1.293 seconds.

...

[2/16/24, 14:38:01:109 GMT] 00000043 com.ibm.ws.kernel.feature.internal.FeatureManager            I CWWKF0008I: Feature update completed in 1.312 seconds.
[2/16/24, 14:38:01:109 GMT] 00000043 com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 1.312 seconds.

Time timestamps go from 14:37:59:906 to 14:38:01:109 which is 1.203s, which is in line with the reported times.

For x86:

[2/22/24, 22:13:34:709 UTC] 00000036 com.ibm.ws.kernel.launch.internal.FrameworkManager           A Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240212-1928) on Eclipse OpenJ9 VM, version 17.0.10+7 (en_US)

...

[2/22/24, 22:13:34:779 UTC] 00000036 com.ibm.ws.kernel.feature.internal.FeatureManager            A CWWKF0011I: The defaultServer server is ready to run a smarter planet. The defaultServer server started in 1.262 seconds.

The timestamps show 70ms elapsed but the reported times are similar to Power. Not exactly sure how to reconcile this. Is it possible for the log lines to be internally buffered and therefore delayed on x86? 70ms elapsed seems too fast for all the post-restore activity to have happened. @tjwatson any insight?

ymanton commented 8 months ago

The reported time looks accurate, regardless of the timestamps on the log, from what I can tell.

The reported time is calculated by Liberty by taking the current time and subtracting the timestamp in ${SERVER_OUTPUT_DIR}/workarea/checkpoint/restoreTime, which is created by the server script shortly before criu restore ... is invoked.

I can see from the CRIU restore log that the process is restored in about 300 ms, so the remaining time must be taken by the JVM and Liberty doing post-restore activities. I would guess that latency between the application JVM and compilation server is the significant factor here, but where that latency is coming from isn't clear to me. As far as I can tell the application JVM and cloud compiler containers are actually on the same OCP worker, and therefore the same physical machine, so there is no real network between the two.

mpirvu commented 8 months ago

@ymanton how does the log look like when JITServer is not enabled?

tjwatson commented 8 months ago

The timestamps show 70ms elapsed but the reported times are similar to Power. Not exactly sure how to reconcile this. Is it possible for the log lines to be internally buffered and therefore delayed on x86? 70ms elapsed seems too fast for all the post-restore activity to have happened. @tjwatson any insight?

Sorry for the late reply. I would attribute this to the fact that the server script for Liberty records the start time before invoking criu restore. We do not print that first log message with the initial timestamp until after criu has restored the process and the JVM has completed the JVM restore process and hands execution back to Liberty to log that first message. So that time may be the discrepancy you observe here.

mpirvu commented 8 months ago

@tam512 If I gave you a base liberty image with a customized JDK would you be able to create the Liberty+Apps image to try in this OpenShift environment? Alternatively, I would need some info on how to package the apps myself.

tam512 commented 8 months ago

@mpirvu yes I can build a new app image with your customized Liberty image

mpirvu commented 8 months ago

A new log from @tam512 shows

Found mounted TLS certificates, generating keystore

#CHECKPOINT RESTORE: processOptionsPostRestore took 8 ms
#CHECKPOINT RESTORE: Start and elapsed time: startTime=1575235713, elapsedTime=  9552
#CHECKPOINT RESTORE: Reset start and elapsed time: startTime=1636366157, elapsedTime=  9552
#CHECKPOINT RESTORE: Resuming Sampler Thread from Checkpoint
#CHECKPOINT RESTORE: Resuming IProfiler Thread from Checkpoint
#CHECKPOINT RESTORE: Resetting Sampling Thread Lifetime State
#CHECKPOINT RESTORE: Resetting IProfier Thread Lifetime State
[3/21/24, 15:02:39:526 UTC] 00000037 com.ibm.ws.kernel.launch.internal.FrameworkManager           A Launching defaultServer (Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240212-1928) on Eclipse OpenJ9 VM, version 17.0.11-internal+0-adhoc..openj9-openjdk-jdk17 (en_US)
[3/21/24, 15:02:39:535 UTC] 00000037 com.ibm.ws.config.xml.internal.ConfigRefresher               A CWWKG0016I: Starting server configuration update.
[3/21/24, 15:02:39:536 UTC] 00000037 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[3/21/24, 15:02:39:536 UTC] 00000037 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/defaults/open-default-port.xml
[3/21/24, 15:02:39:537 UTC] 00000037 com.ibm.ws.config.xml.internal.ServerXMLConfiguration        A CWWKG0093A: Processing configuration drop-ins resource: /opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml
[3/21/24, 15:02:39:549 UTC] 00000037 com.ibm.ws.config.xml.internal.ConfigValidator               A CWWKG0102I: Found conflicting settings for defaultSSLConfig instance of ssl configuration.
  Property trustDefaultCerts has conflicting values:
    Value true is set in file:/opt/ol/wlp/usr/servers/defaultServer/server.xml.
    Value ${SEC_TLS_TRUSTDEFAULTCERTS} is set in file:/opt/ol/wlp/usr/servers/defaultServer/configDropins/overrides/truststore.xml.
  Property trustDefaultCerts will be set to ${SEC_TLS_TRUSTDEFAULTCERTS}.

#JITServer: t= 10594 Connected to a server (serverUID=15100760048119106555)

processOptionsPostRestore took only 8 ms so this rules out the SSL initialization (keys ans such) that is done once post restore. There is a 1 sec delay between the moment we setup the internal clock (9552 ms) and the time the first SSL connection is established (10594). Additional logging should show whether this comes from establishing the SSL connection.

mpirvu commented 8 months ago

More instrumentation shows that the vast majority of the delay comes from establishing a typical TCP connection to the server.

#JITServer: Performed connect() in 1016 ms  delta=1016815 us
#JITServer: SSL connection timing: BIO_new_ssl: 33 us, BIO_ctrl: 0 us, SSL_set_fd: 4 us, SSL_connect: 3433 us, get_peer_certificate: 0 us, get_verify_result: 0 us
ymanton commented 8 months ago

Some new data:

First, here's the breakdown of the compilation activity between when the process is restored and when the server reports startup is complete:

InstantOn + Cloud Compiler:

     # opt-level time (us) method-name
     1  cold 1083468 jdk/internal/reflect/AccessorGenerator.getClassName(Ljava/lang/Class;Z)Ljava/lang/String;
     2  cold 3352 sun/nio/ch/FileDispatcherImpl.size0(Ljava/io/FileDescriptor;)J
     3  cold 2295 sun/nio/ch/FileDispatcherImpl.seek0(Ljava/io/FileDescriptor;J)J
     4  cold 6116 com/ibm/ws/config/xml/internal/variables/ConfigVariableRegistry.lookupVariable(Ljava/lang/String;)Ljava/lang/String;
     5  cold 6160 com/ibm/ws/config/xml/internal/ConfigComparator.hasVariable(Ljava/lang/String;Ljava/util/Map;)Z
     6  cold 2774 org/eclipse/equinox/metatype/impl/AttributeDefinitionImpl.getType()I
     7  cold 1042 org/eclipse/equinox/metatype/impl/AttributeDefinitionImpl.getCardinality()I
     8  cold 17363 com/ibm/ws/config/xml/internal/VariableEvaluator.processVariableLists(Ljava/lang/Object;Lcom/ibm/ws/config/xml/internal/metatype/ExtendedAttributeDefinition;Lcom/ibm/ws/config/xml/internal/EvaluationContext;Z)Ljava/lang/Object;
     9  cold 5751 openj9/internal/tools/attach/target/FileLock.lockFileImpl(Ljava/lang/String;IZ)J
    10  cold 3345 openj9/internal/tools/attach/target/IPC.openSemaphore(Ljava/lang/String;Ljava/lang/String;Z)I
    11  cold 809 openj9/internal/tools/attach/target/FileLock.unlockFileImpl(J)I
    12  cold 774 openj9/internal/tools/attach/target/IPC.mkdirWithPermissionsImpl(Ljava/lang/String;I)I
    13  cold 755 openj9/internal/tools/attach/target/IPC.chmod(Ljava/lang/String;I)I
    14  cold 785 openj9/internal/tools/attach/target/IPC.createFileWithPermissionsImpl(Ljava/lang/String;I)I
    15  cold 747 openj9/internal/tools/attach/target/IPC.waitSemaphore()I
    16  cold 4808 com/ibm/ws/config/xml/internal/ConfigEvaluator$EvaluationResult.getProperties()Ljava/util/Dictionary;
    17  cold 2880 java/net/NetworkInterface.getAll()[Ljava/net/NetworkInterface;
    18  cold 2690 java/net/NetworkInterface.isUp0(Ljava/lang/String;I)Z
    19  AOT load 143 sun/nio/ch/Util$BufferCache.next(I)I
    20  cold 2357 java/net/NetworkInterface.isLoopback0(Ljava/lang/String;I)Z
    21  cold 2253 java/net/NetworkInterface.isP2P0(Ljava/lang/String;I)Z
    22  cold 3522 java/net/Inet6AddressImpl.lookupAllHostAddr(Ljava/lang/String;)[Ljava/net/InetAddress;
    23  cold 5244 com/ibm/ws/config/xml/internal/EvaluationContext.getConfigElement()Lcom/ibm/ws/config/xml/internal/ConfigElement;
    24  cold 10389 com/ibm/ejs/util/FastStack.peek()Ljava/lang/Object;
    25  AOT load 320 java/security/Provider$Service.newInstance(Ljava/lang/Object;)Ljava/lang/Object;
    26  cold 7971 java/lang/ProcessEnvironment$ExternalData.<init>(Ljava/lang/String;[B)V
    27  cold 5742 java/lang/invoke/LambdaForm$DMH/0x00000000684f1020.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    28  cold 3479 java/lang/invoke/MethodHandles$Lookup$ClassFile.readUnsignedShort([BI)I

InstantOn only

     # opt-level time (us) method-name
     1  cold 4638 jdk/internal/reflect/AccessorGenerator.getClassName(Ljava/lang/Class;Z)Ljava/lang/String;
     2  cold 385 sun/nio/ch/FileDispatcherImpl.size0(Ljava/io/FileDescriptor;)J
     3  cold 301 sun/nio/ch/FileDispatcherImpl.seek0(Ljava/io/FileDescriptor;J)J
     4  cold 832 com/ibm/ws/config/xml/internal/variables/ConfigVariableRegistry.lookupVariable(Ljava/lang/String;)Ljava/lang/String;
     5  cold 834 com/ibm/ws/config/xml/internal/ConfigComparator.hasVariable(Ljava/lang/String;Ljava/util/Map;)Z
     6  cold 218 org/eclipse/equinox/metatype/impl/AttributeDefinitionImpl.getType()I
     7  cold 255 org/eclipse/equinox/metatype/impl/AttributeDefinitionImpl.getCardinality()I
     8  cold 279 com/ibm/ws/config/xml/internal/ConfigEvaluator$EvaluationResult.getProperties()Ljava/util/Dictionary;
     9  cold 325 openj9/internal/tools/attach/target/FileLock.lockFileImpl(Ljava/lang/String;IZ)J
    10  cold 282 openj9/internal/tools/attach/target/IPC.openSemaphore(Ljava/lang/String;Ljava/lang/String;Z)I
    11  cold 246 openj9/internal/tools/attach/target/FileLock.unlockFileImpl(J)I
    12  cold 364 openj9/internal/tools/attach/target/IPC.mkdirWithPermissionsImpl(Ljava/lang/String;I)I
    13  cold 261 openj9/internal/tools/attach/target/IPC.chmod(Ljava/lang/String;I)I
    14  cold 251 openj9/internal/tools/attach/target/IPC.createFileWithPermissionsImpl(Ljava/lang/String;I)I
    15  cold 238 openj9/internal/tools/attach/target/IPC.waitSemaphore()I
    16  cold 492 com/ibm/ejs/util/FastStack.peek()Ljava/lang/Object;
    17  cold 1678 java/lang/ProcessEnvironment$ExternalData.<init>(Ljava/lang/String;[B)V
    18  cold 664 java/net/NetworkInterface.getAll()[Ljava/net/NetworkInterface;
    19  AOT load 84 sun/nio/ch/Util$BufferCache.next(I)I
    20  cold 951 java/net/NetworkInterface.isUp0(Ljava/lang/String;I)Z
    21  cold 638 java/net/NetworkInterface.isLoopback0(Ljava/lang/String;I)Z
    22  cold 344 java/net/NetworkInterface.isP2P0(Ljava/lang/String;I)Z
    23  cold 282 java/net/Inet6AddressImpl.lookupAllHostAddr(Ljava/lang/String;)[Ljava/net/InetAddress;
    24  cold 722 java/lang/invoke/LambdaForm$DMH/0x00000000684f1020.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;
    25  cold 2744 java/lang/invoke/MethodHandles$Lookup$ClassFile.readUnsignedShort([BI)I
    26  cold 510 java/security/Provider$Service.getAlgorithm()Ljava/lang/String;
    27  cold 603 java/security/Provider$Service.getAliases()Ljava/util/List;
    28  cold 1355 java/security/Provider.checkInitialized()V
    29  AOT load 2525 java/security/Provider$Service.newInstance(Ljava/lang/Object;)Ljava/lang/Object;
    30  cold 788 java/lang/invoke/LambdaForm.parameter(I)Ljava/lang/invoke/LambdaForm$Name;
    31  cold 1424 com/sun/crypto/provider/AESCrypt.mul4(I[B)I
    32  cold 415 sun/nio/ch/Net.bind0(Ljava/io/FileDescriptor;ZZLjava/net/InetAddress;I)V
    33  cold 326 sun/nio/ch/Net.listen(Ljava/io/FileDescriptor;I)V
    34  cold 437 sun/nio/ch/Net.localInetAddress(Ljava/io/FileDescriptor;)Ljava/net/InetAddress;
    35  cold 410 sun/nio/ch/Net.localPort(Ljava/io/FileDescriptor;)I
    36  cold 263 sun/nio/ch/EPoll.wait(IJII)I
    37  cold 301 jdk/crypto/jniprovider/NativeCrypto.CreateContext()J
    38  cold 395 jdk/crypto/jniprovider/NativeCrypto.DigestCreateContext(JI)J
    39  AOT load 84 javax/crypto/Mac.chooseFirstProvider()V
    40  AOT load 39 sun/security/provider/NativeDigest.engineUpdate([BII)V
    41  AOT load 38 sun/security/provider/NativeDigest.engineDigest([BII)I
    42  AOT load 112 javax/crypto/Mac.update([B)V
    43  AOT load 104 com/sun/crypto/provider/HmacCore.engineUpdate([BII)V
    44  AOT load 46 java/security/MessageDigest.update([BII)V
    45  AOT load 66 javax/crypto/Mac.getMacLength()I
    46  AOT load 37 com/sun/crypto/provider/HmacCore.engineGetMacLength()I
    47  AOT load 41 java/security/MessageDigest.getDigestLength()I
    48  AOT load 27 java/security/MessageDigest$Delegate.engineGetDigestLength()I
    49  AOT load 20 sun/security/provider/NativeDigest.engineGetDigestLength()I
    50  AOT load 45 sun/security/provider/NativeDigest.engineDigest()[B
    51  AOT load 26 com/sun/crypto/provider/HmacCore.engineReset()V
    52  AOT load 84 javax/crypto/Mac.doFinal([BI)V
    53  AOT load 35 javax/crypto/Mac.doFinal()[B
    54  AOT load 86 com/sun/crypto/provider/HmacCore.engineDoFinal()[B
    55  AOT load 62 java/security/MessageDigest.digest([BII)I
    56  AOT load 44 java/security/MessageDigest$Delegate.engineDigest([BII)I
    57  cold 276 jdk/crypto/jniprovider/NativeCrypto.DigestUpdate(J[BII)I
    58  cold 285 jdk/crypto/jniprovider/NativeCrypto.DigestComputeAndReset(J[BII[BII)I
    59  AOT load 90 com/sun/crypto/provider/PBKDF2KeyImpl.deriveKey(Ljavax/crypto/Mac;[B[BII)[B
    60  cold 368 jdk/crypto/jniprovider/NativeCrypto.CBCUpdate(J[BII[BI)I
    61  cold 352 jdk/crypto/jniprovider/NativeCrypto.CBCInit(JI[BI[BIZ)I
    62  cold 3616 sun/security/util/DerValue.<init>([BIIZZ)V
    63  cold 342 jdk/crypto/jniprovider/NativeCrypto.PBEDerive([BI[BI[BIIII)I
    64  cold 2554 com/sun/crypto/provider/RC2Crypt.decryptBlock([BI[BI)V
    65  cold 4158 sun/security/util/DerInputStream.<init>([BIIZ)V
    66  cold 357 jdk/internal/misc/Unsafe.copySwapMemory0(Ljava/lang/Object;JLjava/lang/Object;JJJ)V
    67  cold 554 sun/security/util/DerValue.data()Lsun/security/util/DerInputStream;
    68  cold 345 sun/nio/ch/EventFD.set0(I)I
    69  cold 321 sun/nio/ch/IOUtil.drain(I)Z
    70  cold 1492 java/util/TreeMap$EntrySet.<init>(Ljava/util/TreeMap;)V
    71  cold 1094 org/eclipse/equinox/metatype/impl/ExtendableHelper.getExtensionUris()Ljava/util/Set;
    72  cold 1028 com/ibm/ws/config/xml/internal/metatype/ExtendedAttributeDefinitionImpl.<init>(Lorg/osgi/service/metatype/AttributeDefinition;)V
    73  cold 6239 java/util/concurrent/Phaser.doRegister(I)I
    74  cold 3985 java/util/concurrent/Phaser.doArrive(I)I
    75  cold 1334 com/ibm/ws/kernel/filemonitor/internal/MonitorHolder.scanForUpdates(Ljava/util/Collection;Ljava/util/Collection;Ljava/util/Collection;)V
    76  cold 878 java/lang/StrictMath.sqrt(D)D
    77  cold 3055 java/lang/invoke/VarHandleLongs$FieldInstanceReadWrite.compareAndSet(Ljava/lang/invoke/VarHandle;Ljava/lang/Object;JJ)Z
    78  cold 2040 java/lang/invoke/LambdaForm$DMH/0x0000000024042a60.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;JJ)I

The slow connection time is reflected in the first compilation, but it looks like all subsequent non-AOT compilations are also slower, which suggests that all communication between the JVM and CC is slower. There are also far fewer AOT loads, which I assume is because they're stuck in the queue while the compilation threads on the JVM are waiting on the CC.

Strangely enough, communication between the JVM container and CC container is not particularly slow. I verified this by running two netcat instances in the containers and timing the communication between them. Communication between the JVM and CC is also not particularly slow in the same containers. To verify this I ran java -version in the JVM container and pointed it at the corresponding CC instance and did not see any latency; the connection and compilations were much faster.

The above suggests that the restored JVM in particular is the only one suffering from this problem, but only on this particular OCP instance. OCP instances on other clouds, and even other OCP instances on this cloud don't show the problem. I'll continue to try to isolate the problem.

ymanton commented 8 months ago

Another finding, adding -Xjit:exclude={*} to the restore options results in a startup of 0.7s, which is a slowdown relative to just InstantOn, but an improvement relative to the regression we're tracking here.

The option shuts down the compiler and prevents connections to the remote compiler, and also throws out all of the previously compiled code. Slower startup is to be expected, but the fact that it's faster than with remote compilation enabled is definitely not expected. Remote compilation is somehow slowing things down beyond just the effects of fewer JIT compiled and AOT loaded methods.

ymanton commented 7 months ago

This problem appears to be independent of InstantOn and maybe even Semeru.

If we disable InstantOn and start the server conventionally we still see the same 1-second connection time. With remote compilation once the compiler is idle we close the connection and later re-open if required, and here I can see that these subsequent connections are much faster, so the issue seems to be specifically tied to the first connection.

I would suspect that there is probably nothing specific to Semeru here, and that any such first connection from any other program will show the same problem, but I would have to find a way to test that theory.

 (cold) Compiling java/lang/System.getSysPropBeforePropertiesInitialized(I)Ljava/lang/String;  OrdinaryMethod j9m=0000000000052350 remote t=0 compThreadID=0 memLimit=262144 KB freePhysicalMemory=3866 MB
#JITSTATE:  t=  1010 Changing maxIProfilingCount to 3000
#JITServer: Socket connection timing: getaddrinfo: 3616 us, socket: 16 us, setsockopt: 6 us, connect: 1032264 us addrCount=1
#JITServer: Performed connect() in 1032 ms  delta=1035940 us
#JITServer: SSL connection on socket 0x9, Version: TLSv1.3, Cipher: TLS_AES_256_GCM_SHA384
#JITServer: SSL connection timing: BIO_new_ssl: 43 us, BIO_ctrl: 0 us, SSL_set_fd: 7 us, SSL_connect: 3360 us, get_peer_certificate: 0 us, get_verify_result: 0 us
#JITServer: Client sending compReq seqNo=1 to server for method java/lang/System.getSysPropBeforePropertiesInitialized(I)Ljava/lang/String; @ cold.
#JITServer: t=  1072 Connected to a server (serverUID=7147840746617256867)
#JITServer: Client successfully loaded method java/lang/System.getSysPropBeforePropertiesInitialized(I)Ljava/lang/String; @ cold following compilation request. [metaData=00007F547C814038, startPC=00007F54841C903C]
+ (cold) java/lang/System.getSysPropBeforePropertiesInitialized(I)Ljava/lang/String; @ 00007F54841C903C-00007F54841C9172 OrdinaryMethod - Q_SZ=353 Q_SZI=353 QW=387 j9m=0000000000052350 bcsz=3 JNI remote time=1088608us mem=[region=256 system=2048]KB compThreadID=0 CpuLoad=0%(0%avg) JvmCpu=75% queueTime=1088780us

...

#JITServer: Client sending compReq seqNo=4 to server for method java/net/URL.fromURI(Ljava/net/URI;)Ljava/net/URL; @ cold.
#JITServer: Socket connection timing: getaddrinfo: 5112 us, socket: 30 us, setsockopt: 8 us, connect: 345 us addrCount=1
#JITServer: Performed connect() in 15 ms  delta=5619 us
#JITServer: SSL connection on socket 0xb, Version: TLSv1.3, Cipher: TLS_AES_256_GCM_SHA384
#JITServer: SSL connection timing: BIO_new_ssl: 36 us, BIO_ctrl: 0 us, SSL_set_fd: 6 us, SSL_connect: 41080 us, get_peer_certificate: 2 us, get_verify_result: 0 us
#JITServer: Client sending compReq seqNo=5 to server for method java/net/URL.<init>(Ljava/lang/String;Ljava/lang/String;ILjava/lang/String;Ljava/net/URLStreamHandler;)V @ cold.
#JITServer: Client successfully loaded method java/net/URL.fromURI(Ljava/net/URI;)Ljava/net/URL; @ cold following compilation request. [metaData=00007F547C82ABF8, startPC=00007F54843D8880]
+ (cold) java/net/URL.fromURI(Ljava/net/URI;)Ljava/net/URL; @ 00007F54843D8880-00007F54843D8F4D OrdinaryMethod - Q_SZ=431 Q_SZI=431 QW=488 j9m=00000000000E4318 bcsz=143 remote time=62654us mem=[region=128 system=2048]KB compThreadID=0 CpuLoad=0%(0%avg) JvmCpu=75% queueTime=127334us