kruize / autotune-results

Recommendations and Results from Autotune

Debug low cpu usage with TechEmpower - Quarkus benchmark. #14

Closed kusumachalasani closed 1 year ago

kusumachalasani commented 3 years ago

From the experiments with the TFB benchmark, the runs with no blocked connections had a CPU usage of ~0.7 core. Even after increasing the values of the different tunables (mentioned below), there are still blocked connections (seen in the Hyperfoil logs), which means we cannot use that data for performance comparison.

Low CPU usage might be due to various reasons.

Tunables of the benchmark:
From hyperfoil-wrk2: connections, throughput-rate, threads
From Quarkus: -Dquarkus.thread-pool.core-threads, -Dquarkus.thread-pool.queue.size, -Dquarkus.datasource.jdbc.min-size, -Dquarkus.datasource.jdbc.max-size

Results: For Micrometer overhead on the TFB benchmark, the results on dedicated hardware are available here and on shared hardware here. For the results of the TFB benchmark with different tunable configurations, check here.

Benchmark: To run the TFB benchmark, please follow these instructions.

TODO: Profile the TFB application using JFR to find the root cause for low cpu usage.
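For reference, one way to capture such a recording without Cryostat is to start JFR directly with jcmd inside the application pod. This is only a sketch; the pod name is a placeholder and it assumes jcmd is available in the image:

```bash
# Start a 2-minute JFR recording for the JVM (PID 1) inside the Quarkus pod
oc exec <tfb-quarkus-pod> -- jcmd 1 JFR.start duration=120s filename=/tmp/profile.jfr

# After the load run completes, copy the recording out of the pod
oc cp <tfb-quarkus-pod>:/tmp/profile.jfr ./profile.jfr
```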

dinogun commented 3 years ago

@kusumachalasani Can you please add links to results that we already have in this repo as well as pointers to the benchmark scripts to recreate this, thanks

kusumachalasani commented 3 years ago

Using a shared (quicklab) cluster, I collected profiles with different configurations using Cryostat. From the profiles, the bottlenecks are Socket I/O and GC Pressure (although they are only flagged in orange!)

profile254.zip is a recording which doesn't have any blocked requests during the run; the throughput rate is 1500 and connections are 1500.
profile251.zip is a recording which has blocked requests during the run; the throughput rate is 2000 and connections are 1500.

Both of the above profiles use a CPU request and limit of 3 cores and memory set to 1024M, on OpenJDK 11. The TFB application and the postgres database are running on a single node, and no other applications are running on that node. In both profiles, I see that the JVM+application usage is very low.

profile21.zip - This is a profile where no CPU and memory limits are set. GC Pressure is not observed (as expected), but the Socket I/O issues remain.

I'm planning to use -XX:MaxRAMPercentage to reduce the GC Pressure in the runs where we set memory limits.
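As a quick sanity check of what a given percentage translates to, the heap sizes the JVM derives can be printed with -XX:+PrintFlagsFinal. The percentage values below are only illustrative:

```bash
# Inside a container with a 1024M memory limit, MaxRAMPercentage=80 caps the heap at ~819 MiB,
# leaving headroom for metaspace, thread stacks and other native memory.
java -XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30 -XX:+PrintFlagsFinal -version \
  | egrep 'InitialHeapSize|MaxHeapSize'
```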

@johnaohara Can you please look into the recordings and give your suggestions.

kusumachalasani commented 3 years ago

A few changes were made to the configuration based on the discussion with John.

  1. Set the executor threads to a maximum of 2 * the number of cores assigned. In the profiles from the above comment, there is a high number of executor threads which are mostly parked / blocked.
  2. Also set the JDBC connections to a minimum, as postgres defaults to 100 connections. Below are the options used to avoid the Socket I/O issue: "-Dquarkus.datasource.jdbc.min-size=4 -Dquarkus.datasource.jdbc.max-size=64 -Dquarkus.datasource.jdbc.initial-size=4 -Dquarkus.thread-pool.max-threads=4 -Dquarkus.thread-pool.core-threads=2"

With the following RAMPercentage values, the GC Pressure warning is no longer observed: "-XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30"
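Taken together, one way to apply both sets of options is through the container environment. This is a minimal sketch: the deployment name tfb-qrh is a placeholder, and it assumes the base image picks JVM arguments up from a JAVA_OPTIONS environment variable (the benchmark scripts may pass them differently):

```bash
# Apply the tuned JVM and Quarkus options to the (hypothetical) deployment
oc set env deployment/tfb-qrh JAVA_OPTIONS="-XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30 \
-Dquarkus.datasource.jdbc.min-size=4 -Dquarkus.datasource.jdbc.max-size=64 \
-Dquarkus.datasource.jdbc.initial-size=4 \
-Dquarkus.thread-pool.max-threads=4 -Dquarkus.thread-pool.core-threads=2"
```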

With the above configuration, the maximum CPU consumed is 0.09 core. The run uses CPU requests and limits set to 2, memory requests and limits set to 1024M, and the load is run with 3 threads, a 3000 throughput rate and 3000 connections. There are Hyperfoil blocked requests when the benchmark is run with more connections / a higher throughput rate.
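For reference, a load run matching those numbers would look roughly like the following with Hyperfoil's wrk2 wrapper; this is a hypothetical invocation, and the script path, duration and URL are placeholders:

```bash
# 3 client threads, 3000 connections, 3000 req/s target rate
./wrk2.sh -t 3 -c 3000 -R 3000 -d 180s http://<tfb-quarkus-route>/db
```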

The profile with the above configuration is profile294.zip

More observations of the profile:

johnaohara commented 3 years ago

@kusumachalasani Looking at the JFR profile profile294.zip, I do not see anything concerning about the application itself. Application throughput is 99.6% - 98.8% (i.e. the % of mutator thread CPU time). As a rule of thumb, for GC tuning we try to keep application throughput above 95%.

I still see the executor threads being blocked periodically, and all at the same time. One thread was blocked for 28.7s waiting on the database. As the database interactions are all SELECTs without any UPDATEs, we would need to look at what is causing the database to stall.

The other issue is the CPU usage by processes that are not the application. This profile shows the same issue as the previous profiles, in that there is a background process running on the node that is using 48% of CPU time on average, while the application is using 9.6% of CPU time. The background process spikes up to nearly 100% periodically, and these spikes in background CPU usage appear to correspond to when the database connections are blocked.

It is important to understand what is using CPU cycles on the node other than the application.
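As a side note, a quick way to see which pods and nodes are consuming CPU during the run is the cluster metrics API, for example:

```bash
# Per-pod CPU/memory usage across all namespaces
oc adm top pods --all-namespaces

# Utilization of the node running the Quarkus application
oc adm top node worker-1
```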

kusumachalasani commented 3 years ago

I took a profile, profilenone3.zip, when no load is running and the application is in an idle state. I still see spikes in CPU in this case, which indicates the spikes are not caused by DB connection blocks. When running with load, the maximum CPU consumed on the worker-1 node (where the Quarkus application server is running) is around 0.9 CPU out of 4. The jdk.machineTotal value of 98% reported by JFR doesn't make sense, as the node CPU utilization under load shows only 25%.

From the data in Grafana, I assume the spikes are related to activity in the other default pods that come with the cluster. Below is the list of pods on the worker-1 node (where the Quarkus application server is running) where I see a lot of fluctuation even when no load is running.

community-operators
ingress-canary
ovnkube-node
node-ca
openshift-state-metrics
redhat-marketplace
telemeter-client

When running with load, the pods below are the ones, other than the application pod, that consume some CPU. We are collecting the metrics using Prometheus queries during the load run, so the CPU consumed by thanos-querier is expected (an example query is sketched after the list).

thanos-querier
router
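
The same per-pod CPU numbers can be pulled from Prometheus directly. A sketch of such a query via the Prometheus HTTP API; the route URL is a placeholder and label names can vary between cluster setups:

```bash
# Per-pod CPU usage in cores, averaged over the last 5 minutes
QUERY='sum by (pod) (rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m]))'
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  --data-urlencode "query=${QUERY}" \
  https://<prometheus-route>/api/v1/query
```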

Below are the snapshots from Grafana of the CPU usage on all worker nodes.

Node: worker-1 with all the pods [image]
Node: worker-1 with router-default and tfb-server pods [image]
Node: worker-1 with thanos-querier pods [image]

Node: worker-2 with all pods [image]
Node: worker-2 with tfb-database pod [image]

Node: worker-0 with cryostat pod [image]. This is for reference, to check how much CPU is consumed during profiling; a maximum of 0.03 CPU is consumed while profiling.

kusumachalasani commented 3 years ago
  • Memory limits are taken into account in the profile, but the CPU limit is not. For the run, we set CPU requests and limits to 2 and memory requests/limits to 1024M, but the profile reports the CPU as the number of cores on the node. [image]

Reply from the Cryostat team for the above observation: this is a known issue, and new container-aware JFR metrics for CPU/Memory/IO are coming in JDK 17: https://bugs.openjdk.java.net/browse/JDK-8203359

kusumachalasani commented 3 years ago

Below is the data comparing wrk and wrk2 in the TFB benchmark (both Quarkus and postgres running in an OpenShift cluster).

|      | CPU_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|------|-----------|------------|-----------------------|---------------------------|
| wrk2 | 0.88      | 5000       | 1.6                   | 742                       |
| wrk  | 2.1       | 11952      | 185                   | 2110                      |

kusumachalasani commented 3 years ago

Below is the data for the TFB benchmark when the postgres DB is run on the OpenShift cluster and on bare metal, on both dedicated and shared hardware.

The Hyperfoil wrk2 client config used on dedicated hardware, for both OpenShift and bare-metal postgres, is -R 5000 -c 5000.

Dedicated hardware:

| postgres on | CPU_Usage | Memory_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|-------------|-----------|--------------|------------|-----------------------|---------------------------|
| openshift   | 0.882     | 478          | 5005       | 1.68                  | 742.3                     |
| baremetal   | 0.836     | 456          | 5000.44    | 2.59                  | 1470                      |

On shared hardware, the Hyperfoil wrk2 client config used with OpenShift postgres is -R 500 -c 500, and with VM postgres the config used is -R 50 -c 50.

Shared hardware:

| postgres on | CPU_Usage | Memory_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|-------------|-----------|--------------|------------|-----------------------|---------------------------|
| openshift   | 0.249     | 467          | 968.9      | 280.2                 | 5300                      |
| VM          | 0.0347    | 466          | 50         | 509.43                | 2330                      |

kusumachalasani commented 1 year ago

Closing this for now, as we are using hyperfoil/wrk as the load simulator and no low CPU usage is observed.