Closed kusumachalasani closed 1 year ago
@kusumachalasani Can you please add links to the results we already have in this repo, as well as pointers to the benchmark scripts to recreate this? Thanks
Using a shared (quicklab) cluster, I collected profiles with different configurations using Cryostat. The profiles show the bottlenecks are Socket I/O and GC pressure (although they are only flagged in orange!)
profile254.zip is a recording with no blocked requests during the run; the throughput rate is 1500 and connections are 1500. profile251.zip is a recording with blocked requests during the run; the throughput rate is 2000 and connections are 1500.
Both of the above profiles use CPU requests and limits set to 3 cores and memory set to 1024M, on OpenJDK 11. The TFB application and the Postgres database run on a single node, with no other applications on that node. In both profiles, JVM + application CPU usage is very low.
profile21.zip - This is a profile where no CPU and memory limits are set (GC pressure is not observed, as expected), but it still shows Socket I/O issues.
I'm planning to use -XX:MaxRAMPercentage to reduce GC pressure where we use memory limits.
@johnaohara Can you please look into the recordings and give your suggestions.
A few changes were made to the configuration based on the discussion with John.
By setting the RAMPercentage values below, the GC pressure warning is no longer observed: `-XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30`
With the above configuration, the maximum CPU consumed is 0.09 cores, with CPU requests and limits set to 2, memory requests and limits set to 1024M, and the load run with 3 threads, a 3000 throughput rate, and 3000 connections. Hyperfoil reports blocked requests when the benchmark is run with more connections / a higher throughput rate.
The profile with the above configuration is profile294.zip
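For reference, a small sketch (a hypothetical helper, not part of the benchmark scripts) of the heap bounds these percentages imply when the JVM sees the 1024M container memory limit:

```python
def heap_bounds_mb(container_limit_mb: float, initial_pct: float, max_pct: float):
    """Heap range the JVM derives from -XX:InitialRAMPercentage and
    -XX:MaxRAMPercentage when the container memory limit is visible to it."""
    return (container_limit_mb * initial_pct / 100.0,
            container_limit_mb * max_pct / 100.0)

init_mb, max_mb = heap_bounds_mb(1024, 30, 80)
print(init_mb, max_mb)  # 307.2 819.2 -> initial heap ~307M, max heap ~819M
```

So with these flags the heap can grow to roughly 80% of the limit, leaving ~20% of the container memory for non-heap usage (metaspace, threads, native buffers).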
More observations of the profile:
Memory limits are reflected in the profile, but CPU limits are not. For the run we set CPU requests and limits to 2 and memory requests/limits to 1024M, but the profile reports a different CPU count (the same as the number of cores on the node).
From the CPU Load graph, jvmUser consumes at most 8%, while machineTotal peaks at 98% during the load run. We need to find what is causing the spikes in machineTotal CPU. During the run, the cluster is running only the TFB benchmark; everything else is the default apps that come with the cluster. The Quarkus application server, the Postgres database server, and the Cryostat instance run on 3 different nodes.
@kusumachalasani Looking at the JFR profile profile294.zip
I do not see anything concerning about the application itself. Application throughput is 99.6% - 98.8% (i.e. the percentage of CPU time spent in mutator threads). As a rule of thumb, for GC tuning we try to keep application throughput above 95%.
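The throughput figure quoted here can be sketched as mutator CPU time over total (mutator + GC) CPU time; the 95% rule of thumb then caps acceptable GC overhead. The numbers below are illustrative, not taken from the recording:

```python
def application_throughput(mutator_cpu_s: float, gc_cpu_s: float) -> float:
    """Fraction of CPU time spent in application (mutator) threads rather than GC."""
    return mutator_cpu_s / (mutator_cpu_s + gc_cpu_s)

# Illustrative: 98.8% throughput means GC took ~1.2% of the CPU time in the window
print(round(application_throughput(98.8, 1.2), 3))   # 0.988
# The >95% rule of thumb caps GC overhead at roughly 5% of total CPU time
print(round(application_throughput(95.0, 5.0), 3))   # 0.95
```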
I still see the executor threads being blocked periodically, all at the same time. One thread was blocked for 28.7s waiting on the database. As the database interactions are all SELECTs with no UPDATEs, we need to look at what is causing the database to stall.
The other issue is CPU usage by processes other than the application. This profile shows the same issue as the previous profiles: there is a background process running on the node that uses 48% of CPU time on average, while the application uses 9.6%. The background process spikes to nearly 100% periodically, and these spikes appear to correspond to when the database connections are blocked.
It is important to understand what is using CPU cycles in the node that is not the application.
I took a profile (profilenone3.zip) when no load is running and the application is idle. I still see CPU spikes in this case, which indicates the spikes are not caused by DB connection blocks. When running with load, the maximum CPU consumed on the worker-1 node (where the Quarkus application server runs) is around 0.9 CPU out of 4. The jdk.machineTotal figure of 98% reported by JFR doesn't make sense, as node CPU utilization under load shows only 25%.
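One way to reconcile those numbers (assuming Prometheus reports usage in cores, so node utilization is cores used divided by cores on the node):

```python
# Hedged sketch: checking Prometheus node usage against JFR's jdk.machineTotal.
node_cores = 4
cores_used_under_load = 0.9            # worker-1 peak usage from Prometheus
node_utilization = cores_used_under_load / node_cores
print(f"{node_utilization:.1%}")       # 22.5% - in line with the ~25% from Grafana
```

This is consistent with the ~25% node utilization seen in Grafana, which is why the 98% machineTotal value from JFR looks wrong for a CPU-limited container.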
From the data in Grafana, I assume the spikes are related to the activity of other default pods that come with the cluster. Below is the list of pods on the worker-1 node (where the Quarkus application server runs) where I see a lot of fluctuation even when no load is running.
community-operators
ingress-canary
ovnkube-node
node-ca
openshift-state-metrics
redhat-marketplace
telemeter-client
When running with load, below are the other pods that consume some CPU besides the application pod. We collect metrics using Prometheus queries during the load run, so the CPU consumed by thanos-querier is expected.
thanos-querier
router
Below are snapshots from Grafana of the CPU usage on all worker nodes.
Node: worker-1 with all the pods
Node: worker-1 with router-default and tfb-server pods
Node: worker-1 with thanos-querier pods
Node: worker-2 with all pods
Node: worker-2 with tfb-database pod
Node: worker-0 with cryostat pod. This is for reference, to check how much CPU is consumed during profiling: a maximum of 0.03 CPU is consumed while profiling.
> Memory limits are considered during the profile, but not the CPU. For the run we set CPU requests and limits to 2 and memory requests/limits to 1024M, but the profile reports a different CPU count (the same as the number of cores on the node).
Reply from the Cryostat team for the above observation: this is a known issue, and new container-aware JFR metrics for CPU/memory/IO are coming in JDK 17: https://bugs.openjdk.java.net/browse/JDK-8203359
Below is data comparing wrk and wrk2 on the TFB benchmark (both Quarkus and Postgres running in an OpenShift cluster).
| Client | CPU usage | Throughput | Response time (ms) | Max response time (ms) |
|---|---|---|---|---|
| wrk2 | 0.88 | 5000 | 1.6 | 742 |
| wrk | 2.1 | 11952 | 185 | 2110 |
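A plausible reason for the gap (an assumption about the two clients, not verified from these runs): wrk2 paces requests at a fixed target rate spread across connections (an open model), while wrk drives every connection as fast as it can (a closed model), saturating the server and inflating latency. At the wrk2 settings used here, each connection is only lightly loaded:

```python
# Hedged sketch of the offered load per connection at the wrk2 settings used above.
rate_rps = 5000             # wrk2 -R, the target throughput
connections = 5000          # wrk2 -c
per_connection_rate = rate_rps / connections
print(per_connection_rate)  # 1.0 -> each connection issues ~1 request/second
```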
Below is data from the TFB benchmark when the Postgres DB runs on an OpenShift cluster vs. bare metal, on both dedicated and shared hardware.
The hyperfoil wrk2 client config used on dedicated hardware, for both the OpenShift and bare-metal Postgres runs, is `-R 5000 -c 5000`.
Dedicated hardware:

| Postgres on | CPU usage | Memory usage | Throughput | Response time (ms) | Max response time (ms) |
|---|---|---|---|---|---|
| openshift | 0.882 | 478 | 5005 | 1.68 | 742.3 |
| baremetal | 0.836 | 456 | 5000.44 | 2.59 | 1470 |
On shared hardware, the hyperfoil wrk2 client config used with OpenShift Postgres is `-R 500 -c 500`, and with VM Postgres it is `-R 50 -c 50`.
Shared hardware:

| Postgres on | CPU usage | Memory usage | Throughput | Response time (ms) | Max response time (ms) |
|---|---|---|---|---|---|
| openshift | 0.249 | 467 | 968.9 | 280.2 | 5300 |
| VM | 0.0347 | 466 | 50 | 509.43 | 2330 |
Closing this for now, as we are using hyperfoil/wrk as the load simulator and no low CPU usage is observed with it.
From the experiments with the TFB benchmark, CPU usage with no blocked connections was ~0.7 cores. Even after increasing the values of different tunables (listed below), there are still blocked connections (per the hyperfoil logs), which means we cannot use that data for performance comparison.
Low CPU usage might be due to various tunables of the benchmark:
From hyperfoil-wrk2: connections, throughput rate, threads
From Quarkus: `-Dquarkus.thread-pool.core.threads`, `-Dquarkus.thread-pool.queue.size`, `-Dquarkus.datasource.jdbc.min-size`, `-Dquarkus.datasource.jdbc.max-size`
Results: For micrometer overhead on the TFB benchmark, results for dedicated hardware are available here, and for shared hardware here. For results of the TFB benchmark with different tunable configurations, check here.
Benchmark: To run the TFB benchmark, please follow these instructions.
TODO: Profile the TFB application using JFR to find the root cause of the low CPU usage.