kruize / autotune-results

Recommendations and Results from Autotune

Debug low cpu usage with TechEmpower - Quarkus benchmark. #14

Closed kusumachalasani closed 1 year ago

kusumachalasani commented 3 years ago

From the experiments with the TFB benchmark, the runs with no blocked connections had a CPU usage of ~0.7 core. Even after increasing the values of the different tunables (mentioned below), there are still blocked connections (seen in the Hyperfoil logs), which means we cannot use that data for performance comparison.

Low CPU usage might be due to various reasons.

Tunables of the benchmark:
From hyperfoil-wrk2: connections, throughput-rate, threads
From Quarkus: -Dquarkus.thread-pool.core-threads, -Dquarkus.thread-pool.queue.size, -Dquarkus.datasource.jdbc.min-size, -Dquarkus.datasource.jdbc.max-size

Results: For Micrometer overhead on the TFB benchmark, the results on dedicated hardware are available here and on shared hardware here. For the results of the TFB benchmark with different tunable configurations, check here.

Benchmark: To run the TFB benchmark, please follow these instructions.

TODO: Profile the TFB application using JFR to find the root cause for low cpu usage.
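For reference, one way to capture such a recording without Cryostat is to start JFR directly with jcmd inside the application pod. This is only a sketch; the pod name is a placeholder and it assumes jcmd is available in the image:

```bash
# Start a 2-minute JFR recording for the JVM (PID 1) inside the Quarkus pod
oc exec <tfb-quarkus-pod> -- jcmd 1 JFR.start duration=120s filename=/tmp/profile.jfr

# After the load run completes, copy the recording out of the pod
oc cp <tfb-quarkus-pod>:/tmp/profile.jfr ./profile.jfr
```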

dinogun commented 3 years ago

@kusumachalasani Can you please add links to results that we already have in this repo as well as pointers to the benchmark scripts to recreate this, thanks

kusumachalasani commented 3 years ago

Using a shared (quicklab) cluster, I collected profiles with different configurations using Cryostat. From the profiles, the bottlenecks are Socket I/O and GC Pressure (although they are only flagged in orange!)

profile254.zip is a recording which doesn't have any blocked requests during the run; the throughput rate is 1500 and connections are 1500.
profile251.zip is a recording which has blocked requests during the run; the throughput rate is 2000 and connections are 1500.

Both of the above profiles use a CPU request and limit of 3 cores and memory set to 1024M, on OpenJDK 11. The TFB application and the postgres database are running on a single node, and no other applications are running on that node. In both profiles, I see that the JVM+application usage is very low.

profile21.zip - This is a profile where no CPU and memory limits are set. GC Pressure is not observed (as expected), but the Socket I/O issues remain.

I'm planning to use -XX:MaxRAMPercentage to reduce the GC Pressure in the runs where we set memory limits.
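As a quick sanity check of what a given percentage translates to, the heap sizes the JVM derives can be printed with -XX:+PrintFlagsFinal. The percentage values below are only illustrative:

```bash
# Inside a container with a 1024M memory limit, MaxRAMPercentage=80 caps the heap at ~819 MiB,
# leaving headroom for metaspace, thread stacks and other native memory.
java -XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30 -XX:+PrintFlagsFinal -version \
  | egrep 'InitialHeapSize|MaxHeapSize'
```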

@johnaohara Can you please look into the recordings and give your suggestions.

kusumachalasani commented 3 years ago

A few changes were made to the configuration based on the discussion with John.

  1. Set the executor threads to a maximum of 2 * the number of cores assigned. In the profiles from the above comment, there is a high number of executor threads which are mostly parked / blocked.
  2. Also set the JDBC connections to a minimum, as postgres defaults to 100 connections. Below are the options used to avoid the Socket I/O issue: "-Dquarkus.datasource.jdbc.min-size=4 -Dquarkus.datasource.jdbc.max-size=64 -Dquarkus.datasource.jdbc.initial-size=4 -Dquarkus.thread-pool.max-threads=4 -Dquarkus.thread-pool.core-threads=2"

With the following RAMPercentage values, the GC Pressure warning is no longer observed: "-XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30"
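Taken together, one way to apply both sets of options is through the container environment. This is a minimal sketch: the deployment name tfb-qrh is a placeholder, and it assumes the base image picks JVM arguments up from a JAVA_OPTIONS environment variable (the benchmark scripts may pass them differently):

```bash
# Apply the tuned JVM and Quarkus options to the (hypothetical) deployment
oc set env deployment/tfb-qrh JAVA_OPTIONS="-XX:MaxRAMPercentage=80 -XX:InitialRAMPercentage=30 \
-Dquarkus.datasource.jdbc.min-size=4 -Dquarkus.datasource.jdbc.max-size=64 \
-Dquarkus.datasource.jdbc.initial-size=4 \
-Dquarkus.thread-pool.max-threads=4 -Dquarkus.thread-pool.core-threads=2"
```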

With the above configuration, the maximum CPU consumed is 0.09 core. The run uses CPU requests and limits set to 2, memory requests and limits set to 1024M, and the load is run with 3 threads, a 3000 throughput rate and 3000 connections. There are Hyperfoil blocked requests when the benchmark is run with more connections / a higher throughput rate.
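For reference, a load run matching those numbers would look roughly like the following with Hyperfoil's wrk2 wrapper; this is a hypothetical invocation, and the script path, duration and URL are placeholders:

```bash
# 3 client threads, 3000 connections, 3000 req/s target rate
./wrk2.sh -t 3 -c 3000 -R 3000 -d 180s http://<tfb-quarkus-route>/db
```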

The profile with the above configuration is profile294.zip

More observations of the profile:

johnaohara commented 3 years ago

@kusumachalasani Looking at the JFR profile profile294.zip, I do not see anything concerning about the application itself. Application throughput is 99.6% - 98.8% (i.e. the % of mutator thread CPU time). As a rule of thumb, for GC tuning we try to keep application throughput above 95%.

I still see the executor threads being blocked periodically, and all at the same time. One thread was blocked for 28.7s waiting on the database. As the database interactions are all SELECTs without any UPDATEs, we would need to look at what is causing the database to stall.

The other issue is the CPU usage by processes that are not the application. This profile shows the same issue as the previous profiles, in that there is a background process running on the node that is using 48% of CPU time on average, while the application is using 9.6% of CPU time. The background process spikes up to nearly 100% periodically, and these spikes in background CPU usage appear to correspond to when the database connections are blocked.

It is important to understand what is using CPU cycles on the node other than the application.
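As a side note, a quick way to see which pods and nodes are consuming CPU during the run is the cluster metrics API, for example:

```bash
# Per-pod CPU/memory usage across all namespaces
oc adm top pods --all-namespaces

# Utilization of the node running the Quarkus application
oc adm top node worker-1
```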

kusumachalasani commented 3 years ago

I took a profile, profilenone3.zip, when no load is running and the application is in an idle state. I still see spikes in CPU in this case, which indicates the spikes are not caused by DB connection blocks. When running with load, the maximum CPU consumed on the worker-1 node (where the Quarkus application server is running) is around 0.9 CPU out of 4. The jdk.machineTotal value of 98% reported by JFR doesn't make sense, as the node CPU utilization under load shows only 25%.

From the data in Grafana, I assume the spikes are related to activity in the other default pods that come with the cluster. Below is the list of pods on the worker-1 node (where the Quarkus application server is running) where I see a lot of fluctuation even when no load is running.

community-operators
ingress-canary
ovnkube-node
node-ca
openshift-state-metrics
redhat-marketplace
telemeter-client

When running with load, the pods below are the ones, other than the application pod, that consume some CPU. We are collecting the metrics using Prometheus queries during the load run, so the CPU consumed by thanos-querier is expected (an example query is sketched after the list).

thanos-querier
router
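
The same per-pod CPU numbers can be pulled from Prometheus directly. A sketch of such a query via the Prometheus HTTP API; the route URL is a placeholder and label names can vary between cluster setups:

```bash
# Per-pod CPU usage in cores, averaged over the last 5 minutes
QUERY='sum by (pod) (rate(container_cpu_usage_seconds_total{container!="",pod!=""}[5m]))'
curl -sk -H "Authorization: Bearer $(oc whoami -t)" \
  --data-urlencode "query=${QUERY}" \
  https://<prometheus-route>/api/v1/query
```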

Below are the snapshots from Grafana of the CPU usage on all worker nodes.

Node: worker-1 with all the pods [image]
Node: worker-1 with router-default and tfb-server pods [image]
Node: worker-1 with thanos-querier pods [image]

Node: worker-2 with all pods [image]
Node: worker-2 with tfb-database pod [image]

Node: worker-0 with cryostat pod [image]. This is for reference, to check how much CPU is consumed during profiling; a maximum of 0.03 CPU is consumed while profiling.

kusumachalasani commented 3 years ago
  • Memory limits are taken into account in the profile, but the CPU limit is not. For the run, we set CPU requests and limits to 2 and memory requests/limits to 1024M, but the profile reports the CPU as the number of cores on the node. [image]

Reply from the Cryostat team for the above observation: this is a known issue, and new container-aware JFR metrics for CPU/Memory/IO are coming in JDK 17: https://bugs.openjdk.java.net/browse/JDK-8203359

kusumachalasani commented 3 years ago

Below is the data comparing wrk and wrk2 in the TFB benchmark (both Quarkus and postgres running in an OpenShift cluster).

|      | CPU_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|------|-----------|------------|-----------------------|---------------------------|
| wrk2 | 0.88      | 5000       | 1.6                   | 742                       |
| wrk  | 2.1       | 11952      | 185                   | 2110                      |

kusumachalasani commented 3 years ago

Below is the data for the TFB benchmark when the postgres DB is run on the OpenShift cluster and on bare metal, on both dedicated and shared hardware.

The Hyperfoil wrk2 client config used on dedicated hardware, for both OpenShift and bare-metal postgres, is -R 5000 -c 5000.

Dedicated hardware:

| postgres on | CPU_Usage | Memory_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|-------------|-----------|--------------|------------|-----------------------|---------------------------|
| openshift   | 0.882     | 478          | 5005       | 1.68                  | 742.3                     |
| baremetal   | 0.836     | 456          | 5000.44    | 2.59                  | 1470                      |

On shared hardware, the Hyperfoil wrk2 client config used with OpenShift postgres is -R 500 -c 500, and with VM postgres the config used is -R 50 -c 50.

Shared hardware:

| postgres on | CPU_Usage | Memory_Usage | Throughput | Response time (in ms) | Max response time (in ms) |
|-------------|-----------|--------------|------------|-----------------------|---------------------------|
| openshift   | 0.249     | 467          | 968.9      | 280.2                 | 5300                      |
| VM          | 0.0347    | 466          | 50         | 509.43                | 2330                      |

kusumachalasani commented 1 year ago

Closing this for now, as we are using hyperfoil/wrk as the load simulator and no low CPU usage is observed.