quarkusio / quarkus


CPU load increased 10x with Quarkus 2.1.* (native image) #19359

Closed holledauer closed 2 years ago

holledauer commented 3 years ago

Describe the bug

When upgrading to any Quarkus 2.1.* version the cpu usage increases 10x (just the base load without any requests).

Expected behavior

There shouldn't be an increase in cpu usage

Actual behavior

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the lower CPU usage is 2.0.2.Final). [image]

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

GraalVM 21.2.0 Java 11 CE

GraalVM version (if different from Java)

GraalVM 21.2.0 Java 11 CE

Quarkus version or git rev

2.1.2.Final

Build tool (i.e. output of mvnw --version or gradlew --version)

Gradle 6.6.1

Additional information

No response

gsmet commented 3 years ago

Do you reproduce the issue with a JVM run? In that case, you could easily get a profile using async-profiler (see https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md for information about it).

Another question: are you building both with the same version of GraalVM?

/cc @johnaohara maybe you can help on this one?

johnaohara commented 3 years ago

@gsmet sure, I'll take a look at this one

@holledauer as Guillaume mentioned, if you can reproduce this in JVM mode, a flamegraph from async-profiler would be really useful. If that is not possible, a reproducer based on your application would also be a great help.

Thanks

holledauer commented 3 years ago

Both were built with GraalVM 21.2.0 Java 11 CE

holledauer commented 3 years ago

In JVM-Mode I can't see a difference in the CPU load.

holledauer commented 3 years ago

Are you able to reproduce it? I think it should be reproducible with any Quarkus app because there is nothing special in my service (it uses Kotlin, it is built with Gradle, and it uses resteasy-reactive).

johnaohara commented 3 years ago

@holledauer I am not able to reproduce this with a simple app:

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7dd58131eec5 quarkus-native-2.1.2.Final 0.00% 12.71MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69
c599c9d0e6a3 quarkus-native-2.0.2.Final 0.00% 12MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69

What extensions do you have installed? E.g. from the console: Installed features: [cdi, kotlin, resteasy-reactive, smallrye-context-propagation]

holledauer commented 3 years ago

I added a minimal app that reproduces the cpu issue.

It is built with ./gradlew build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

quarkuscpu.zip

holledauer commented 3 years ago

Here you can see what happens when I deploy this small sample app on k8s (compared to the full-blown app with Quarkus 2.0). [image]

johnaohara commented 3 years ago

Thanks for sharing, the application is very similar to the app that I tested locally. Are you deploying the application to a vanilla K8s environment? Do you know what container runtime is used in the cluster you are using? Thanks

holledauer commented 3 years ago

AWS EKS with docker-shim and dockerd

NAME: NodeA
STATUS: Ready
ROLES: <none>
AGE: 8d
VERSION: v1.21.2-13+d2965f0db10712
INTERNAL-IP: 10.21.22.131
EXTERNAL-IP: 1.2.3.4
OS-IMAGE: Amazon Linux 2
KERNEL-VERSION: 5.4.129-63.229.amzn2.x86_64
CONTAINER-RUNTIME: docker://19.3.13
ZONE: eu-central-1a
NODEGROUP-IMAGE: ami-06f3aa48664a0203d

requests: cpu 10m, mem 30Mi. limit: mem 30Mi

holledauer commented 3 years ago

still the same with 2.2.0

MikkelVestergaard commented 3 years ago

I have the same issue. Still the same with 2.2.2. I also tested other versions, and 2.0.2 was the most recent version that did not show the issue; I do get the issue with 2.0.3. So something must have changed between 2.0.2 and 2.0.3.

johnaohara commented 3 years ago

I tried recreating this issue on OpenShift 4.7, and I cannot recreate it. I tried with GraalVM CE 21.1 and 21.2. Please see below:

[images: cpu-native-2.0.2.Final, cpu-native-2.1.2.Final, cpu-native-2.2.2.Final, cpu-native-2.2.2.Final-grl-21.2]

I suspect the issue depends on the environment; at present I do not have access to an AWS EKS instance.

One option to investigate this issue would be to try recording CPU samples with JFR for the different quarkus versions: https://www.graalvm.org/reference-manual/native-image/JFR/

johnaohara commented 3 years ago

@MikkelVestergaard

I have the same issue. Still the same with 2.2.2.

Are you also running on AWS EKS, or a different environment?

jaikiran commented 3 years ago

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the lower CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using maybe jstack) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are even being fired/processed. You can then attach those to this issue.

MikkelVestergaard commented 3 years ago

Are you also running on AWS EKS, or a different environment?

I am running on a self-managed Kubernetes cluster (on AWS platform).

holledauer commented 3 years ago

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the lower CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using maybe jstack) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are even being fired/processed. You can then attach those to this issue.

How can I create a thread dump with a native image built with GraalVM CE?

johnaohara commented 3 years ago

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process. $ kill -3 {PID} from within the container will print the stack trace to the container console

holledauer commented 3 years ago

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process. $ kill -3 {PID} from within the container will print the stack trace to the container console

That only works with the Enterprise Edition (see docs)

johnaohara commented 3 years ago

Stack traces work in CE, heap dumps require EE. I think the way that doc is worded is misleading.

holledauer commented 3 years ago

Thread dumps are attached: dump_2_0_2.txt, dump_2_1_2.txt

AndreiYu commented 3 years ago

Hi, is there any update on this? It seems we face the same issue with Quarkus 2.1+ running as a native image in AWS as a Docker container.

jaikiran commented 3 years ago

From the thread dumps attached so far, I don't see anything obvious that looks like an issue.

johnaohara commented 3 years ago

I could not see anything obvious from the stack traces either. I have obtained some AWS credits to try to recreate this issue in the AWS environment.

holledauer commented 3 years ago

Any updates on this?

MikkelVestergaard commented 2 years ago

Any update on this? I tried upgrading to 2.6.3 (running on an AWS EKS cluster) but the issue still exists. I also upgraded from Java 11 to Java 17, but that did not solve the issue either.

gsmet commented 2 years ago

@johnaohara ^

u6f6o commented 2 years ago

Any update on this?

We ran into the same issue as well. Currently our options are to either stick with 2.0.2.Final in native mode or use JVM mode instead. Our services run in k8s on AWS.

The image below shows JVM mode vs. native mode (2.7.1). [image]

gsmet commented 2 years ago

If someone could come up with a small reproducer (a small app reproducing the issue and ideally information about the AWS environment you are using), that would be helpful.

Right now, we are shooting in the dark.

gsmet commented 2 years ago

Hmmm, I just saw that @holledauer already provided some information.

@u6f6o would you be able to do the same? It would be interesting to have information about your setup too, to see if we can find common ground.

johnaohara commented 2 years ago

I have attempted to recreate this issue with the supplied sample application. I have tried on bare metal, OCP 4.7, AWS ECS and minikube 1.23. In these environments, I am unable to recreate the increase in CPU usage that people are seeing.

The graph below is running on AWS ECS;

[image: Screenshot from 2022-02-24 19-37-27]

The different native image versions that I tested were:

A) Quarkus 2.0.2.Final; GraalVM 21.1-java11
B) Quarkus 2.0.3.Final; GraalVM 21.1-java11
C) Quarkus 2.1.2.Final; GraalVM 21.1-java11
D) Quarkus 2.1.2.Final; GraalVM 21.2-java11

After building the images with:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

and for GraalVM 21.2-java11:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-native-image:21.2-java11

The Docker images were built with this Dockerfile:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.5-230
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && chown 1001:root /work
COPY --chown=1001:root build/quarkuscpu-1.0-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

If you would like to test the images in your environment that I used, I have pushed the pre-built images to: https://gallery.ecr.aws/k3j0e3f7/johara-quarkus

Could someone who is experiencing this issue please:

1) Share the Dockerfile that you are using to build the Docker image
2) Test the images that I have published for testing (https://gallery.ecr.aws/k3j0e3f7/johara-quarkus) in your environment to see if you see a CPU increase with those images
3) When you see the increase in CPU usage, is there also an increase in memory usage? I have not seen this behaviour myself, but I am wondering if the container limits are causing increased GC

u6f6o commented 2 years ago

I did some further testing in our k8s cluster with the native images. The images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus did not result in any cpu spikes.

As a next step, I created a small sample project that offers a single RESTEasy hello world controller. I tested this service with different Quarkus versions: 2.0.2, 2.4.2 and 2.7.5. I experienced the same behaviour: starting with versions > 2.0.2, the CPU load was much higher compared to earlier versions.
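To give an idea of how little the sample project contains, a resource along these lines is all it exposes (an illustrative Kotlin sketch; the actual sample code was not attached to this thread, so the class name and path are made up):

import javax.ws.rs.GET
import javax.ws.rs.Path
import javax.ws.rs.Produces
import javax.ws.rs.core.MediaType

@Path("/hello")
class HelloResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    fun hello(): String = "hello"
}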

Since there were no CPU spikes with the vanilla images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus, I tried to get rid of all Quarkus dependencies that are not absolutely mandatory for my sample project, and surprisingly the CPU spikes were gone even for the latest Quarkus version:

[image]

[image]

Probably one of the dependencies is causing the CPU spikes. As a next step, I'll start from the vanilla version again, add one dependency at a time, and check whether it has an impact on the CPU consumption or not.

gsmet commented 2 years ago

@u6f6o thanks for the detective work!

gsmet commented 2 years ago

@u6f6o I would start with the Micrometer extensions if I were you.

gsmet commented 2 years ago

And then the quarkus-logging-ecs extension.

johnaohara commented 2 years ago

@u6f6o thanks for the update, pinning it down to an extension or extensions would be a massive help, thanks

u6f6o commented 2 years ago

There is a good chance that I found the culprit. I took vanilla Quarkus and added quarkus-micrometer, but it had no negative impact. Afterwards I added the io.micrometer:micrometer-registry-prometheus dependency and the CPU spikes appeared again. After that, I added all dependencies except io.micrometer:micrometer-registry-prometheus and the CPU usage did not spike.

[image]
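For reference, the experiment boils down to a build script along these lines (a hypothetical Gradle Kotlin DSL sketch; the actual build file was not shared here, and the BOM coordinates are an assumption):

dependencies {
    implementation(enforcedPlatform("io.quarkus:quarkus-bom:2.7.5.Final")) // assumed platform BOM
    implementation("io.quarkus:quarkus-resteasy-reactive")
    implementation("io.quarkus:quarkus-micrometer")                 // no negative impact on its own
    implementation("io.micrometer:micrometer-registry-prometheus")  // adding this brings the CPU spikes back
}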

gsmet commented 2 years ago

Yeah, that was my natural intuition. Now we need to figure out what's going on :).

u6f6o commented 2 years ago

2.0.2-Final

+--- io.micrometer:micrometer-registry-prometheus -> 1.7.0
|    +--- io.micrometer:micrometer-core:1.7.0 (*)
|    \--- io.prometheus:simpleclient_common:0.10.0
|         \--- io.prometheus:simpleclient:0.10.0

vs. 2.7.5-Final


+--- io.micrometer:micrometer-registry-prometheus -> 1.8.3
|    +--- io.micrometer:micrometer-core:1.8.3 (*)
|    \--- io.prometheus:simpleclient_common:0.12.0
|         \--- io.prometheus:simpleclient:0.12.0
|              +--- io.prometheus:simpleclient_tracer_otel:0.12.0
|              |    \--- io.prometheus:simpleclient_tracer_common:0.12.0
|              \--- io.prometheus:simpleclient_tracer_otel_agent:0.12.0
|                   \--- io.prometheus:simpleclient_tracer_common:0.12.0

Pinning io.micrometer:micrometer-registry-prometheus and io.micrometer:micrometer-core to 1.7.0 instead of 1.8.3 in Quarkus 2.7.5.Final solved the CPU spikes as well. I did not check, though, whether all metrics and the Quarkus integration are still fully working with these versions.

u6f6o commented 2 years ago

We decided to pin the micrometer version to 1.7.0 in the build.gradle file:

configurations.all {
    resolutionStrategy {
        eachDependency { DependencyResolveDetails details ->
            // Force every io.micrometer artifact back to 1.7.0
            if (details.requested.group == 'io.micrometer') {
                details.useVersion "1.7.0"
            }
        }
    }
}

So far we have not seen any issues with this approach (running in production for two days) and the CPU spikes are gone as well. Use at your own risk though :-)
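For projects on the Gradle Kotlin DSL, the equivalent pin would presumably look roughly like this (untested sketch of the same idea):

configurations.all {
    resolutionStrategy.eachDependency {
        // Force every io.micrometer artifact back to 1.7.0, same as the Groovy snippet above
        if (requested.group == "io.micrometer") {
            useVersion("1.7.0")
        }
    }
}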

gsmet commented 2 years ago

@u6f6o even if you don't see a difference in JVM mode, I wonder if it would be interesting to get a flamegraph of what your app is doing. We have some interesting guidance for getting flamegraphs here: https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md

If you end up doing that, you can send the obtained flamegraphs (ideally one without your workaround and one with your Micrometer version pinning workaround, so both code bases are as close as possible) to me at gsmet at redhat dot com.

Thanks!

u6f6o commented 2 years ago

@gsmet: I suppose you are interested in the CPU flame graphs, right? We have kubectl-flame in place, which under the hood uses async-profiler. There is one thing though: I can only use -e itimer for the flame graphs:

If changing the configuration is not possible, you may fall back to -e itimer profiling mode. It is similar to cpu mode, but does not require perf_events support. As a drawback, there will be no kernel stack traces.

Is this a problem? Which versions would you be interested in?

gsmet commented 2 years ago

CPU and allocations would be great if you can get them. Ideally, the working and non-working runs should use versions of everything that are as close as possible, so your proposal looks good if you observe the issue with Quarkus 2.7.5 with Micrometer 1.7.2.

u6f6o commented 2 years ago

I did some more testing in the past few days. The CPU and allocation graphs did not show anything obvious, but I found out that the CPU spikes are related to a change in Micrometer's reflection-config.json. This file was not present before 1.7.2 (or maybe 1.7.1), and building my application with the latest Micrometer version (1.8.x) but without this file also solved the CPU spikes for me.

The issue only seems to show up in our k8s cluster (on AWS) though. Running the same images in local Docker did not cause any CPU spikes. 😩

u6f6o commented 2 years ago

Another workaround that allows you to use the latest Quarkus version (2.8.3.Final in my case) is to disable the file descriptor metrics that cause the CPU spikes:

import io.micrometer.core.instrument.Meter
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.core.instrument.config.MeterFilterReply
import javax.enterprise.inject.Produces
import javax.inject.Singleton

@Singleton
class MicrometerConfig {

    @Produces
    @Singleton
    fun disableFileDescriptorMetrics() = object : MeterFilter {
        // Deny the file descriptor gauges (process.files.open / process.files.max)
        // and keep the default behaviour for all other meters.
        override fun accept(id: Meter.Id): MeterFilterReply =
            if (id.name.startsWith("process.files")) {
                MeterFilterReply.DENY
            } else {
                super.accept(id)
            }
    }
}

IMO this is the nicer approach, as it allows us to use the latest Micrometer libs.
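For what it's worth, the same effect can probably be achieved with Micrometer's built-in filter factory instead of a hand-written anonymous filter (a sketch assuming MeterFilter.denyNameStartsWith matches the process.files.* meters in the same way):

import io.micrometer.core.instrument.config.MeterFilter
import javax.enterprise.inject.Produces
import javax.inject.Singleton

@Singleton
class MicrometerConfig {

    // Same idea as above: drop the process.files.open / process.files.max gauges.
    @Produces
    @Singleton
    fun disableFileDescriptorMetrics(): MeterFilter =
        MeterFilter.denyNameStartsWith("process.files")
}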

gsmet commented 2 years ago

@ebullient @maxandersen I think we need to do something about this.

Executive summary: on AWS, in native, some metrics that were not enabled previously (because of missing native config) are now enabled and they are causing CPU spikes.

ebullient commented 2 years ago

@jonatan-ivanov and @shakuzen ... here.

I've opened the above issue to reference this one.

shakuzen commented 2 years ago

Hello all, I'm one of the Micrometer maintainers. Thank you for all the great investigation into this. For file descriptor metrics, we're using the JMX OperatingSystemMXBean, which in native images up to GraalVM 22.1 results in PosixSubstrateOperatingSystemMXBean being used at runtime: https://github.com/oracle/graal/blob/da6a62d607552fc958ccb63f4ca1d81e1817cadc/substratevm/src/com.oracle.svm.core.posix/src/com/oracle/svm/core/posix/PosixSubstrateOperatingSystemMXBean.java#L66-L85

I suspect there is something about that implementation, combined with the file descriptor configuration or operating system when running on AWS, that is causing it to use a significant amount of CPU. If you are not seeing the issue when running on the JVM, though, I think there's a good chance the issue will be resolved with GraalVM 22.2, where the JVM's OperatingSystemMXBean implementations will be used instead of substrate-specific substitutions - see https://github.com/oracle/graal/pull/4383.

@u6f6o could you try compiling your app with a 22.2 dev build of GraalVM to see if the issue goes away without workarounds?

Unfortunately, I'm not sure there is much we can do in Quarkus or Micrometer to address this, since we don't control the implementation of OperatingSystemMXBean that will be available at runtime.

u6f6o commented 2 years ago

@shakuzen: I tried to build our service with the dev build you mentioned and the CPU spikes vanished: [image]

shakuzen commented 2 years ago

@u6f6o thank you for quickly trying that out. I think that hopefully confirms this will be fixed in the 22.2 GraalVM release, which is scheduled for July. If it is important to fix this without the above-mentioned workarounds prior to the 22.2 release, I would recommend opening an issue in the GraalVM repo. This should be reproducible without Quarkus or Micrometer with something like the following, but I haven't tried deploying it to AWS.

// Standalone reproducer; imports and a class wrapper added here so it compiles as-is.
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FileDescriptorCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        if (operatingSystemMXBean instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOsBean = (UnixOperatingSystemMXBean) operatingSystemMXBean;
            // Print the open/max file descriptor counts every 30 seconds.
            scheduler.scheduleAtFixedRate(() -> {
                System.out.println("File descriptor metrics; open: " + unixOsBean.getOpenFileDescriptorCount() + ", max: " + unixOsBean.getMaxFileDescriptorCount());
            }, 0, 30, TimeUnit.SECONDS);
        } else {
            System.err.println("Cannot get file descriptor metrics on non-unix-like systems.");
        }
    }
}
}