oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org
Other
20.44k stars 1.64k forks source link

[GR-51307] Unable to collect GC data with NotificationEmitter in native build #8237

Open viniciusxyz opened 10 months ago

viniciusxyz commented 10 months ago

A favor I ask of the team

This is actually the second time I have opened this issue, the first was #7803 opened on November 11, 2023, but it was closed maybe because I couldn't explain it clearly, I ask that you please be aware that this issue does not refer to compilation problems and yes, the lack of data in the NotificationEmitter at runtime to collect GC metrics, this occurs in both G1GC and SerialGC in both GraalVM CE and Oracle GraalVM

Describe the problem

I'm developing applications and I'm missing some statistics when using Micrometer, being more specific the details and duration of GC pauses, I started validating the lib code and noticed that the problem is actually that it seems that NotificationEmitter is not working launching the notification events and this only happens in the native image, these metrics are very important for us to be able to put the applications into production, so I would like help to solve this problem.

Steps to reproduce the issue

  1. git clone --depth 1 https://github.com/viniciusxyz/graalvm-issue-7803-notification-emmiter.git
  2. Build app mvn clean package -Pnative
  3. Run app ./target/main-notification-emitter

When running the native image, only the notifier addition log will be displayed, and when running with hotspot, when System.gc() is called, a log is displayed from the emission of the NotificationEmitter event

Describe GraalVM and its environment:

More details

Print execution with hotspot:

image

Print from execution with native compilation:

image

All compilation configuration is in the pom file of the project passed in the example

To prove that the problem does not depend on whether it is graalvm ce or oracle graalvm, the print follows with the same behavior:

image

Currently, as far as I know, there are two main ways to expose information about the garbage collector runtimes while the application is running so that we can view it continuously when we are in Kubernetes, the first is through a javaagent that exports this information to any provider such as Prometheus and another is by adding some lib that sends these metrics to one of these providers, but as far as I've seen both forms depend on notificationEmmiter for updates related to these metrics, without this improvement several applications that monitor the GC via Prometheus + micrometer for example will be left without the data for monitoring related to GC times.

Demo of information visualization in grafana

Without native compilation

image

With native compilation

image


@wirthi @kassifar I reopened the issue, if any details were not clear please let me know

wirthi commented 10 months ago

Hi @viniciusxyz

as already mentioned in the previous ticket,

there are some open issues around the management interfaces where we still need some work.

We are aware of that, but this is currently no priority for us to fix.

fniephaus commented 10 months ago

Maybe this is something @roberttoyonaga likes to look into?

roberttoyonaga commented 10 months ago

Maybe this is something @roberttoyonaga likes to look into?

Hi @fniephaus. I'm happy to look into this eventually, if nobody else picks this up. However, it probably wont be on my to-do list for some time.

fniephaus commented 10 months ago

I'm happy to look into this eventually, if nobody else picks this up

Cool, thanks!

However, it probably wont be on my to-do list for some time.

Typo? 😆 We're not going to complain if it's done asap

roberttoyonaga commented 10 months ago

Typo? 😆 We're not going to complain if it's done asap

oops 😆 I mean it will probably remain on my to-do list for a while

viniciusxyz commented 10 months ago

Hi @viniciusxyz

as already mentioned in the previous ticket,

there are some open issues around the management interfaces where we still need some work.

We are aware of that, but this is currently no priority for us to fix.

I understand this point perfectly, I just found it strange to mark it as complete, even though it is a backlog item I believe it should be maintained because problems with GC happen quite frequently and losing traceability on this is a current negative point of native compilation, but I completely understand not be an immediate priority

codesimplicity commented 6 months ago

🙋 we're also interested in having that added.

In the meantime, did anybody already reach out to Micrometer? It looks like this can be detected: the current GraalVM NotificationEmitter does not support any notification types, especially not the one Micrometer filters for. They could detect and log that as they already do if ManagementFactory.getMemoryPoolMXBeans() returns an empty list.

Update: I've now created a PR to detect and log that: https://github.com/micrometer-metrics/micrometer/pull/5149

viniciusxyz commented 6 months ago

🙋 we're also interested in having that added.

In the meantime, did anybody already reach out to Micrometer? It looks like this can be detected: the current GraalVM NotificationEmitter does not support any notification types, especially not the one Micrometer filters for. They could detect and log that as they already do if ManagementFactory.getMemoryPoolMXBeans() returns an empty list.

Very interesting ! I hadn't paid attention to this detail in Micrometer, I'm going to download the code and do some tests, but for my part I didn't know there was this other way so I didn't open an issue there

codesimplicity commented 6 months ago

Very interesting ! I hadn't paid attention to this detail in Micrometer, I'm going to download the code and do some tests, but for my part I didn't know there was this other way so I didn't open an issue there

Well, it will not really help with the issue that there will be no metrics. But you'd at least get a log message why. I was specifically referring to this code in JvmGcMetrics.

viniciusxyz commented 6 months ago

Very interesting ! I hadn't paid attention to this detail in Micrometer, I'm going to download the code and do some tests, but for my part I didn't know there was this other way so I didn't open an issue there

Well, it will not really help with the issue that there will be no metrics. But you'd at least get a log message why. I was specifically referring to this code in JvmGcMetrics.

I understood. I thought there was some fallback, but if there really isn't, then calling the micrometer team probably won't help much since the problem is the lack of notifications in the native compilation.

It's quite frustrating to have to make monitoring worse to take advantage of the benefits of native images, but this is a necessary choice at the moment as far as I understand.

roberttoyonaga commented 1 month ago

Hi @viniciusxyz and @codesimplicity ! I've submitted a PR adding support for GC notifications (only for serial GC for now). If you have time, please let me know if this solution works for your use case. Thanks! https://github.com/oracle/graal/pull/9799

viniciusxyz commented 1 month ago

@roberttoyonaga As soon as possible I will try to understand how the process of compiling graalvm works locally to test its implementation and I will report back. I really appreciate the work.

roberttoyonaga commented 1 month ago

@roberttoyonaga As soon as possible I will try to understand how the process of compiling graalvm works locally to test its implementation and I will report back. I really appreciate the work.

No problem! Here are some brief steps to get you started with building GraalVM:

You'll need mx. And the latest labsjdk release.

Put mx on the path and set java home to labsjdk: export PATH=/path/to/mx:$PATH export JAVA_HOME=/path/to/labsjdk Navigate to the vm/README.md directory for instructions on how to use mx build command to build GraalVM with the substratevm component.

Once GraalVM is built, you can use mx native-image to create native image executables or just find the native-image binary in the build directory. You'll also be able to find a link to the latest build in graal/vm/latest_graalvm_home

The first few minutes of this video does a good job of explaining this as well: https://youtu.be/3Gh0cz3vjG8?feature=shared&t=202

viniciusxyz commented 1 month ago

@roberttoyonaga Unfortunately I had some problems getting this to work on Windows, I'll try again tomorrow using Linux

viniciusxyz commented 1 month ago

@roberttoyonaga I'm having some problems getting the test done, I can compile the graal using mx --env ce build, but from what I understand from the documentation it doesn't have the native-image tool and the mx native-image apparently does not exist, I appealed and tried to run a mx build --all, but I still have the same problem, can you please guide me on how to create the native-image binary?

image

roberttoyonaga commented 1 month ago

@viniciusxyz maybe try mx --dynamicimports /substratevm build from the vm directory. And then try mx build from the substratevm directory.

viniciusxyz commented 1 month ago

@roberttoyonaga By following the steps I managed to compile it, but the result was the same as opening the issue, I'm using the right branch as you can see below:

image

But the result of the compilation is still using scavenger, probably I must still be doing something wrong, but I don't know what...

image

The steps I did were:

  1. Adjust the mx and labsjdk envs as I downloaded them with the commands below:
export PATH=/home/vvsantos/mx:$PATH
export JAVA_HOME=/home/vvsantos/labsjdk-ce-24-jvmci-b01-debug
  1. Enter the correct graal/vm directory and compile with the command below:
mx --dynamicimports /substratevm build
  1. Go to the substratevm directory and compile with the command below
cd ../substratevm 
mx build
  1. Copy my class to this directory and compile it
javac MainNotificationEmitter.java
mx native-image MainNotificationEmitter
  1. Execute
./mainnotificationemitter

The class used for testing was this

import com.sun.management.GarbageCollectionNotificationInfo;

import javax.management.NotificationEmitter;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class MainNotificationEmitter {

    public static void main(String[] args) throws InterruptedException {
        for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(gcBean instanceof NotificationEmitter notificationEmitter)) {
                continue;
            }
            System.out.println("NotificationEmitter " + gcBean.getName());
            notificationEmitter.addNotificationListener((notification, handback) -> {

                var type = notification.getType();

                System.out.printf("Notification  issued -> %s\n", type);
            }, notification -> notification.getType()
                    .equals(GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION), null);
        }
        System.out.println("Notifier added");
        System.gc();
        Thread.sleep(3_000);
        System.out.println("End");
    }
}

Can you identify if I'm doing something wrong in this flow?

roberttoyonaga commented 1 month ago

Hi @viniciusxyz you need to use the build options shown in the PR description when you execute the Native Image build.

In short, you need to use mx native-image --enable-monitoring=gcnotif MainNotificationEmitter. Everything else you are doing looks correct.

As a side note: To use any of the monitoring/serviceability features in Native Image, you need to pass --enable-monitoring at build time. For example, --enable-monitoring=jfr,nmt,heapdump,gcnotif or --enable-monitoring==all.

If you have any other questions, please don't hesitate to ask!

viniciusxyz commented 1 month ago

@roberttoyonaga I read this yesterday and I had already forgotten about it today :) I managed to test it, I just found it strange that there was an extra notification after the main code had finished executing

image

When we build with mx, is the native-image binary available in the labsjdk bin? I ask because I would like to do a complete test by compiling an application using maven and reachability-metadata (I need it to compile some things) and uploading it to Kubernetes to validate if the micrometer metrics are working correctly with this implementation

roberttoyonaga commented 1 month ago

I just found it strange that there was an extra notification after the main code had finished executing

Oh yes, that's a small bug. When the system gets teared-down in emits another notification. I'll fix that and push an update.

When we build with mx, is the native-image binary available in the labsjdk bin?

No its not in labsjdk. You can find the latest build in the graal repository here: graal/vm/latest_graalvm_home The latest native-image tool is here: graal/vm/latest_graalvm_home/bin/native-image

viniciusxyz commented 1 month ago

@roberttoyonaga I started the integrated tests, basically I have an application that uses micronaut in conjunction with micrometer that uses jvm metrics to launch metrics for prometheus, initially notificationEmmiter seems to be working well since now I can see the pauses associated with the GC, strangely the metrics related to HEAP memory broke, but perhaps it has nothing to do with the change you made, another point that I found strange was the GC pauses that only increased more and more, I will start the application with the GC monitoring flags and compare the data to validate whether they are correct, I will give you a more complete feedback as soon as possible. Below is the image with the GC time metric that is super important for us and that now exists with its change :)

image

roberttoyonaga commented 1 month ago

Hi @viniciusxyz

strangely the metrics related to HEAP memory broke

What do you mean by this? Did the notifications stop containing the memory usage info? Or did the memory usages reported look strange?

I made a very simple test app here: https://gist.github.com/roberttoyonaga/44bfe8cbaa809c102fa5d9cc959f2997 Using it, it looks like the memory usages reported are normal. Also I don't see that the GC durations are increasing.

viniciusxyz commented 1 month ago

@roberttoyonaga I don't know if it was your change, but there seems to have been a change in behavior in the memory metrics related to MemoryPoolMXBean, the screenshot below shows the behavior in the jdk 21 lts that I generally use:

image

Jdk 24 with your PR:

image

Code for test:

https://gist.github.com/viniciusxyz/c0883ec5a499351225bebf9a8ccbc5ea

What I'm doing for now is using memory metrics that are provided by micrometer so I'm testing the functioning of the resources necessary for these 3 classes to work:

JvmHeapPressureMetrics JvmGcMetrics JvmMemoryMetrics

As for measuring GC times, everything really seems to be ok, I made a comparison using -XX:+PrintGC and there doesn't seem to be anything abnormal

viniciusxyz commented 1 month ago

@roberttoyonaga I did a new test and it seems that this metric has disappeared and has no relation to your commit, checking out the previous commit and redoing the build I noticed the same problem, now why does this work in Java 21, but in Java 24 it is like this I don't know. As for the GC metrics themselves, everything seems to be correct.

Test with commit a8efbf7c2981

image

roberttoyonaga commented 1 month ago

Hi @viniciusxyz this is actually expected. Please see the comment here.

The max sizes are always -1 in JDK 24 becuase of this PR https://github.com/oracle/graal/pull/6930 Essentially, the GC only begins collections after the max size has been exceeded. So it's not safe to rely on the max sizes never being exceeded. That's why the max sizes are now always -1.

viniciusxyz commented 1 month ago

@roberttoyonaga I understand, there are some graphs we use that use this maximum to be able to demonstrate in percentage terms the amount of heap used at the moment, but this can be removed without any problems, thank you for the clarification