prometheus / jmx_exporter

A process for collecting metrics using JMX MBeans for Prometheus consumption
http://prometheus.github.io/jmx_exporter/
Apache License 2.0

Jmx_exporter can cause program to hang if it has a lot of threads #759

Open Selikoff opened 1 year ago

Selikoff commented 1 year ago

Version: jmx_exporter 0.17.2

I noticed that if an application has a very large number of threads (15k or more), jmx_exporter can cause the program to hang. The main thread hangs for as long as the jmx_exporter collection takes to finish (10+ seconds in my tests). I wrote a simple program that reproduces the issue:

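// Class name is arbitrary; the original snippet showed only main().
public class ManyThreads {
    public static void main(String[] args) throws Exception {
        final int count = 15_000;
        final Thread[] threads = new Thread[count];

        // Start 15k threads that do nothing but sleep in a loop.
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                while (true) {
                    try {
                        Thread.sleep(500);
                    } catch (InterruptedException e) {
                        // Ignored on purpose: keep the thread alive.
                    }
                }
            });
            threads[i].start();
        }

        // Main thread prints a timestamp every 100 ms; gaps in the output show a hang.
        while (true) {
            System.out.println("[time=" + System.currentTimeMillis() + "]");
            Thread.sleep(100);
        }
    }
}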

If you run this with the jmx_exporter and call curl http://localhost:123 in the background, the main thread freezes intermittently (about 30% of the time). You might have to adjust some of the timings for the hang to appear.

I traced the source of the delay to the ThreadExports.java class, lines 110-124. There is a filter there that, when enabled, disables JVM_THREADS_STATE / jvm_threads_state. Enabling this filter prevents the issue from happening.
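To illustrate the kind of guard described above, here is a minimal sketch and not the actual code from ThreadExports.java. The class name ThreadStateSketch and the method collectThreadStates are hypothetical, and java.util.function.Predicate stands in for whatever filter type the library uses. The point is only that when the filter rejects jvm_threads_state, the per-thread walk never starts:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a collector; not the real ThreadExports class.
public class ThreadStateSketch {
    static final String JVM_THREADS_STATE = "jvm_threads_state";

    // Counts live threads per state, or returns an empty map if the filter
    // rejects the metric name (the per-thread walk is skipped entirely).
    static Map<Thread.State, Integer> collectThreadStates(Predicate<String> sampleNameFilter) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        if (!sampleNameFilter.test(JVM_THREADS_STATE)) {
            return counts;
        }
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // maxDepth 0: no stack traces are requested, only thread metadata.
        ThreadInfo[] infos = bean.getThreadInfo(bean.getAllThreadIds(), 0);
        for (ThreadInfo info : infos) {
            if (info != null) { // null means the thread exited in the meantime
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Filter accepts everything: thread states are collected.
        System.out.println(collectThreadStates(name -> true));
        // Filter rejects jvm_threads_state: prints {} without touching the threads.
        System.out.println(collectThreadStates(name -> !name.equals(JVM_THREADS_STATE)));
    }
}
```

With 15k threads in the process, the second call should return immediately, while the first has to visit every thread.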

The problem, and the reason I'm reporting this as an issue, is that there's no way to disable just the jvm_threads_state collection in jmx_exporter. All of the rules in the config get executed after the collectors run, not before. I believe the fix would be to pass the filter information down to the HTTPServer.java class. Then, instead of calling metricFamilySamples(), use filteredMetricFamilySamples().

Note: It is possible to exclude a JVM metric in the curl call to the server, e.g. curl http://localhost:123?name[]=my_metric, but this is extremely limited. In particular, you have to select metrics by name. You can't use regex or negation. Put another way, if you have 3,000 metrics and you want to filter out 1, you would have to list the other 2,999 using this technique. Ideally, the solution should be part of the jmx_exporter config.
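To make the limitation concrete, here is a small sketch that builds the scrape URL needed to leave out a single metric using only name[] parameters. The class NameQuerySketch, the method urlExcluding, and the three metric names are illustrative assumptions, not part of jmx_exporter:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; shows why allowlisting by name scales badly.
public class NameQuerySketch {
    // Returns baseUrl with one name[] parameter for every metric except `excluded`.
    static String urlExcluding(String baseUrl, List<String> allNames, String excluded) {
        String query = allNames.stream()
                .filter(n -> !n.equals(excluded))
                .map(n -> "name[]=" + n)
                .collect(Collectors.joining("&"));
        return query.isEmpty() ? baseUrl : baseUrl + "?" + query;
    }

    public static void main(String[] args) {
        List<String> names = List.of("jvm_threads_state", "jvm_threads_current", "my_metric");
        // Prints: http://localhost:123?name[]=jvm_threads_current&name[]=my_metric
        System.out.println(urlExcluding("http://localhost:123", names, "jvm_threads_state"));
    }
}
```

With three metrics the URL is manageable; with 3,000 it would carry 2,999 parameters and would have to be regenerated whenever a metric is added.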

Selikoff commented 1 year ago

Here's some sample output (I added % 100_000 to the print statement in the main loop for readability):

[time=82539]
[time=82652]
[time=82767]
[time=82875]  <--- Moment in which the curl command was called
[time=10397]
[time=10809]
[time=10918]
[time=11027]
[time=11137]

In this sample, calling jmx_exporter locks the main thread for 20 seconds. As mentioned, though, it's not consistent. I'd estimate about 30% of the time depending on your local hardware and number of threads.
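Because the timestamps are printed modulo 100,000 ms, each gap is only known modulo 100 seconds. As a rough check, and assuming the pause was shorter than one wrap, the smallest gap consistent with two printed values can be computed as below. The class WrappedGapSketch and the method smallestGapMs are hypothetical names used only for this illustration:

```java
// Hypothetical helper for reading the wrapped timestamps in the sample output.
public class WrappedGapSketch {
    static final long WRAP_MS = 100_000L; // matches the % 100_000 in the print statement

    // Smallest non-negative gap consistent with two wrapped timestamps.
    static long smallestGapMs(long before, long after) {
        return Math.floorMod(after - before, WRAP_MS);
    }

    public static void main(String[] args) {
        System.out.println(smallestGapMs(82_652, 82_767)); // 115, a normal iteration
        System.out.println(smallestGapMs(82_875, 10_397)); // 27522, the stalled iteration
    }
}
```

Under that assumption, the stalled iteration in this sample took tens of seconds, the same order of magnitude as the figure quoted above, compared with roughly 110 ms for a normal iteration.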

Selikoff commented 1 year ago

Per this issue, I created a Pull Request that offers a fix: https://github.com/prometheus/jmx_exporter/pull/760

I could also have modified code in java_client, such as the HTTPServer.java class, but since this class already offered Predicate<String> sampleNameFilter, I used that instead.

Using the PR with the following config prevents the main thread from locking up while allowing all other metrics to go through:

collectorNamePattern: "^(?!jvm_threads_state$).*$"
rules:
  - pattern: ".*"

Even when the main thread doesn't lock up, it shortens the time to call curl http://localhost:123 from 20 seconds to 1 second in my earlier example.
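For readers unfamiliar with negative lookahead, the following sketch shows how the pattern from the config behaves when used as a name filter. It assumes the pattern is applied as a full match against each metric name; how the PR actually wires it in is not shown here, and the class CollectorNamePatternSketch is a hypothetical name:

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical demo of the regex only; not the PR's implementation.
public class CollectorNamePatternSketch {
    static final Pattern PATTERN = Pattern.compile("^(?!jvm_threads_state$).*$");

    // Accepts every name except the exact string "jvm_threads_state".
    static final Predicate<String> FILTER = name -> PATTERN.matcher(name).matches();

    public static void main(String[] args) {
        System.out.println(FILTER.test("jvm_threads_state"));       // false
        System.out.println(FILTER.test("jvm_threads_current"));     // true
        System.out.println(FILTER.test("jvm_threads_state_extra")); // true
    }
}
```

The trailing $ inside the lookahead is what restricts the exclusion to the exact name, so metrics that merely share the prefix still pass.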