prometheus / jmx_exporter

A process for exposing JMX Beans via HTTP for Prometheus consumption

File descriptors opened by jmx exporter getting stuck in CLOSE_WAIT state #911

Closed by anilsomisetty 5 months ago

anilsomisetty commented 9 months ago

I am using Trino with JMX Exporter version 0.19.0 to expose my application metrics over an HTTP port for my Prometheus agent to collect.

My Prometheus agent is configured to scrape the exposed HTTP port with a specific timeout.

My issue: file descriptors are stuck in the CLOSE_WAIT state and their count keeps increasing. Once the count reaches a certain limit, my application crashes and becomes unresponsive.

What is happening in my case is that the connections opened between the JMX Exporter and my Prometheus metric collection agent get stuck in CLOSE_WAIT, because the agent gives up after its timeout while the exporter cannot provide the metrics within that time.

The exporter is stuck writing the metrics to the descriptor: the agent reading them has already gone away, the write call (copying the ByteArrayOutputStream contents to the response) blocks, and the file descriptor is never closed, at this line of code:

https://github.com/prometheus/client_java/blob/ed0d7ae3b57a3986f6531d1a37db031a331227e6/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L126

My agent collecting the metrics leaves after the specified timeout, but ideally the exporter should detect that the requester has gone away, close the connection, and not stay blocked in the write call holding the fd open.

I understand that once the agent leaves after the timeout, the exporter can no longer deliver the complete metrics for that moment and that scrape is lost, which is fine for me.
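
For illustration, the failure mode can be reproduced outside the exporter with plain sockets. This is a minimal sketch, not the exporter's code; the class name, port, and timeouts are arbitrary. The reader times out and closes its end, but the writer side never closes its socket, so the accepted socket sits in CLOSE_WAIT.

import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Minimal sketch, not exporter code: the "scraper" times out and closes its
// end, but the "exporter" side is still busy and does not close its socket,
// so the accepted socket stays in CLOSE_WAIT for the remainder of the sleep.
public class CloseWaitSketch {

    public static void main(String[] args) throws Exception {
        ServerSocket listener = new ServerSocket(0);

        // "Exporter" side: accept the connection, then take far too long to respond.
        Thread exporterSide = new Thread(() -> {
            try (Socket accepted = listener.accept()) {
                Thread.sleep(60_000); // simulated slow scrape; no close() until this returns
            } catch (Exception ignored) {
                // interrupted or I/O failure; nothing to do in a sketch
            }
        });
        exporterSide.start();

        // "Prometheus agent" side: connect with a short read timeout, then give up.
        try (Socket scraper = new Socket("localhost", listener.getLocalPort())) {
            scraper.setSoTimeout(1_000);
            scraper.getInputStream().read(); // blocks until the timeout fires
        } catch (SocketTimeoutException expected) {
            // The scraper leaves; closing its socket sends a FIN, and the
            // exporter-side socket transitions to CLOSE_WAIT.
        }

        System.out.println("Inspect with: ss -tan | grep CLOSE-WAIT (or lsof -p <pid>)");
        exporterSide.join();
    }
}

While the sketch sleeps, ss -tan (or lsof against the JVM pid) shows the accepted socket in CLOSE_WAIT, which is the same state the exporter's descriptors are stuck in.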

Please help me with this.

rwilliams-r7 commented 8 months ago

We are also having this problem. We have found it can get so bad that it stops the application from running, in our case a cluster of Trino instances. Do you have any time to work on this? We had to remove it from our production environment.

dhoard commented 8 months ago

The core HTTP server code is in the Prometheus client_java library. It will have to expose configuration to allow setting the socket timeout. I still have concerns around the RMI socket timeouts.

prometheus-metrics-exporter-httpserver

Relevant class: https://github.com/prometheus/client_java/blob/main/prometheus-metrics-exporter-httpserver/src/main/java/io/prometheus/metrics/exporter/httpserver/HTTPServer.java

I created an issue for the client_java library (https://github.com/prometheus/client_java/issues/924). Once added, the configuration can be exposed via the exporter configuration.
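
In the meantime, a possible JVM-level mitigation; this is an assumption on my part based on the JDK's documented sun.net.httpserver properties, not exporter or client_java configuration. The client_java HTTPServer wraps the JDK's built-in com.sun.net.httpserver (as the scrape stack traces in this thread show), and that server honors sun.net.httpserver.maxReqTime and sun.net.httpserver.maxRspTime (in seconds), which close exchanges that take too long to read or write. Values below are illustrative.

// Hedged sketch only: the JDK HttpServer reads these implementation-specific
// properties when its classes initialize and closes exchanges that exceed the
// given number of seconds, which bounds how long a stuck write can hold a
// connection half-open. Values are examples, not recommendations.
public class HttpServerTimeoutFlags {

    public static void main(String[] args) {
        // Equivalent to starting the JVM with:
        //   -Dsun.net.httpserver.maxReqTime=60 -Dsun.net.httpserver.maxRspTime=120
        System.setProperty("sun.net.httpserver.maxReqTime", "60");  // max seconds to read a request
        System.setProperty("sun.net.httpserver.maxRspTime", "120"); // max seconds to write a response

        // ... create and start the com.sun.net.httpserver.HttpServer after this point ...
    }
}

This only caps how long a stuck exchange can live; the proper fix is still exposing a socket timeout in client_java, as tracked in the issue above.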

anilsomisetty commented 8 months ago

I have tested my application with the latest changes available in the main branch of the exporter, which uses the latest version of the Prometheus HTTP server from the client_java repository, and with it my file descriptor connections stuck in CLOSE_WAIT are getting cleared. Could you please confirm what changes were made so that the fds are cleared without adding the timeout? When will the official release of the JMX Exporter integrated with client_java v1.1.0 be available?

dhoard commented 8 months ago

@anilsomisetty I am not aware of any direct changes in client_java v1.1.0 that would have changed the behavior regarding CLOSE_WAIT sockets.

We are targeting a beta release (due to the major changes made in client_java) of the main code in the next 2 weeks.

anilsomisetty commented 8 months ago

@dhoard I see that due to the changes made in this commit of the client_java code https://github.com/prometheus/client_java/commit/1966186deaaa83aec496d88ff604a90795bad688, which is included in the 0.16.0 release, connections stuck in the CLOSE_WAIT state are cleared after 2 minutes because they are idle.

I have tested JMX Exporter 0.19.0 integrated with the 0.16.0 release of client_java (with this commit included) and have seen that the stuck CLOSE_WAIT connections are getting closed, which was not the case without this commit.

dhoard commented 8 months ago

@anilsomisetty Interesting observation.

The issue with the code in https://github.com/prometheus/client_java/commit/1966186deaaa83aec496d88ff604a90795bad688 is that the ThreadPoolExecutor is unbounded.

The code in the latest release should behave the same way.

Is it possible for you to build and test the main branch (unreleased) to see if you can reproduce the issue?

anilsomisetty commented 8 months ago

@dhoard I have built the main branch code of the JMX Exporter, which uses client_java v1.1.0, and I am not able to reproduce the issue; the stuck CLOSE_WAIT connections are getting closed and cleared.

In the latest code I see the ThreadPoolExecutor has been bounded by corePoolSize and maxPoolSize, so that makes it bounded, am I right?

dhoard commented 8 months ago

@anilsomisetty correct - the ThreadPoolExecutor is now bounded.
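
For context, here is a minimal sketch of what a bounded pool on the JDK HttpServer looks like; the sizes, port, path, and handler are illustrative, not the exporter's actual values.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a bounded pool means at most N handler threads exist,
// so N exchanges wedged in blocked writes stop new scrapes from being served,
// whereas an unbounded pool keeps serving scrapes but leaks a thread and a
// file descriptor for every connection stuck in CLOSE_WAIT.
public class BoundedExecutorSketch {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(9404), 0);

        ThreadPoolExecutor bounded =
                new ThreadPoolExecutor(
                        10, 10,                       // fixed pool of 10 handler threads
                        120, TimeUnit.SECONDS,
                        new LinkedBlockingQueue<>()); // excess requests queue instead of spawning threads

        server.createContext(
                "/metrics",
                exchange -> {
                    byte[] body = "demo_metric 1.0\n".getBytes(StandardCharsets.UTF_8);
                    exchange.sendResponseHeaders(200, body.length);
                    exchange.getResponseBody().write(body);
                    exchange.close();
                });

        server.setExecutor(bounded);
        server.start();
    }
}

Bounding the pool caps resource use, but as noted further down in this thread, once all handler threads are stuck no new scrapes are served.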

Thanks for testing the code in main (unreleased)!

Do you experience the issue with the latest release 0.20.0?

anilsomisetty commented 8 months ago

@dhoard yes, the issue persists in the latest release, 0.20.0. Since that release sets a default limit of 10 threads, once all 10 connections are stuck in the CLOSE_WAIT state no new connections can be served.

Could you please tell me when a generally available official release is planned, if the beta is planned in the next 2 weeks?

liransliveperson commented 7 months ago

Hi, we are facing the same issue. Our Java application is using: -javaagent:jmx_prometheus_javaagent-0.19.0.jar=14102 and we are seeing an accumulation of file descriptors stuck in CLOSE_WAIT; we need to restart our application in production once every 2 weeks.

Here are some of the stuck file descriptors:

java 19727 appuser 223u IPv4 1411652589 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:59184 (CLOSE_WAIT)
java 19727 appuser 224u IPv4 1411912918 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:56524 (CLOSE_WAIT)
java 19727 appuser 225u IPv4 1412503559 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:53210 (CLOSE_WAIT)
java 19727 appuser 226u IPv4 1411689546 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:34842 (CLOSE_WAIT)
java 19727 appuser 227u IPv4 1412056270 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:55478 (CLOSE_WAIT)

We have around 20k of those and then need to restart the application.

Is there an expected solution for that soon ?

dhoard commented 7 months ago

@anilsomisetty @liransliveperson I have updated the main branch to use client_java 1.2.0, which has some fixes around closing the HttpExchange if there is an exception writing the metrics response to the client.
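
For reference, the general pattern such fixes follow looks roughly like this; a hedged sketch against the JDK HttpExchange API, not client_java's actual code, with a made-up handler and response body.

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch of the pattern only: always close the HttpExchange, even when writing
// the response fails because the scraper has already gone away, so the
// underlying socket and file descriptor are released instead of lingering in
// CLOSE_WAIT.
public class AlwaysCloseHandler implements HttpHandler {

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        byte[] body = "demo_metric 1.0\n".getBytes(StandardCharsets.UTF_8);
        try {
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body); // may throw if the client closed its end
            }
        } finally {
            exchange.close(); // release the socket even on a failed write
        }
    }
}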

Is it possible for you to build/test main?

liransliveperson commented 7 months ago

@dhoard I'm using the main version; I have created a jar from it and am using it. When trying to load the metrics I'm getting:

An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: Duplicate labels in metric data: {brandId="10138221"}
    at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
    at io.prometheus.metrics.model.snapshots.MetricSnapshot.<init>(MetricSnapshot.java:33)
    at io.prometheus.metrics.model.snapshots.UnknownSnapshot.<init>(UnknownSnapshot.java:21)
    at io.prometheus.metrics.model.snapshots.UnknownSnapshot$Builder.build(UnknownSnapshot.java:130)
    at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:836)
    at io.prometheus.metrics.model.registry.MultiCollector.collect(MultiCollector.java:26)
    at io.prometheus.metrics.model.registry.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
    at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.scrape(PrometheusScrapeHandler.java:112)
    at io.prometheus.metrics.exporter.common.PrometheusScrapeHandler.handleRequest(PrometheusScrapeHandler.java:53)
    at io.prometheus.metrics.exporter.httpserver.MetricsHandler.handle(MetricsHandler.java:43)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

This issue does not happen with jmx_prometheus_javaagent-0.19.0.jar. Could you please advise?

dhoard commented 7 months ago

@liransliveperson interesting. The current code in main uses client_java 1.2.0, which is completely new.

I would check your rules. The exception is stating that two data points of the same metric have an identical label set; the combination of metric name and labels has to be unique.

liransliveperson commented 7 months ago

@dhoard I have checked my rules and there are no duplicate rules. I used the same rules once with the main version, which produces the exception, and then with jmx_prometheus_javaagent-0.19.0.jar, which does not produce the exception. The rules are exactly the same in both cases, so there seems to be some issue in the main code.

dhoard commented 7 months ago

@liransliveperson This appears to be an exporter rule configuration issue that the older version of client_java allowed (but which was invalid). An exporter rule must produce a unique metric (name + labels). This can occur if a rule uses only a subset of the captured values as labels.
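
As a hypothetical illustration (the names and the regex are made up; this is not the exporter's rule engine or your rules): a rule that keeps only a subset of the MBean key properties as labels collapses distinct MBeans onto the same name + label set, which is exactly the duplicate-labels condition.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the "rule" below captures only brandId, dropping the
// type key, so two different MBeans produce the same metric name + label set
// and the second one is a duplicate.
public class SubsetLabelCollapse {

    public static void main(String[] args) {
        String[] mbeans = {
            "metricsPerBrand:type=Acd,brandId=10138221",
            "metricsPerBrand:type=BotAcd,brandId=10138221"
        };

        Pattern rule = Pattern.compile("metricsPerBrand:.*brandId=(\\d+)"); // ignores 'type'
        Map<String, Double> series = new HashMap<>();

        for (String mbean : mbeans) {
            Matcher matcher = rule.matcher(mbean);
            if (matcher.find()) {
                String key = "metricsPerBrand_WaitingConvs_Value{brandId=\"" + matcher.group(1) + "\"}";
                if (series.put(key, 0.0) != null) {
                    System.out.println("Duplicate labels for metric: " + key);
                }
            }
        }
    }
}

Running it prints the duplicated series key once, mirroring the exception reported earlier in this thread.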

Can you provide the specific exporter rule and a list of attributes on the MBean?

liransliveperson commented 7 months ago

@dhoard we have many rules and attributes in the MBeans; it's hard to include them all here. Is there an option to add code to your project that would tell us which data points are the duplicates? That would help us find the problem.

dhoard commented 7 months ago

@liransliveperson we don't have access to the underlying information in the JMX Exporter, but I feel we could add it to the exception when implementing https://github.com/prometheus/client_java/issues/942

dhoard commented 7 months ago

Here is a JUnit test that reproduces the issue:

package io.prometheus.jmx;

import io.prometheus.metrics.model.snapshots.Labels;
import io.prometheus.metrics.model.snapshots.MetricSnapshots;
import io.prometheus.metrics.model.snapshots.UnknownSnapshot;
import org.junit.Before;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThrows;

public class DuplicateLabels {

    Map<String, UnknownSnapshot.Builder> unknownMap;

    @Before
    public void setUp() {
        unknownMap = new HashMap<>();
    }

    @Test
    public void testDuplicateLabels() {
        UnknownSnapshot.Builder unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));
        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(1.12345678)
                        .labels(
                                Labels.of(
                                        "label1", "value1"))
                        .build());

        unknownMap.put("test", unknownBuilder);

        unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));
        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(2.2468)
                        .labels(
                                Labels.of(
                                        "label1", "value1"))
                        .build());

        unknownMap.put("test2", unknownBuilder);

        MetricSnapshots.Builder result = MetricSnapshots.builder();

        Exception exception = assertThrows(IllegalArgumentException.class, () -> {
            for (UnknownSnapshot.Builder unknown : unknownMap.values()) {
                UnknownSnapshot unknownSnapshot = unknown.build();
                result.metricSnapshot(unknownSnapshot);
            }
        });

        assertEquals("Duplicate labels in metric data: {label1=\"value1\"}", exception.getMessage());
    }
}

liransliveperson commented 7 months ago

@dhoard is master ready to build with the detailed exception on duplicate labels? I'm trying to build the project, but in the DuplicateLabels class it can't find: import io.prometheus.metrics.model.snapshots.DuplicateLabelsException;

liransliveperson commented 7 months ago

@dhoard here are our server's metrics: duplicate_lables.xlsx I wasn't able to find any duplicate labels. I put them in an Excel sheet, with every row holding the metric's labels, then went to 'Conditional Formatting' -> 'Highlight cell rules' -> 'Duplicate values', and I can't find any duplicates.

dhoard commented 7 months ago

@liransliveperson I have disabled the test. DuplicateLabelsException will be in a new version of client_java, which hasn't been published yet.

Until the new version of client_java is published, the exporter code doesn't have the information to provide further information.

If you want to test some code with the exception displaying more information...

  1. clone https://github.com/prometheus/client_java and build/install it (./mvnw clean install)
  2. clone my private branch https://github.com/dhoard/jmx_exporter/tree/test-branch and build it (./mvnw clean package)

The jar will be in ./jmx_prometheus_javaagent/target/jmx_prometheus_javaagent-1.0.0-BETA.jar

liransliveperson commented 7 months ago

@dhoard I wasn't able to build https://github.com/prometheus/client_java; it has a test error in: testGoodCase(ProcessMetricsTest.java:76)

dhoard commented 7 months ago

Hmm... client_java is showing builds passing.

https://app.circleci.com/pipelines/github/prometheus/client_java

My test branch requires the client_java libraries built off of main. You will have to resolve the client_java build issues first.

liransliveperson commented 7 months ago

Try to run ProcessMetricsTest locally.

dhoard commented 7 months ago

What is your OS version? JDK version?

liransliveperson commented 7 months ago

@dhoard I was able to deploy the test version with the detailed exception. The exception says:

An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_Acd_WaitingConvs_Value": {brandId="10138221"}
    at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
    at io.prometheus.metrics.model.snapshots.MetricSnapshot.<init>(MetricSnapshot.java:33)

Looking at our metrics, we have multiple "metricsPerBrand_Acd_WaitingConvs_Value" samples for different brands:

metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="87947798",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="86122336",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="12660041",} 0.0

But only one for brandId 10138221: metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",}

This is the Yaml rule:

dhoard commented 7 months ago

@liransliveperson Can you provide the full stacktrace?

EDIT:

I can add some test debug in my branch.

Can you enable some debug logging by either defining...

  1. an environment variable JMX_PROMETHEUS_EXPORTER_DEVELOPER_DEBUG=true

or

  1. a JVM system property jmx.prometheus.exporter.developer.debug=true

... and you should get a debug print for every metric that's being added.

            // Add to samples.
            LOGGER.log(
                    FINE,
                    "add metric sample: %s %s %s %s",
                    matchedRule.name,
                    matchedRule.labelNames,
                    matchedRule.labelValues,
                    value.doubleValue());
liransliveperson commented 7 months ago

I was able to print all the labels for a metric. In jmx_prometheus_exporter.yaml I have only these rules:

And it looks like the JMX Exporter sees duplicate labels:

An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean": {brandId="29422842"}
All labels: {brandId="29422842"}, {brandId="29422842"}

Although in JMX we have the brand 29422842 only once.

dhoard commented 7 months ago

@liransliveperson can you enable the debug? (The output will go to wherever system out is captured.) Can you provide your complete YAML file?

liransliveperson commented 7 months ago

Here is the Yaml file jmx_prometheus_exporter_yaml.txt

dhoard commented 7 months ago

@liransliveperson can you provide the requested debug output?

liransliveperson commented 7 months ago

@dhoard we found the issue in the rules that caused the duplicate values. We will proceed with the main version in order to solve the open files issue.

dhoard commented 6 months ago

@liransliveperson can you provide an update on the CLOSE_WAIT socket issue you were experiencing? Does it still occur using the code in main?

dhoard commented 5 months ago

@liransliveperson version 1.0.1 has been released, which should resolve the CLOSE_WAIT socket issue.