We are also having this problem. We have found it can be so bad that it stops the application from running, in this case a cluster of Trino instances. Do you have any time to work on this? We had to remove it from our production environment.
The core HTTP server code is in the Prometheus client_java library (prometheus-metrics-exporter-httpserver). It will have to expose configuration to allow setting the socket timeout. I still have concerns around the RMI socket timeouts.
I created an issue for the client_java library (https://github.com/prometheus/client_java/issues/924). Once added, the configuration can be exposed via the exporter configuration.
I have tested my application with the latest changes in the main branch of the exporter, which uses the latest version of the Prometheus HTTP server from the client_java repository. With that build, the file descriptor connections stuck in CLOSE_WAIT state are getting cleared. Could you please confirm what changes were made that cause the FDs to be cleared without adding the timeout? When will the official release of the JMX exporter integrated with client_java v1.1.0 be available?
@anilsomisetty I am not aware of any direct changes in client_java v1.1.0 that would have changed the behavior regarding CLOSE_WAIT sockets.
We are targeting a beta release of the main code (due to the major changes made in client_java) in the next 2 weeks.
@dhoard I see that due to the changes made in this commit of the client_java code, https://github.com/prometheus/client_java/commit/1966186deaaa83aec496d88ff604a90795bad688, after the 0.16.0 release the connections stuck in CLOSE_WAIT state are cleared after 2 minutes because they are idle.
I have tested version 0.19.0 of the JMX exporter integrated with the 0.16.0 release of client_java, with this commit's changes included, and have seen that the stuck CLOSE_WAIT connections are closed, which was not the case without this commit's changes.
@anilsomisetty Interesting observation.
The issue with the code in https://github.com/prometheus/client_java/commit/1966186deaaa83aec496d88ff604a90795bad688 is that the ThreadPoolExecutor is unbounded. The code in the latest release should behave the same way.
Is it possible for you to build and test the main branch (unreleased) to see if you can reproduce the issue?
@dhoard I have built the main branch code of the JMX exporter, which uses client_java version v1.1.0, and I'm not able to reproduce the issue; the stuck CLOSE_WAIT connections are getting closed and cleared.
In the latest code I see the ThreadPoolExecutor is bounded by corePoolSize and maxPoolSize, so that makes it bounded, am I right?
@anilsomisetty correct - the ThreadPoolExecutor is now bounded.
Thanks for testing the code in main (unreleased)!
Do you experience the issue with the latest release 0.20.0?
@dhoard yes, the issue persists in the latest release 0.20.0. Since there is a default limit of max 10 threads in this release, once all 10 connections are stuck in CLOSE_WAIT state no new connections can be created.
Could you please tell me when the generally available official release is planned, if the beta is planned in the next 2 weeks?
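A minimal sketch of the unbounded vs. bounded ThreadPoolExecutor distinction discussed above; this is not the actual client_java code, and the pool and queue sizes are example values only:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorBoundsSketch {
    public static void main(String[] args) {
        // Unbounded: a cached pool can grow to Integer.MAX_VALUE threads, so every
        // stuck scrape can pin another thread (and its socket/file descriptor).
        ExecutorService unbounded = Executors.newCachedThreadPool();

        // Bounded: corePoolSize/maximumPoolSize plus a bounded queue cap how many
        // scrape handlers can exist at once; extra work is queued or rejected.
        ExecutorService bounded =
                new ThreadPoolExecutor(
                        1,                              // corePoolSize (example)
                        10,                             // maximumPoolSize (example)
                        120, TimeUnit.SECONDS,          // idle threads above core are reclaimed
                        new LinkedBlockingQueue<>(50)); // bounded work queue (example capacity)

        unbounded.shutdown();
        bounded.shutdown();
    }
}

With a bound in place, stuck CLOSE_WAIT connections exhaust the pool rather than file descriptors, which matches the 0.20.0 behavior described above.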
Hi, we are facing the same issue. Our Java application is using:
-javaagent:jmx_prometheus_javaagent-0.19.0.jar=14102
and we are getting an accumulation of file descriptors stuck in CLOSE_WAIT; we need to restart our application in production once every 2 weeks.
Here are some of the stuck file descriptors:
java 19727 appuser 223u IPv4 1411652589 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:59184 (CLOSE_WAIT)
java 19727 appuser 224u IPv4 1411912918 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:56524 (CLOSE_WAIT)
java 19727 appuser 225u IPv4 1412503559 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:53210 (CLOSE_WAIT)
java 19727 appuser 226u IPv4 1411689546 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:34842 (CLOSE_WAIT)
java 19727 appuser 227u IPv4 1412056270 0t0 TCP ssor-jet135.lpdomain.com:14102->svor-mon28.lpdomain.com:55478 (CLOSE_WAIT)
We have around 20k of those and then need to restart the application.
Is there an expected solution for this soon?
@anilsomisetty @liransliveperson I have updated the main branch to use client_java 1.2.0, which has some fixes around closing the HttpExchange if there is an exception writing the metrics response to the client.
Is it possible for you to build/test main?
@dhoard I'm using the main version; I have created a jar from it and am using it. When trying to load the metrics I'm getting:
An Exception occurred while scraping metrics: java.lang.IllegalArgumentException: Duplicate labels in metric data: {brandId="10138221"}
at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
at io.prometheus.metrics.model.snapshots.MetricSnapshot.
This issue does not happen with jmx_prometheus_javaagent-0.19.0.jar. Could you please advise?
@liransliveperson interesting. The current code in main uses client_java 1.2.0, which is completely new.
I would check your rules. The exception is stating that you are trying to add a duplicate label - label names have to be unique.
@dhoard I have checked my rules; there are no duplicate rules. I used the same rules, once with the main version, which produces the exception, and then with jmx_prometheus_javaagent-0.19.0.jar, which does not produce the exception. The rules are exactly the same in both cases. There is some issue in your main code.
@liransliveperson This appears to be an exporter rule configuration issue that the older version of client_java allowed (but which was invalid). An exporter rule must produce a unique metric (name + labels). This can occur if a rule uses a subset of values as labels.
Can you provide the specific exporter rule and a list of attributes on the MBean?
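A hypothetical sketch of the "subset of values as labels" case described above: the MBean strings and the rule pattern below are invented for illustration and are not taken from this issue's configuration. Two MBeans that differ only in a key the rule does not capture collapse into the same metric name + labels, which the new client_java validation rejects:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubsetLabelsSketch {
    public static void main(String[] args) {
        // Hypothetical flattened MBean strings that differ only in key 1.
        List<String> mbeans = Arrays.asList(
                "metricsPerBrand<0=10138221, 1=Acd, 2=WaitingConvs><>Value",
                "metricsPerBrand<0=10138221, 1=BotAcd, 2=WaitingConvs><>Value");

        // Hypothetical rule: key 1 is wildcarded and only key 0 (brandId) becomes a label.
        Pattern rule = Pattern.compile("metricsPerBrand<0=(\\w+), 1=\\w+, 2=WaitingConvs><>Value");

        Set<String> series = new HashSet<>();
        for (String mbean : mbeans) {
            Matcher matcher = rule.matcher(mbean);
            if (matcher.matches()) {
                String s = "metricsPerBrand_WaitingConvs_Value{brandId=\"" + matcher.group(1) + "\"}";
                // Both MBeans map to the same name + labels, so the second one is a duplicate,
                // which is what "Duplicate labels in metric data" is complaining about.
                if (!series.add(s)) {
                    System.out.println("duplicate series: " + s);
                }
            }
        }
    }
}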
@dhoard we have many rules and attributes in the MBean; it's hard to put them all here. Is there an option to add code to your project so we can see which data points are the duplicates? That would help us find the problem.
@liransliveperson we don't have access to the underlying information in the JMX Exporter, but I feel we could add it to the exception when implementing https://github.com/prometheus/client_java/issues/942
Here is a JUnit test that reproduces the issue:
package io.prometheus.jmx;

import io.prometheus.metrics.model.snapshots.Labels;
import io.prometheus.metrics.model.snapshots.MetricSnapshots;
import io.prometheus.metrics.model.snapshots.UnknownSnapshot;
import org.junit.Before;
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThrows;

public class DuplicateLabels {

    Map<String, UnknownSnapshot.Builder> unknownMap;

    @Before
    public void setUp() {
        unknownMap = new HashMap<>();
    }

    @Test
    public void testDuplicateLabels() {
        // First lookup creates the builder for "test" and adds a data point with {label1="value1"}.
        UnknownSnapshot.Builder unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));

        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(1.12345678)
                        .labels(Labels.of("label1", "value1"))
                        .build());

        unknownMap.put("test", unknownBuilder);

        // Second lookup returns the same builder and adds another data point with the same labels.
        unknownBuilder =
                unknownMap.computeIfAbsent(
                        "test",
                        name ->
                                UnknownSnapshot.builder()
                                        .name("test_metric")
                                        .help("test_metric help"));

        unknownBuilder.dataPoint(
                UnknownSnapshot.UnknownDataPointSnapshot.builder()
                        .value(2.2468)
                        .labels(Labels.of("label1", "value1"))
                        .build());

        unknownMap.put("test2", unknownBuilder);

        MetricSnapshots.Builder result = MetricSnapshots.builder();

        // Building a snapshot that contains two data points with identical labels throws.
        Exception exception =
                assertThrows(
                        IllegalArgumentException.class,
                        () -> {
                            for (UnknownSnapshot.Builder unknown : unknownMap.values()) {
                                UnknownSnapshot unknownSnapshot = unknown.build();
                                result.metricSnapshot(unknownSnapshot);
                            }
                        });

        assertEquals(
                "Duplicate labels in metric data: {label1=\"value1\"}", exception.getMessage());
    }
}
@dhoard is the master branch ready to build with the detailed exception on duplicate labels? I'm trying to build the project, but in the DuplicateLabels class it can't find: import io.prometheus.metrics.model.snapshots.DuplicateLabelsException;
@dhoard here are our server's metrics: duplicate_lables.xlsx. I wasn't able to find any duplicate labels. I put them in an Excel sheet, with every row containing the raw metric label, then went to 'Conditional Formatting' -> 'Highlight Cells Rules' -> 'Duplicate Values', and I can't find any duplicates.
@liransliveperson I have disabled the test. DuplicateLabelsException will be in a new version of client_java, which hasn't been published yet.
Until the new version of client_java is published, the exporter code doesn't have the information needed to provide further detail.
If you want to test some code with the exception displaying more information:
1. Clone https://github.com/prometheus/client_java and build/install it (./mvnw clean install).
2. Build my test branch of the exporter (./mvnw clean package). The jar will be in ./jmx_prometheus_javaagent/target/jmx_prometheus_javaagent-1.0.0-BETA.jar
@dhoard I wasn't able to build https://github.com/prometheus/client_java; it has a test error in testGoodCase(ProcessMetricsTest.java:76)
Hmm... client_java is showing builds passing: https://app.circleci.com/pipelines/github/prometheus/client_java
My test branch requires the client_java libraries built off of main. You will have to resolve the client_java build issues first.
Try running ProcessMetricsTest locally.
What is your OS version? JDK version?
@dhoard I was able to deploy the test version with the detailed exception. The exception says:
An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_Acd_WaitingConvs_Value": {brandId="10138221"}
at io.prometheus.metrics.model.snapshots.MetricSnapshot.validateLabels(MetricSnapshot.java:46)
at io.prometheus.metrics.model.snapshots.MetricSnapshot.
Looking at our metrics, we have multiple "metricsPerBrand_Acd_WaitingConvs_Value" series for different brands:
metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="87947798",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="86122336",} 0.0
metricsPerBrand_Acd_WaitingConvs_Value{brandId="12660041",} 0.0
But only one for brandId 10138221: metricsPerBrand_Acd_WaitingConvs_Value{brandId="10138221",}
This is the YAML rule:
@liransliveperson Can you provide the full stacktrace?
EDIT:
I can add some test debug in my branch.
Can you enable some debug logging by defining either...
JMX_PROMETHEUS_EXPORTER_DEVELOPER_DEBUG=true
or
jmx.prometheus.exporter.developer.debug=true
... and you should get a debug print for every metric that's being added.
// Add to samples.
LOGGER.log(
        FINE,
        "add metric sample: %s %s %s %s",
        matchedRule.name,
        matchedRule.labelNames,
        matchedRule.labelValues,
        value.doubleValue());
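For reference, a minimal sketch (an assumption, not the exporter's exact implementation) of how the two toggles above (the JMX_PROMETHEUS_EXPORTER_DEVELOPER_DEBUG environment variable and the jmx.prometheus.exporter.developer.debug system property) would typically be read:

public class DebugToggleSketch {
    // Hypothetical helper: true if either the environment variable or the system property is "true".
    static boolean developerDebugEnabled() {
        return Boolean.parseBoolean(System.getenv("JMX_PROMETHEUS_EXPORTER_DEVELOPER_DEBUG"))
                || Boolean.getBoolean("jmx.prometheus.exporter.developer.debug");
    }

    public static void main(String[] args) {
        if (developerDebugEnabled()) {
            System.out.println("developer debug logging is enabled");
        }
    }
}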
I was able to print all the labels for a metric. In the jmx_prometheus_exporter.yaml I have only these rules:
pattern: 'metricsPerBrand<0=(\w+), 1=BotAcd, 2=singleConversationDispatch><>Mean'
name: metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean
labels:
  brandId: $1
And it looks like the JMX exporter sees the labels:
An Exception occurred while scraping metrics: io.prometheus.metrics.model.snapshots.DuplicateLabelsException: Duplicate labels for metric "metricsPerBrand_BotAcd_SingleConv_DispatchTimeMs_Mean": {brandId="29422842"} All Labels : {brandId="29422842"},{brandId="29422842"},.
Although in JMX we have the brand 29422842 only once.
@liransliveperson can you enable the debug? (The output will go to wherever system out is captured.) Can you provide your complete YAML file?
Here is the YAML file: jmx_prometheus_exporter_yaml.txt
@liransliveperson can you provide the requested debug output?
@dhoard we found the issue in the rules that caused the duplicate values; we will proceed with the main version in order to solve the open files issue.
@liransliveperson can you provide an update on the CLOSE_WAIT socket issue you were experiencing? Does it still occur using the code in main?
@liransliveperson version 1.0.1 has been released, which should resolve the CLOSE_WAIT socket issue.
I am using Trino, where I am using the JMX exporter (version 0.19) to expose my application metrics over an HTTP port for my Prometheus agent to collect.
I have my Prometheus agent configured to collect metrics from the exposed HTTP port with a specific timeout.
My issue: file descriptors are stuck in CLOSE_WAIT state and their count keeps increasing; after the file descriptor count reaches a certain limit, my application crashes and becomes unresponsive.
In my case, the file descriptor connections opened between the JMX exporter and my Prometheus metric collection agent get stuck in CLOSE_WAIT state because my agent exits after its timeout and the exporter is not able to provide the metrics within that timeout.
The exporter is stuck writing the metrics to the descriptor because the agent reading the metrics on the other side has gone away; the write of the buffered ByteArrayOutputStream response is stuck and the file descriptor never gets closed, at this line of code (a simplified sketch of this write path appears below):
https://github.com/prometheus/client_java/blob/ed0d7ae3b57a3986f6531d1a37db031a331227e6/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L126
My agent collecting the metrics leaves after the specified timeout, but ideally the exporter should detect that the requester has gone away, close the connection, and not get stuck in the write call keeping the FD open.
I understand that once the agent leaves after the timeout, the exporter will not be able to give it the full metrics for that moment and the metrics for that timeframe will be lost, which is fine for me.
Please help me with this.
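To make the write path described above concrete, here is a simplified sketch of the buffered-response pattern at the referenced HTTPServer.java line. It is an illustration built on a plain com.sun.net.httpserver server, not a verbatim copy of the client_java code; the port, handler path, and metric line are example values only:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class StuckWriteSketch {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(14102), 3);
        server.createContext("/metrics", (HttpExchange exchange) -> {
            // The scrape output is rendered into an in-memory buffer first...
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            response.write("example_metric_total 1.0\n".getBytes(StandardCharsets.UTF_8));

            exchange.sendResponseHeaders(200, response.size());
            OutputStream body = exchange.getResponseBody();
            // ...and then copied to the socket. If the scraper has already timed out and
            // stopped reading, this write can block; the close below is never reached and
            // the half-closed socket sits in CLOSE_WAIT, keeping its file descriptor open.
            response.writeTo(body);
            exchange.close();
            // client_java 1.2.0+ adds fixes to close the HttpExchange even when writing the
            // response fails, which lines up with the CLOSE_WAIT descriptors being cleared
            // in the newer builds tested above.
        });
        server.start();
    }
}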