prometheus / jmx_exporter

A process for exposing JMX Beans via HTTP for Prometheus consumption
Apache License 2.0

/metrics Endpoint of Hazelcast Does Not Return Any Data #1006

Open suavebajaj opened 1 week ago

suavebajaj commented 1 week ago

I'm facing an issue where the Hazelcast metrics endpoint (/metrics) does not return any data in one of my Google Kubernetes Engine (GKE) clusters, while it works correctly in the others. The only difference between them is the number of cluster members: the working cluster has 3 members, while the non-working one has 15.

Hazelcast version is 3.7.4

jmx_exporter version is 0.20.0 (running as a Java agent)

Maximum heap is 6 GB (-Xmx6144m)

Expected Behavior: In my working clusters, I can retrieve metrics using the following command:

curl http://127.0.0.1:1099/metrics

This command returns the expected metrics data, such as:

# HELP jmx_config_reload_success_total Number of times configuration have successfully been reloaded.
# TYPE jmx_config_reload_success_total counter
jmx_config_reload_success_total 0.0
...

Observed Behavior: In the non-working cluster, the same command hangs and eventually fails with a connection timeout:

root@hazelcast-0:/# curl -vvv -k http://127.0.0.1:1099/metrics
* Expire in 0 ms for 6 (transfer 0x56f0010850f0)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56f0010850f0)
* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 1099: Connection timed out

Below is the exporter configuration file:

#see: https://github.com/prometheus/jmx_exporter#configuration
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # see "MBean Naming for Hazelcast Data Structures" here: https://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#monitoring-with-jmx
  # example input: "com.hazelcast<instance=_hzInstance_1_dev, name="hz:scheduled", type=HazelcastInstance.ManagedExecutorService><>completedTaskCount"
  - pattern: 'com\.hazelcast<instance=(.*), name=(.*), type=(.*)><>(.*):(.*)'
    labels:
      "hz_instance": "$1"
      "hz_name": "$2"
      "hz_type": "$3"
    name: "hazelcast_$4"
  # Fallback to the default pattern for anything not matching above
  - pattern: '.*'

The Hazelcast configuration:

cat /etc/manh/hazelcast_config.xml
<?xml version="1.0" encoding="UTF-8"?>
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.6.xsd"
       xmlns="http://www.hazelcast.com/schema/config"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <management-center enabled="false">http://localhost:8080/mancenter</management-center>
  <properties>
        <property name="hazelcast.jmx">true</property>
        <property name="hazelcast.rest.enabled">true</property>
  </properties>
  <map name="authserver.user">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>
  <map name="zuulserver.userGrants">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>
  <map name="zuulserver.resources">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>

The running Java process:

root@hazelcast-0:/# ps afx | grep java
    234 pts/0    S+     0:00  \_ grep java
      1 ?        Ssl    9:21 java -javaagent:/data/hazelcast/jmx_prometheus_javaagent-0.20.0.jar=1099:/etc/manh/hazelcast_exporter_config.yml -Xmx6144m -Xss1024k -Dlogging.level.com.manh.cp=DEBUG -Dlogging.level.com.netflix=WARN -Dlogging.level.com.hazelcast.nio.tcp=WARN -XX:+DoEscapeAnalysis -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 -verbose:gc -Xloggc:/mnt/logs/hazelcastserver_G1-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/hazelcastserver_oom.hprof -XX:+DisableExplicitGC -Djavax.net.ssl.trustStore=/mnt/truststore.jks -Deureka.client.registerWithEureka=true -jar /main.jar

Steps Taken:

What additional troubleshooting steps or best practices can help diagnose this issue further?

dhoard commented 1 week ago

@suavebajaj The curl output...

* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out

... indicates a connection issue. Curl isn't connecting to the exporter.

Some common debugging steps:

netstat -tln
nslookup 127.0.0.1
nslookup localhost
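
If it helps, here is a minimal sketch of a probe that separates "nothing is listening" from "the connection attempt got no answer", which curl reports differently but which is easy to miss in verbose output. It assumes Python 3 is available inside the pod, which may not be the case for your image. The function name `probe` and the 3 second timeout are my own choices for illustration, not anything provided by jmx_exporter.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # the TCP handshake completed
    except ConnectionRefusedError:
        return "refused"  # the host answered, but nothing is listening on the port
    except (socket.timeout, TimeoutError):
        return "timeout"  # no answer to the connection attempt within the timeout
    except OSError as exc:
        return f"error: {exc}"  # e.g. name resolution failure


if __name__ == "__main__":
    # 1099 is the port passed to the javaagent in the issue above.
    for host in ("127.0.0.1", "localhost"):
        print(host, 1099, probe(host, 1099))
```

A "refused" result would point at the exporter's HTTP server not being bound to that port at all, while a "timeout" matches what your curl output shows and is worth comparing against the `netstat -tln` listing.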