open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

The jmx-metrics java agent can not get JMS queue depth metrics in Liberty JMS server #9811

Closed: xwgao closed this issue 11 months ago

xwgao commented 1 year ago

Describe the bug

I hit a problem when using the jmx-metrics java agent on my Liberty JMS server to capture queue depth metrics. I created a configuration rule YAML file as below:

---
rules:
      - bean: WebSphere:feature=wasJmsServer,type=Queue,name=*
        mapping:
          Depth:
            metric: my.own.jms.queue.depth
            type: updowncounter
            desc: The JMS message queue depth
            unit: "1"
      - bean: java.lang:type=Threading
        mapping:
          ThreadCount:
            metric: my.own.jvm.thread.count
            type: updowncounter
            desc: The current number of threads
            unit: "1"

I then set the env var JAVA_TOOL_OPTIONS to -Dotel.jmx.config=/config/manage/oteljmx/config.yaml -javaagent:path/to/opentelemetry-javaagent.jar. After my Liberty JMS server started, I could not find the queue beans in the bean list (retrieved using the jmxterm tool). Before applying the JMX metrics java agent, however, I could find the queue beans via the jmxterm tool, as below.

#domain = WebSphere:
WebSphere:feature=channelfw,name=wasJmsEndpoint240,type=endpoint
WebSphere:feature=kernel,name=ServerEndpointControl
WebSphere:feature=kernel,name=ServerInfo
WebSphere:feature=wasJmsServer,name=Default.Queue,type=Queue
WebSphere:feature=wasJmsServer,name=Default.Topic.Space,type=Topic
WebSphere:feature=wasJmsServer,name=_PSIMP.TDRECEIVER_8CD8CB4153AB6690,type=Queue
WebSphere:feature=wasJmsServer,name=_SYSTEM.Exception.Destination,type=Queue
WebSphere:feature=wasJmsServer,name=cqinbd,type=Queue
WebSphere:feature=wasJmsServer,name=cqinerrbd,type=Queue
WebSphere:feature=wasJmsServer,name=cqoutbd,type=Queue
WebSphere:feature=wasJmsServer,name=cqouterrbd,type=Queue
WebSphere:feature=wasJmsServer,name=defaultME,type=MessagingEngine
WebSphere:feature=wasJmsServer,name=notfbd,type=Queue
WebSphere:feature=wasJmsServer,name=notferrbd,type=Queue
WebSphere:feature=wasJmsServer,name=sqinbd,type=Queue
WebSphere:feature=wasJmsServer,name=sqoutbd,type=Queue

In addition, the other configuration rule, which maps the ThreadCount attribute of the java.lang:type=Threading bean to the metric my.own.jvm.thread.count, works well: I can see the metric my_own_jvm_thread_count when calling the Prometheus endpoint of my OTel collector, as below. But I cannot see the metric my_own_jms_queue_depth generated for the queue depth of my Liberty queue beans.

# HELP my_own_jvm_thread_count The current number of threads
# TYPE my_own_jvm_thread_count gauge
my_own_jvm_thread_count{job="sheep1030-masdev-jmsserver"} 77 1698823701860

Steps to reproduce

  1. Create a configuration rule yaml file as below:
    ---
    rules:
      - bean: WebSphere:feature=wasJmsServer,type=Queue,name=*
        mapping:
          Depth:
            metric: my.own.jms.queue.depth
            type: updowncounter
            desc: The JMS message queue depth
            unit: "1"
      - bean: java.lang:type=Threading
        mapping:
          ThreadCount:
            metric: my.own.jvm.thread.count
            type: updowncounter
            desc: The current number of threads
            unit: "1"
  2. Specify the env var JAVA_TOOL_OPTIONS to be -Dotel.jmx.config=/config/manage/oteljmx/config.yaml -javaagent:path/to/opentelemetry-javaagent.jar.
  3. Start my Liberty JMS server. After that, I cannot find the queue beans in the bean list (retrieved using the jmxterm tool). Before applying the JMX metrics java agent, however, I could find the queue beans via the jmxterm tool, as below.
    #domain = WebSphere:
    WebSphere:feature=channelfw,name=wasJmsEndpoint240,type=endpoint
    WebSphere:feature=kernel,name=ServerEndpointControl
    WebSphere:feature=kernel,name=ServerInfo
    WebSphere:feature=wasJmsServer,name=Default.Queue,type=Queue
    WebSphere:feature=wasJmsServer,name=Default.Topic.Space,type=Topic
    WebSphere:feature=wasJmsServer,name=_PSIMP.TDRECEIVER_8CD8CB4153AB6690,type=Queue
    WebSphere:feature=wasJmsServer,name=_SYSTEM.Exception.Destination,type=Queue
    WebSphere:feature=wasJmsServer,name=cqinbd,type=Queue
    WebSphere:feature=wasJmsServer,name=cqinerrbd,type=Queue
    WebSphere:feature=wasJmsServer,name=cqoutbd,type=Queue
    WebSphere:feature=wasJmsServer,name=cqouterrbd,type=Queue
    WebSphere:feature=wasJmsServer,name=defaultME,type=MessagingEngine
    WebSphere:feature=wasJmsServer,name=notfbd,type=Queue
    WebSphere:feature=wasJmsServer,name=notferrbd,type=Queue
    WebSphere:feature=wasJmsServer,name=sqinbd,type=Queue
    WebSphere:feature=wasJmsServer,name=sqoutbd,type=Queue

    In addition, the other configuration rule, which maps the ThreadCount attribute of the java.lang:type=Threading bean to the metric my.own.jvm.thread.count, works well: I can see the metric my_own_jvm_thread_count when calling the Prometheus endpoint of my OTel collector, as below. But I cannot see the metric my_own_jms_queue_depth generated for the queue depth of my Liberty queue beans.

    # HELP my_own_jvm_thread_count The current number of threads
    # TYPE my_own_jvm_thread_count gauge
    my_own_jvm_thread_count{job="sheep1030-masdev-jmsserver"} 77 1698823701860

Expected behavior

After applying the jmx-metrics java agent,

  1. I can still find the queue beans from the bean list retrieved using the jmxterm tool.
  2. I can see the metrics my_own_jms_queue_depth generated for the queue depth of my liberty queue beans.

Actual behavior

After applying the jmx-metrics java agent,

  1. I cannot find the queue beans in the bean list retrieved using the jmxterm tool.
  2. I cannot see the metric my_own_jms_queue_depth generated for the queue depth of my Liberty queue beans.

Javaagent or library instrumentation version

1.30.0

Environment

JDK: Eclipse OpenJ9 VM openj9-0.40.0, version 11.0.20+8 (en_US), on WebSphere Application Server 23.0.0.8 (wlp-1.0.80.cl230820230807-0401)
OS: Linux 4.18.0-372.73.1.el8_6.x86_64

Additional context

No response

trask commented 1 year ago

cc @PeterF778 in case you have any ideas

PeterF778 commented 1 year ago

@xwgao Are there any error messages produced by the agent? They go to stderr by default, which WebSphere then redirects to a file. If there are no errors, please try using -Dotel.javaagent.debug=true to get more verbose logging.

This is probably not the cause of this problem, but your rule WebSphere:feature=wasJmsServer,type=Queue,name=* has a wildcard and matches multiple MBeans, so you should report name as a metric attribute. Otherwise the values for the multiple MBeans will clash and make your metric non-deterministic.

xwgao commented 1 year ago

@PeterF778 How can I report name as a metric attribute? Can I use the below configuration rule to do that?

---
rules:
      - bean: WebSphere:feature=wasJmsServer,type=Queue,name=*
        metricAttribute:
          name: param(name)
        mapping:
          Depth:
            metric: my.own.jms.queue.depth
            type: updowncounter
            desc: The JMS message queue depth
            unit: "1"
      - bean: java.lang:type=Threading
        mapping:
          ThreadCount:
            metric: my.own.jvm.thread.count
            type: updowncounter
            desc: The current number of threads
            unit: "1"

Actually, even when I removed the first rule for bean WebSphere:feature=wasJmsServer,type=Queue,name=*, the result was the same: after applying the jmx-metrics java agent, I cannot find the queue beans in the bean list retrieved using the jmxterm tool, and I cannot see the metric my_own_jms_queue_depth generated for the queue depth of my Liberty queue beans.

I've also added -Dotel.javaagent.debug=true; the verbose logs are given in the attached file verbose.log. There are some error messages (the lines starting with [err]). verbose.log

laurit commented 1 year ago

@xwgao Instead of Depth, try using MaxQueueDepth. It seems that metrics with value 0 don't show up in the Prometheus output (though you can see them if you turn on logging in the collector). Using MaxQueueDepth should make this easier for you to debug; change it back after you have verified it works. Try:

      MaxQueueDepth:
        metric: my.own.jms.queue.depth
        type: updowncounter
        desc: The JMS message queue depth
        unit: "1"
        metricAttribute:
          name: param(name)

PeterF778 commented 1 year ago

@xwgao Yes, your updated configuration looks right (as well as the configuration proposed by @laurit ).

From the symptoms, it looks like the MBean responsible for the WebSphere metrics somehow got crippled when queried for matching ObjectNames. It is strange that it did not produce any error messages or exceptions - at least I do not see any in the attached log file. BTW, the lines starting with [err] indicate captured stderr entries, not necessarily actual errors.

Are there any other log files produced by your server?

xwgao commented 1 year ago

@PeterF778 I tried the configuration proposed by @laurit. The result is the same: after applying the jmx-metrics java agent, I cannot find the queue beans in the bean list retrieved using the jmxterm tool, and I cannot see the metric my_own_jms_queue_depth generated. Actually, as I stated, even when I removed the first rule for bean WebSphere:feature=wasJmsServer,type=Queue,name=*, the result was the same.

I found another log file, messages.log, generated by my Liberty server, but it looks the same as the one I sent to you already. messages.log verbose.20231108.log

I found the following messages in the log files; is this related to the issue?

[otel.javaagent 2023-11-07 07:18:36:488 +0000] [main] DEBUG io.opentelemetry.javaagent.tooling.SafeServiceLoader - Unable to load instrumentation class: JVMCFRE199E bad major version 61.0 of class=io/opentelemetry/javaagent/shaded/instrumentation/javaagent/runtimemetrics/java17/Java17RuntimeMetricsInstaller, the maximum supported major version is 55.0; offset=6
...
[otel.javaagent 2023-11-07 07:18:37:308 +0000] [main] DEBUG io.opentelemetry.javaagent.tooling.SafeServiceLoader - Unable to load instrumentation class: JVMCFRE199E bad major version 61.0 of class=io/opentelemetry/javaagent/instrumentation/spring/jms/v6_0/SpringJmsInstrumentationModule, the maximum supported major version is 55.0; offset=6
[otel.javaagent 2023-11-07 07:18:37:349 +0000] [main] DEBUG io.opentelemetry.javaagent.tooling.SafeServiceLoader - Unable to load instrumentation class: JVMCFRE199E bad major version 61.0 of class=io/opentelemetry/javaagent/instrumentation/spring/webmvc/v6_0/SpringWebMvcInstrumentationModule, the maximum supported major version is 55.0; offset=6
laurit commented 1 year ago

Unable to load instrumentation class: JVMCFRE199E bad major version 61.0 is expected: some instrumentations target Java 17, and loading them is expected to fail on JDK 11. To help you we need to reproduce this issue, and we need your help to do that. I tried with wlp-jakartaee10-23.0.0.10 with the default configuration and can see the metric for the default queue. You'll have to narrow your issue down: try a fresh Liberty install with the default configuration and see if it works. If it does not, try with OpenJDK instead of OpenJ9. If you manage to find the combination that fails, describe it with enough detail to allow others to reproduce it.
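A minimal image along the lines laurit suggests might look like this. This is a sketch, not the attached reproducer: the file names, the agent location inside the image, and the config path are assumptions; only the base image tag and the -D flags come from this thread.

```dockerfile
# Hypothetical minimal reproducer: a stock Liberty image plus only the
# OpenTelemetry java agent and the JMX rules file. Paths are assumptions.
FROM ibmcom/websphere-liberty:23.0.0.10-kernel-java17-openj9-ubi
COPY opentelemetry-javaagent.jar /opt/otel/opentelemetry-javaagent.jar
COPY config.yaml /config/oteljmx/config.yaml
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/otel/opentelemetry-javaagent.jar \
    -Dotel.jmx.config=/config/oteljmx/config.yaml \
    -Dotel.javaagent.debug=true"
```

Building and running such an image locally (docker build, then docker run) would let others check whether the WebSphere:* queue MBeans disappear without involving OpenShift at all.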

xwgao commented 11 months ago

@laurit @PeterF778 I tried upgrading my WebSphere Liberty to 23.0.0.10 and Java to 17. Although the error Unable to load instrumentation class: JVMCFRE199E bad major version 61.0 was gone, the result was the same: after applying the jmx-metrics java agent, I cannot find the queue beans via the jmxterm tool and I cannot see the metric my_own_jms_queue_depth.

Attached is the verbose log of my Liberty server after upgrading. verbose.20231114.log

In addition, my Liberty JMS server runs inside a Docker container based on the image ibmcom/websphere-liberty:23.0.0.10-kernel-java17-openj9-ubi.

laurit commented 11 months ago

@xwgao could you provide instructions for reproducing this? Perhaps you could provide a dockerfile that would allow building an image that reproduces the issue?

github-actions[bot] commented 11 months ago

This has been automatically marked as stale because it has been marked as needing author feedback and has not had any activity for 7 days. It will be closed automatically if there is no response from the author within 7 additional days from this comment.

xwgao commented 11 months ago

@laurit The instructions for reproducing this issue are as follows. In an OpenShift 4.12.36 cluster:

  1. Install Community OpenTelemetry Operator v0.89.0 on OperatorHub UI.
  2. Create an OpenTelemetry instrumentation.
  3. Create an OpenTelemetry collector.
  4. Create a ConfigMap as below.
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: otel-jmx-config
      namespace: my-namespace
    data:
      config.yaml: |
        ---
        rules:
          - bean: WebSphere:feature=wasJmsServer,type=Queue,name=*
            metricAttribute:
              name: param(name)
            mapping:
              Depth:
                metric: my.own.jms.queue.depth
                type: updowncounter
                desc: The JMS message queue depth
                unit: "1"
          - bean: java.lang:type=Threading
            mapping:
              ThreadCount:
                metric: my.own.jvm.thread.count
                type: updowncounter
                desc: The current number of threads
                unit: "1"
  5. Create a StatefulSet which mounts the ConfigMap and has the annotation instrumentation.opentelemetry.io/inject-java: instrumentation as below.
    kind: StatefulSet
    apiVersion: apps/v1
    metadata:
      name: wlp-jmsserver-statefulset
      namespace: my-namespace
    ...
    spec:
      replicas: 1
      ...
      template:
        metadata:
          annotations:
            instrumentation.opentelemetry.io/inject-java: instrumentation
            ...
        spec:
          ...
          containers:
            - name: jmsserver
              command:
                - /bin/bash
                - '-c'
                - '--'
              env:
                ...
                - name: JAVA_TOOL_OPTIONS
                  value: ' -Dotel.jmx.config=/config/manage/oteljmx/config.yaml -Dotel.javaagent.debug=true'
              ports:
                - containerPort: 9080
                  protocol: TCP
                - containerPort: 9443
                  protocol: TCP
                - containerPort: 7276
                  protocol: TCP
                - containerPort: 7286
                  protocol: TCP
              imagePullPolicy: IfNotPresent
              volumeMounts:
                ...
                - name: otel-jmx-config
                  mountPath: /config/manage/oteljmx
              ...
              args:
                - >-
                  echo copy additional jvm options is exists; /config/fips-check.sh;
                  cp /opt/ibm/wlp/usr/servers/defaultServer/manage/jvmoptions/jvm.options
                  /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/.;
                  echo generateKeyStore...; openssl pkcs12 -export -in
                  /etc/ssl/certs/internal-manage-tls/tls.crt -inkey
                  /etc/ssl/certs/internal-manage-tls/tls.key -out ${java_keystore}
                  -password pass:${java_keystore_password}; /tmp/startwlp.sh;
          volumes:
            ...
            - name: otel-jmx-config
              configMap:
                name: otel-jmx-config
                defaultMode: 420
          ...
    ...
  6. Then after the pod of my StatefulSet is running, check the pod log for the WebSphere Liberty JMS server logs.

The image used in my StatefulSet can be built with the Dockerfile contained in the attached wlp-jms-image.zip file. Unzip the attached file and run docker build -t <image_tag> . within the extracted folder. Since the base image is ibmcom/websphere-liberty:23.0.0.10-kernel-java17-openj9-ubi, you need to run docker login docker.io before running the docker build command. wlp-jms-image.zip

Please feel free to let me know if you meet any problem or need any further information. Thanks a lot.

laurit commented 11 months ago

@xwgao Is OpenShift really needed? Would it still reproduce if you just built the image and copied the agent into it? Having a simple reproducer that requires minimal effort to set up greatly increases the chances of someone trying it out.

xwgao commented 11 months ago

@laurit Yes, it should be still reproducible if you'd just build the image and copy the agent into it. Thanks.

xwgao commented 11 months ago

@laurit Are you able to reproduce this by building the image and copying the agent jar into it? Please feel free to let me know if you meet any problems or need any help. Thanks a lot.

xwgao commented 11 months ago

@laurit @PeterF778 Is there any update for this issue? Thanks a lot.

trask commented 11 months ago

@xwgao can you create a smaller example that can be run locally in docker?

xwgao commented 11 months ago

@trask Which example do you need? You can start the WebSphere Liberty server via the /tmp/startwlp.sh (contained in the docker image already) when running locally in docker.

xwgao commented 11 months ago

@trask @laurit Is there any update for this issue? Thanks a lot.

laurit commented 11 months ago

@xwgao Try using JVM_ARGS instead of JAVA_TOOL_OPTIONS, or add -Djavax.management.builder.initial=com.ibm.ws.kernel.boot.jmx.internal.PlatformMBeanServerBuilder to the JVM arguments. The problem can be reproduced without the OTel agent by adding -Dcom.sun.management.jmxremote to JAVA_TOOL_OPTIONS. That property is set by Liberty's ws-javaagent.jar agent: https://github.com/OpenLiberty/open-liberty/blob/f961357954b2610a98f1f4b5b6d5d416eecd1ea0/dev/com.ibm.ws.kernel.instrument.agent/src/com/ibm/ws/kernel/instrument/BootstrapAgent.java#L123 I think the issue here is that when our agent is in JAVA_TOOL_OPTIONS, it runs before ws-javaagent.jar.
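The ordering constraint behind this can be seen in a few lines of plain Java: the platform MBeanServer is created exactly once, on first touch, using whatever javax.management.builder.initial value is set at that moment. A value set afterwards is silently ignored, which is why the agent that initializes JMX first wins. This is a standalone sketch of the standard JMX mechanism, not Liberty-specific code:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

public class BuilderOrderingDemo {
    public static void main(String[] args) {
        // First touch creates and caches the platform MBeanServer using the
        // builder configured at this moment (here: the JDK default).
        MBeanServer first = ManagementFactory.getPlatformMBeanServer();

        // Setting the builder property afterwards has no effect: the cached
        // server is returned and the property is never consulted again.
        // Liberty's ws-javaagent.jar must therefore set this property before
        // anything else (e.g. another javaagent) touches the platform server.
        System.setProperty("javax.management.builder.initial",
                "com.ibm.ws.kernel.boot.jmx.internal.PlatformMBeanServerBuilder");
        MBeanServer second = ManagementFactory.getPlatformMBeanServer();

        System.out.println("same server: " + (first == second));
    }
}
```

If the property were honored on the second call, it would fail with a ClassNotFoundException here; instead the cached default server is returned, mirroring how a too-late PlatformMBeanServerBuilder never gets to register the WebSphere:* MBeans.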

xwgao commented 11 months ago

@laurit It works after adding -Djavax.management.builder.initial=com.ibm.ws.kernel.boot.jmx.internal.PlatformMBeanServerBuilder into JAVA_TOOL_OPTIONS. Thanks a lot!
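For reference, the working setting combines the flags used earlier in this thread. This assumes the agent is passed via JAVA_TOOL_OPTIONS as in the original report; when the OpenTelemetry Operator injects the agent (as in the StatefulSet above), only the -D flags are needed:

```shell
export JAVA_TOOL_OPTIONS="-javaagent:path/to/opentelemetry-javaagent.jar \
  -Dotel.jmx.config=/config/manage/oteljmx/config.yaml \
  -Djavax.management.builder.initial=com.ibm.ws.kernel.boot.jmx.internal.PlatformMBeanServerBuilder"
```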

xwgao commented 11 months ago

I can get my queue depth metric my_own_jms_queue_depth generated by following @laurit's comment above. This issue is resolved now, so I am closing it.

PeterF778 commented 11 months ago

Thank you, @laurit , for taking the time and effort to investigate this ticket.