open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

Runtime-telemetry-Java17 support on OpenJ9 #10443

Open tajila opened 7 months ago

tajila commented 7 months ago

Is your feature request related to a problem? Please describe.

Java17 support includes additional metrics such as CPU_COUNT_METRICS and LOCK_METRICS described here. On the OpenJ9 JVM these metrics are not available, because the runtime-telemetry-java17 support depends on JFR streaming, which is a HotSpot JVM capability.

Describe the solution you'd like

I would like to ask if the runtime-telemetry community is open to working with the OpenJ9 Community to provide support for the metrics introduced in java17. The high level idea would be for OpenJ9 to provide some additional MXBean APIs that supply the required data for the new metrics. These would then be integrated in runtime-telemetry-java17.

Describe alternatives you've considered

One alternative is to create a Runtime-telemetry agent for OpenJ9. But this approach is not preferred.

Additional context

Please let me know if this is not the correct place to raise this request.

trask commented 7 months ago

I would like to ask if the runtime-telemetry community is open to working with the OpenJ9 Community to provide support for the metrics introduced in java17. The high level idea would be for OpenJ9 to provide some additional MXBean APIs that supply the required data for the new metrics. These would then be integrated in runtime-telemetry-java17.

hi @tajila! yes, we would welcome any contributions to improve OpenTelemetry's OpenJ9 support

tajila commented 7 months ago

Okay, thanks @trask

Below is a rough outline of what I am proposing. Much of it is based on my understanding of the existing OTEL support in Java17, which may be incomplete, so I would appreciate your feedback.

BUFFER_METRICS

My understanding is that these are DirectByteBuffer metrics. If so, then the existing BufferPoolMXBean should be able to provide the required data.

- process.runtime.jvm.buffer.count : long java.lang.management.BufferPoolMXBean.getCount()

- process.runtime.jvm.buffer.limit : long java.lang.management.BufferPoolMXBean.getTotalCapacity()

- process.runtime.jvm.buffer.usage : long java.lang.management.BufferPoolMXBean.getMemoryUsed()
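As a minimal sketch of the mapping above, the snippet below reads the three values from every buffer pool the JVM exposes, using only the standard `java.lang.management` API. The class name `BufferPoolSnapshot` and the output format are illustrative and not part of the instrumentation.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, for illustration only.
class BufferPoolSnapshot {

    // Formats the three values proposed for the buffer metrics for one pool.
    static String describe(BufferPoolMXBean pool) {
        return String.format(
            "pool=%s count=%d capacity=%d used=%d",
            pool.getName(),
            pool.getCount(),          // buffer.count
            pool.getTotalCapacity(),  // buffer.limit
            pool.getMemoryUsed());    // buffer.usage
    }

    public static void main(String[] args) {
        // Typically includes pools such as "direct" and "mapped"; names are JVM-specific.
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.println(describe(pool));
        }
    }
}
```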

CLASS_LOAD_METRICS

Likewise with the classloading metrics, one could use the existing ClassLoadingMXBean

- process.runtime.jvm.classes.current_loaded : int java.lang.management.ClassLoadingMXBean.getLoadedClassCount()

- process.runtime.jvm.classes.loaded : long java.lang.management.ClassLoadingMXBean.getTotalLoadedClassCount()

- process.runtime.jvm.classes.unloaded : long java.lang.management.ClassLoadingMXBean.getUnloadedClassCount()
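A rough sketch of reading these three counters through the standard `ClassLoadingMXBean` is shown here. The `ClassLoadSnapshot` holder is a made-up name; it widens the `int` returned by `getLoadedClassCount()` to `long` so that all three values share a type.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical value holder, for illustration only.
class ClassLoadSnapshot {
    final long currentLoaded; // classes.current_loaded
    final long totalLoaded;   // classes.loaded
    final long unloaded;      // classes.unloaded

    ClassLoadSnapshot(long currentLoaded, long totalLoaded, long unloaded) {
        this.currentLoaded = currentLoaded;
        this.totalLoaded = totalLoaded;
        this.unloaded = unloaded;
    }

    // Reads all three counters from the platform ClassLoadingMXBean.
    static ClassLoadSnapshot read() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new ClassLoadSnapshot(
            bean.getLoadedClassCount(),
            bean.getTotalLoadedClassCount(),
            bean.getUnloadedClassCount());
    }

    public static void main(String[] args) {
        ClassLoadSnapshot s = read();
        System.out.println("current=" + s.currentLoaded
            + " total=" + s.totalLoaded + " unloaded=" + s.unloaded);
    }
}
```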

CONTEXT_SWITCH_METRICS

For this one I am thinking we could extend the existing J9 JvmCpuMonitorMXBean with an additional API that returns the context switch rate. Internally, we would poll the number of context switches at a fixed interval, then return a context switch rate per second.

- process.runtime.jvm.cpu.context_switch : float com.ibm.lang.management.JvmCpuMonitorMXBean.getContextSwitchRate() //Not yet implemented
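The polling idea can be sketched independently of any JVM API: keep the previous cumulative count and timestamp, and divide the difference by the elapsed time. Everything below is hypothetical; `RateSampler` is not an OpenJ9 or OpenTelemetry class, and where the cumulative context switch count would come from is left open, since `getContextSwitchRate()` does not exist yet.

```java
// Hypothetical helper: turns a cumulative counter into a per-second rate.
class RateSampler {
    private long lastCount;
    private long lastNanos;
    private boolean primed;

    // Returns events per second since the previous sample.
    // Returns NaN for the first sample or when no time has elapsed.
    synchronized double sample(long cumulativeCount, long nowNanos) {
        if (!primed) {
            primed = true;
            lastCount = cumulativeCount;
            lastNanos = nowNanos;
            return Double.NaN;
        }
        long elapsedNanos = nowNanos - lastNanos;
        long delta = cumulativeCount - lastCount;
        lastCount = cumulativeCount;
        lastNanos = nowNanos;
        if (elapsedNanos <= 0) {
            return Double.NaN;
        }
        return delta * 1_000_000_000.0 / elapsedNanos;
    }
}
```

A caller would invoke `sample(count, System.nanoTime())` from a scheduled task at the chosen interval and report the last non-NaN value.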

CPU_COUNT_METRICS

Here I believe we can use the OperatingSystemMXBean, which has a getProcessingCapacity method. I think the key here is that we take into account any CPU limits if the JVM is being run in a virtualized environment. This API currently doesn't do that on J9, but we do so in other places, so we can add the same treatment here.

- process.runtime.jvm.cpu.limit : int com.ibm.lang.management.OperatingSystemMXBean.getProcessingCapacity()

CPU_UTILIZATION_METRICS

Likewise, we can use the OperatingSystemMXBean to query process CPU load. We can also enhance it to return machine CPU load.

- process.runtime.jvm.cpu.utilization : double com.ibm.lang.management.OperatingSystemMXBean.getProcessCpuLoad()

- process.runtime.jvm.system.cpu.utilization : double com.ibm.lang.management.OperatingSystemMXBean.getMachineCpuLoad() //Not yet implemented
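For comparison, here is a small sketch of querying process CPU load through `com.sun.management.OperatingSystemMXBean`, which declares a `getProcessCpuLoad()` method with the same name. It is used here as a stand-in because the `com.ibm.lang.management` interface is only present on OpenJ9-based JDKs; the `instanceof` guard is there so the code degrades gracefully when the extension interface is missing. `CpuLoadReader` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, for illustration only.
class CpuLoadReader {

    // Returns the recent process CPU load in [0.0, 1.0],
    // or a negative value when the JVM cannot provide it.
    static double processCpuLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        return -1.0; // extension interface not available on this JVM
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println("availableProcessors=" + os.getAvailableProcessors());
        System.out.println("processCpuLoad=" + processCpuLoad());
    }
}
```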

GC_DURATION_METRICS

Here I believe the GarbageCollectorMXBean has the relevant data. It looks like OTEL currently registers a handler that accumulates the GC collection times. The API below works differently in that it returns the cumulative time.

- process.runtime.jvm.gc.duration : long java.lang.management.GarbageCollectorMXBean.getCollectionTime()
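Because `getCollectionTime()` is cumulative, a consumer that wants the time spent since its last poll has to keep the previous reading per collector. The sketch below shows one way to do that; `GcTimeDelta` is a hypothetical class, and it only yields aggregate time per polling interval, not the duration of each individual collection.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: converts cumulative GC time into per-poll deltas.
class GcTimeDelta {
    private final Map<String, Long> lastMillis = new HashMap<>();

    // Returns the milliseconds accumulated since the previous reading for this collector.
    // The first reading returns the full cumulative value.
    long delta(String collector, long cumulativeMillis) {
        if (cumulativeMillis < 0) {
            return 0; // getCollectionTime() returns -1 when the value is undefined
        }
        Long previous = lastMillis.put(collector, cumulativeMillis);
        return previous == null
            ? cumulativeMillis
            : Math.max(0, cumulativeMillis - previous);
    }

    // Polls every registered collector and returns the delta for each one.
    Map<String, Long> poll() {
        Map<String, Long> deltas = new HashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            deltas.put(gc.getName(), delta(gc.getName(), gc.getCollectionTime()));
        }
        return deltas;
    }
}
```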

LOCK_METRICS

Here we can add a method to ThreadMXBean that returns the total lock wait time for a given thread.

- process.runtime.jvm.cpu.longlock : long com.ibm.lang.management.ThreadMXBean.getLockWaitTimes(long tid) //Not yet implemented

MEMORY_ALLOCATION_METRICS

My understanding is that this simply reports the thread-local and non-thread-local object allocation amounts. I propose adding a new API to GarbageCollectorMXBean that returns the cumulative object allocation metrics.

- process.runtime.jvm.memory.allocation : long com.ibm.lang.management.GarbageCollectorMXBean.getTotalObjectMemoryAllocated() //Not yet implemented

NETWORK_IO_METRICS

Here I propose adding a new MXBean that returns the cumulative network IO stats.

- process.runtime.jvm.network.io : long com.ibm.lang.management.NetworkMXBean.getTotalIOBytes() //Not yet implemented

- process.runtime.jvm.network.time : long com.ibm.lang.management.NetworkMXBean.getTotalIOtime() //Not yet implemented

THREAD_METRICS

I believe we can use the existing ThreadMXBean APIs for this.

- process.runtime.jvm.threads.count : int java.lang.management.ThreadMXBean.getThreadCount()

MEMORY_POOL_METRICS

My understanding is that this reports metrics for various JVM components (Java heap, metaspace, JIT code cache, ...). J9's internal memory management differs from HotSpot's, so there may not be a direct parallel for each of these. That being said, I think we can have a similar division where JIT, GC and class memory stats are reported separately.

- process.runtime.jvm.memory.committed : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getTotalMemoryCommitted()

- process.runtime.jvm.memory.init : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getInitialMemoryRequested()

- process.runtime.jvm.memory.limit : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMaxMemoryLimit()

- process.runtime.jvm.memory.usage : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMemoryUsed()

- process.runtime.jvm.memory.usage_after_last_gc : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMemoryUsedAfterLastGC()
//None of the above are currently implemented. 
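For reference, the standard `MemoryPoolMXBean` already exposes the same five values per pool, whatever pools a given JVM chooses to define. The sketch below prints them; it says nothing about how J9 would split its pools, and `MemoryPoolDump` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, for illustration only.
class MemoryPoolDump {

    // Formats init/used/committed/max and usage after the last GC for one pool.
    static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
        if (usage == null) {
            return pool.getName() + ": invalid pool";
        }
        MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported for this pool
        String usedAfterGc = afterGc == null ? "n/a" : Long.toString(afterGc.getUsed());
        // getInit() and getMax() return -1 when the value is undefined.
        return String.format(
            "%s: init=%d used=%d committed=%d max=%d usedAfterLastGc=%s",
            pool.getName(), usage.getInit(), usage.getUsed(),
            usage.getCommitted(), usage.getMax(), usedAfterGc);
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(describe(pool));
        }
    }
}
```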

We will also need a way to differentiate between HotSpot, J9, and older versions of J9 that do not have the enhanced MXBeans. For this I propose adding a property to newer JDKs "org.eclipse.openj9.extendedMXBeanVersion=[1.XX]".
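A sketch of how instrumentation code might check such a property is shown below. It assumes the value has a numeric `major.minor` form, which is only a guess at what "1.XX" means; the property itself is a proposal and is not set by any current JDK. `ExtendedMXBeanSupport` is a hypothetical name.

```java
// Hypothetical helper: checks the proposed version property.
class ExtendedMXBeanSupport {
    // Property name taken from the proposal; not defined by any released JDK.
    static final String PROPERTY = "org.eclipse.openj9.extendedMXBeanVersion";

    // Returns true when the value parses as "major.minor" and is at least the required version.
    // A missing or malformed value is treated as "not supported".
    static boolean supports(String propertyValue, int requiredMajor, int requiredMinor) {
        if (propertyValue == null) {
            return false;
        }
        String[] parts = propertyValue.trim().split("\\.");
        if (parts.length != 2) {
            return false;
        }
        try {
            int major = Integer.parseInt(parts[0]);
            int minor = Integer.parseInt(parts[1]);
            return major > requiredMajor
                || (major == requiredMajor && minor >= requiredMinor);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On HotSpot and on older J9 builds the property is absent, so this prints false.
        System.out.println(supports(System.getProperty(PROPERTY), 1, 0));
    }
}
```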

Please let me know your thoughts and if I've misunderstood any of the OTEL behaviour. I expect we will need to go back and forth to iron out something that will work.

tajila commented 6 months ago

@trask Any thoughts on the next steps?

trask commented 6 months ago

hi @tajila,

My understanding is that these are DirectByteBuffer metrics. If so, then the existing BufferPoolMXBean should be able to provide the required data.

the java17 implementations of some of these metrics are alternative implementations of the java8 versions, see https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java8/library

in general, the java17 "alternative" implementations are disabled by default, see the "Default Enabled" column on https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library

tajila commented 6 months ago

Hi @trask

in general, the java17 "alternative" implementations are disabled by default, see the "Default Enabled" column on https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library

Thanks, this makes more sense.

So if one wanted to add support for Java17 metrics that have no alternate implementation (e.g. jvm.cpu.longlock, jvm.network.io) without using JFR, what would be the best way to do that?

trask commented 6 months ago

you could add them to https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library

the runtime-telemetry-java17 module is for anything that's only supported in Java 17 and later (doesn't have to be JFR-based)