tajila opened this issue 9 months ago
I would like to ask if the runtime-telemetry community is open to working with the OpenJ9 Community to provide support for the metrics introduced in java17. The high level idea would be for OpenJ9 to provide some additional MXBean APIs that supply the required data for the new metrics. These would then be integrated in runtime-telemetry-java17.
hi @tajila! yes, we would welcome any contributions to improve OpenTelemetry's OpenJ9 support
Okay, thanks @trask
Below is a rough outline of what I am proposing. Much of it is based on my understanding of the existing OTEL support for Java 17, which may be incomplete, so I would appreciate your feedback.
My understanding is that these are DirectByteBuffer metrics. If so, then the existing BufferPoolMXBean should be able to provide the required data.
- process.runtime.jvm.buffer.count : long java.lang.management.BufferPoolMXBean.getCount()
- process.runtime.jvm.buffer.limit : long java.lang.management.BufferPoolMXBean.getTotalCapacity()
- process.runtime.jvm.buffer.usage : long java.lang.management.BufferPoolMXBean.getMemoryUsed()
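As a quick check that the existing bean covers these three values, the sketch below reads every platform `BufferPoolMXBean` using only standard `java.lang.management` calls. The `BufferPoolSketch` class and its `pool(String)` helper are illustrative names; the pool names themselves (such as `"direct"`) come from the running JDK.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class BufferPoolSketch {
    // Returns the platform BufferPoolMXBean with the given name (e.g. "direct"), or null if absent.
    static BufferPoolMXBean pool(String name) {
        for (BufferPoolMXBean p : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if (p.getName().equals(name)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Allocate a direct buffer so the "direct" pool has at least one entry.
        ByteBuffer buf = ByteBuffer.allocateDirect(4096);
        for (BufferPoolMXBean p : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            // count -> jvm.buffer.count, totalCapacity -> jvm.buffer.limit, memoryUsed -> jvm.buffer.usage
            System.out.printf("%s: count=%d limit=%d usage=%d%n",
                    p.getName(), p.getCount(), p.getTotalCapacity(), p.getMemoryUsed());
        }
        System.out.println("allocated " + buf.capacity() + " direct bytes");
    }
}
```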
Likewise, for the class-loading metrics, one could use the existing ClassLoadingMXBean:
- process.runtime.jvm.classes.current_loaded : long java.lang.management.ClassLoadingMXBean.getLoadedClassCount()
- process.runtime.jvm.classes.loaded : long java.lang.management.ClassLoadingMXBean.getTotalLoadedClassCount()
- process.runtime.jvm.classes.unloaded : long java.lang.management.ClassLoadingMXBean.getUnloadedClassCount()
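A minimal sketch of reading all three values from the standard `ClassLoadingMXBean`. The `ClassLoadingSketch` class and its `Snapshot` record are illustrative names, and the record requires Java 16 or later.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadingSketch {
    // The three class-loading values the proposal maps to OTEL metrics.
    record Snapshot(long currentLoaded, long totalLoaded, long unloaded) {}

    static Snapshot snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new Snapshot(
                bean.getLoadedClassCount(),      // jvm.classes.current_loaded
                bean.getTotalLoadedClassCount(), // jvm.classes.loaded (cumulative since start)
                bean.getUnloadedClassCount());   // jvm.classes.unloaded (cumulative since start)
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

The three getters are read one after another rather than atomically, so a consumer should not expect `totalLoaded == currentLoaded + unloaded` to hold exactly for a single snapshot.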
For this one, I am thinking we could extend the existing J9 JvmCpuMonitorMXBean with an additional API that returns the context switch rate. Internally, we would poll the number of context switches at a fixed interval and then return a context switch rate per second.
- process.runtime.jvm.cpu.context_switch : float com.ibm.lang.management.JvmCpuMonitorMXBean.getContextSwitchRate() //Not yet implemented
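The fixed-interval polling described above could look roughly like the following hypothetical sketch. The native source of the cumulative context-switch count is not specified here, so it is abstracted as a `LongSupplier`; `ContextSwitchRateSketch` is an illustrative name and not an existing OpenJ9 class.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

public class ContextSwitchRateSketch {
    private final LongSupplier cumulativeSwitches; // hypothetical source of a monotonically increasing count
    private final LongSupplier nanoClock;          // e.g. System::nanoTime
    private long lastCount;
    private long lastNanos;
    private volatile double ratePerSecond;         // last computed rate, read without locking

    ContextSwitchRateSketch(LongSupplier cumulativeSwitches, LongSupplier nanoClock) {
        this.cumulativeSwitches = cumulativeSwitches;
        this.nanoClock = nanoClock;
        this.lastCount = cumulativeSwitches.getAsLong();
        this.lastNanos = nanoClock.getAsLong();
    }

    // Called once per polling interval: turns the count delta into a per-second rate.
    synchronized void poll() {
        long count = cumulativeSwitches.getAsLong();
        long now = nanoClock.getAsLong();
        long elapsed = now - lastNanos;
        if (elapsed > 0) {
            ratePerSecond = (count - lastCount) * 1_000_000_000.0 / elapsed;
        }
        lastCount = count;
        lastNanos = now;
    }

    // What a getContextSwitchRate()-style accessor would return: the most recent rate.
    double getContextSwitchRate() {
        return ratePerSecond;
    }

    // Schedules poll() at a fixed interval on a daemon thread.
    ScheduledExecutorService start(long intervalMillis) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "ctx-switch-poller");
            t.setDaemon(true);
            return t;
        });
        exec.scheduleAtFixedRate(this::poll, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return exec;
    }

    public static void main(String[] args) {
        // Fake sources: 500 switches over one simulated second.
        AtomicLong fakeCount = new AtomicLong();
        AtomicLong fakeClock = new AtomicLong();
        ContextSwitchRateSketch s = new ContextSwitchRateSketch(fakeCount::get, fakeClock::get);
        fakeCount.addAndGet(500);
        fakeClock.addAndGet(1_000_000_000L);
        s.poll();
        System.out.println(s.getContextSwitchRate()); // 500.0
    }
}
```

Returning the last computed value keeps the accessor cheap, at the cost of the reported rate lagging by up to one polling interval.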
Here I believe we can use the OperatingSystemMXBean, which has a getProcessingCapacity() method. I think the key here is to take into account any CPU limits when the JVM is run in a virtualized environment. This API currently doesn't do that on J9, but we do so in other places, so we can apply the same treatment here.
- process.runtime.jvm.cpu.limit : int com.ibm.lang.management.OperatingSystemMXBean.getProcessingCapacity()
Likewise, we can use the OperatingSystemMXBean to query the process CPU load. We can also enhance it to return the machine CPU load.
- process.runtime.jvm.cpu.utilization : double com.ibm.lang.management.OperatingSystemMXBean.getProcessCpuLoad()
- process.runtime.jvm.system.cpu.utilization : double com.ibm.lang.management.OperatingSystemMXBean.getMachineCpuLoad() //Not yet implemented
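For comparison, the JDK's `com.sun.management.OperatingSystemMXBean` already exposes both a process load and a whole-system load. The sketch below uses it when the platform bean implements that interface and falls back to `-1` otherwise; whether a given OpenJ9 release implements it is an assumption to verify, and `getCpuLoad()` requires Java 14 or later.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class CpuLoadSketch {
    // Returns {processCpuLoad, systemCpuLoad}; each is in [0.0, 1.0], or negative if unavailable.
    static double[] cpuLoads() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean ext) {
            return new double[] {ext.getProcessCpuLoad(), ext.getCpuLoad()}; // getCpuLoad() is Java 14+
        }
        return new double[] {-1.0, -1.0}; // platform bean does not expose the extended interface
    }

    public static void main(String[] args) {
        double[] loads = cpuLoads();
        System.out.printf("process=%.3f system=%.3f%n", loads[0], loads[1]);
    }
}
```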
Here I believe the GarbageCollectorMXBean has the relevant data. It looks like OTEL currently registers a handler that records the duration of each GC collection. The API below works differently in that it returns the cumulative collection time.
- process.runtime.jvm.gc.duration : long java.lang.management.GarbageCollectorMXBean.getCollectionTime()
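Because `getCollectionTime()` is cumulative, a consumer would have to sample it and take differences. A minimal sketch using only standard calls; `GcTimeDeltaSketch` and its helpers are illustrative names.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeDeltaSketch {
    // Sums cumulative collection time (ms) across all collectors; -1 means "undefined" and is skipped.
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Converts two cumulative samples into the GC time spent between them (never negative).
    static long delta(long previous, long current) {
        return Math.max(0, current - previous);
    }

    public static void main(String[] args) {
        long before = totalCollectionTimeMillis();
        System.gc();
        long after = totalCollectionTimeMillis();
        System.out.println("GC time since previous sample: " + delta(before, after) + " ms");
    }
}
```

One trade-off to keep in mind: a delta of cumulative time gives the total GC time per interval, but not the per-collection durations that a notification-based handler can record individually.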
Here we can add a method to ThreadMXBean that returns the total lock wait time for a given thread.
- process.runtime.jvm.cpu.longlock : long com.ibm.lang.management.ThreadMXBean.getLockWaitTimes(long tid) //Not yet implemented
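As a point of reference, standard `java.lang.management` already reports per-thread monitor contention through `ThreadInfo.getBlockedTime()`, once contention monitoring is enabled. This covers only time spent BLOCKED entering a monitor, so it may not match what the proposed `getLockWaitTimes` or the existing OTEL lock metric measure. The sketch assumes nothing beyond the standard API; `LockWaitSketch` is an illustrative name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockWaitSketch {
    // Returns the accumulated time (ms) the thread has spent BLOCKED entering monitors,
    // counted from when contention monitoring was enabled, or -1 if unavailable.
    static long blockedTimeMillis(long threadId) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (!threads.isThreadContentionMonitoringSupported()) {
            return -1;
        }
        if (!threads.isThreadContentionMonitoringEnabled()) {
            threads.setThreadContentionMonitoringEnabled(true);
        }
        ThreadInfo info = threads.getThreadInfo(threadId); // null if the thread is no longer alive
        return info == null ? -1 : info.getBlockedTime();
    }

    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();
        blockedTimeMillis(Thread.currentThread().getId()); // enables monitoring before contention starts
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // Entered only after main releases the monitor.
            }
        });
        synchronized (lock) {
            waiter.start();
            Thread.sleep(200); // waiter is now BLOCKED on lock
            System.out.println("waiter blocked ms: " + blockedTimeMillis(waiter.getId()));
        }
        waiter.join();
    }
}
```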
My understanding is that this simply reports the thread-local and non-thread-local object allocation amounts. I propose adding a new API to GarbageCollectorMXBean that returns the cumulative object allocation metrics.
- process.runtime.jvm.memory.allocation : long com.ibm.lang.management.GarbageCollectorMXBean.getTotalObjectMemoryAllocated() //Not yet implemented
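To illustrate why a cumulative API is useful: the closest standard option, `com.sun.management.ThreadMXBean.getThreadAllocatedBytes(long[])`, only covers threads that are currently alive, so bytes allocated by threads that have already terminated are lost. A minimal sketch, assuming the platform bean implements that interface; `AllocationSketch` is an illustrative name.

```java
import java.lang.management.ManagementFactory;

public class AllocationSketch {
    // Sums bytes allocated by currently live threads; returns -1 if the JVM does not support it.
    static long liveThreadsAllocatedBytes() {
        java.lang.management.ThreadMXBean base = ManagementFactory.getThreadMXBean();
        if (!(base instanceof com.sun.management.ThreadMXBean)) {
            return -1;
        }
        com.sun.management.ThreadMXBean threads = (com.sun.management.ThreadMXBean) base;
        if (!threads.isThreadAllocatedMemorySupported()) {
            return -1;
        }
        if (!threads.isThreadAllocatedMemoryEnabled()) {
            threads.setThreadAllocatedMemoryEnabled(true);
        }
        long total = 0;
        for (long bytes : threads.getThreadAllocatedBytes(threads.getAllThreadIds())) {
            if (bytes > 0) {
                total += bytes; // -1 marks threads that died between the two calls
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("allocated by live threads: " + liveThreadsAllocatedBytes() + " bytes");
    }
}
```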
Here I propose adding a new MXBean that returns the cumulative network IO stats.
- process.runtime.jvm.network.io : long com.ibm.lang.management.NetworkMXBean.getTotalIOBytes() //Not yet implemented
- process.runtime.jvm.network.time : long com.ibm.lang.management.NetworkMXBean.getTotalIOtime() //Not yet implemented
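One possible shape for the proposed bean is sketched below. Everything here is hypothetical: the `NetworkMXBean` interface only mirrors the method names listed above, the `com.example.management:type=Network` ObjectName is a placeholder, and nanoseconds for `getTotalIOtime()` is an assumed unit. The point is that a plain interface ending in `MXBean` can be registered on the platform MBeanServer and read by any JMX client.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class NetworkMXBeanSketch {
    // Hypothetical interface mirroring the proposed API; not part of any shipping JDK.
    public interface NetworkMXBean {
        long getTotalIOBytes();
        long getTotalIOtime(); // assumed unit: nanoseconds
    }

    // Hypothetical implementation backed by cumulative counters that I/O code paths would update.
    public static final class NetworkStats implements NetworkMXBean {
        private final AtomicLong bytes = new AtomicLong();
        private final AtomicLong nanos = new AtomicLong();

        public void recordIO(long byteCount, long elapsedNanos) {
            bytes.addAndGet(byteCount);
            nanos.addAndGet(elapsedNanos);
        }

        @Override
        public long getTotalIOBytes() {
            return bytes.get();
        }

        @Override
        public long getTotalIOtime() {
            return nanos.get();
        }
    }

    // Registers the bean under a placeholder ObjectName so JMX clients can read it.
    static ObjectName register(NetworkStats stats) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("com.example.management:type=Network");
        server.registerMBean(stats, name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        NetworkStats stats = new NetworkStats();
        ObjectName name = register(stats);
        stats.recordIO(1024, 2_000_000);
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        System.out.println(server.getAttribute(name, "TotalIOBytes")); // 1024
    }
}
```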
I believe we can use the existing ThreadMXBean APIs for this.
- process.runtime.jvm.threads.count : int java.lang.management.ThreadMXBean.getThreadCount()
My understanding is that this reports metrics for various JVM components (Java heap, metaspace, JIT code cache, ...). J9's internal memory management differs from HotSpot's, so there may not be a direct parallel for each of these. That being said, I think we can use a similar division, where JIT, GC, and class memory stats are reported separately.
- process.runtime.jvm.memory.committed : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getTotalMemoryCommitted()
- process.runtime.jvm.memory.init : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getInitialMemoryRequested()
- process.runtime.jvm.memory.limit : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMaxMemoryLimit()
- process.runtime.jvm.memory.usage : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMemoryUsed()
- process.runtime.jvm.memory.usage_after_last_gc : long com.ibm.lang.management.[Classloading|JIT|GC]MXBean.getMemoryUsedAfterLastGC()
//None of the above are currently implemented.
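For comparison, the standard `MemoryPoolMXBean` already exposes these five values per pool, and as far as I know this is what the existing Java 8 module reads. The sketch below prints them using only standard calls; `MemoryPoolSketch` is an illustrative name. On OpenJ9, the pool names and granularity will differ, which is the gap the proposed per-component beans would fill.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class MemoryPoolSketch {
    // One line per valid pool: init -> memory.init, used -> memory.usage, committed -> memory.committed,
    // max -> memory.limit (-1 if undefined), collection usage -> memory.usage_after_last_gc.
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage(); // null if the pool is no longer valid
            if (u == null) {
                continue;
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if not supported for this pool
            lines.add(String.format("%s [%s]: init=%d used=%d committed=%d max=%d afterLastGc=%s",
                    pool.getName(), pool.getType(), u.getInit(), u.getUsed(), u.getCommitted(), u.getMax(),
                    afterGc == null ? "n/a" : Long.toString(afterGc.getUsed())));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
    }
}
```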
We will also need a way to differentiate between HotSpot, J9, and older versions of J9 that do not have the enhanced MXBeans. For this, I propose adding a property to newer JDKs: "org.eclipse.openj9.extendedMXBeanVersion=[1.XX]".
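On the consumer side, the instrumentation could branch on the proposed property roughly as follows. This is a sketch under two assumptions: that J9-based VMs report a `java.vm.name` containing "J9", and that the proposed (not yet existing) property carries a "1.XX" version string. `VmFlavorSketch`, `Flavor`, and `minorVersion` are illustrative names.

```java
public class VmFlavorSketch {
    enum Flavor { HOTSPOT_OR_OTHER, OPENJ9_LEGACY, OPENJ9_EXTENDED }

    // Property name as proposed in this issue; no released JDK sets it yet.
    static final String EXT_PROPERTY = "org.eclipse.openj9.extendedMXBeanVersion";

    // Classifies a VM from its name and the (possibly absent) proposed version property.
    static Flavor classify(String vmName, String extendedVersion) {
        boolean j9 = vmName != null && vmName.contains("J9");
        if (!j9) {
            return Flavor.HOTSPOT_OR_OTHER;
        }
        return extendedVersion == null ? Flavor.OPENJ9_LEGACY : Flavor.OPENJ9_EXTENDED;
    }

    // Parses the minor part of a "1.XX" version so features can be gated; -1 if malformed.
    static int minorVersion(String version) {
        if (version == null) {
            return -1;
        }
        String[] parts = version.split("\\.");
        if (parts.length != 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    static Flavor current() {
        return classify(System.getProperty("java.vm.name"), System.getProperty(EXT_PROPERTY));
    }

    public static void main(String[] args) {
        System.out.println(current());
    }
}
```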
Please let me know your thoughts and if I've misunderstood any of the OTEL behaviour. I expect we will need to go back and forth to iron out something that will work.
@trask Any thoughts on the next steps?
hi @tajila,
> My understanding is that these are DirectByteBuffer metrics. If so, then the existing BufferPoolMXBean should be able to provide the required data.
the java17 implementations of some of these metrics are alternative implementations of the java8 versions, see https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java8/library
in general, the java17 "alternative" implementations are disabled by default, see the "Default Enabled" column on https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library
Hi @trask
> in general, the java17 "alternative" implementations are disabled by default, see the "Default Enabled" column on https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library
Thanks, this makes more sense.
So if one wanted to add support for the Java 17 metrics that have no alternative implementation (e.g. jvm.cpu.longlock, jvm.network.io) without using JFR, what would be the best way to do that?
you could add them to https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry/runtime-telemetry-java17/library
the runtime-telemetry-java17 module is for anything that's only supported in Java 17 and later (it doesn't have to be JFR-based)
Is your feature request related to a problem? Please describe.
Java 17 support includes additional metrics such as CPU_COUNT_METRICS and LOCK_METRICS described here. On the OpenJ9 JVM these metrics are not available, because the runtime-telemetry-java17 support depends on JFR streaming capabilities that are based on the HotSpot JVM.
Describe the solution you'd like
I would like to ask if the runtime-telemetry community is open to working with the OpenJ9 Community to provide support for the metrics introduced in java17. The high level idea would be for OpenJ9 to provide some additional MXBean APIs that supply the required data for the new metrics. These would then be integrated in runtime-telemetry-java17.
Describe alternatives you've considered
One alternative is to create a runtime-telemetry agent specifically for OpenJ9, but this approach is not preferred.
Additional context
Please let me know if this is not the correct place to raise this request.