Closed vmaleze closed 6 months ago
hi @vmaleze! to speed up investigation, can you create a repository with steps on how to build and run to reproduce the issue?
Hi @trask, here it is => https://github.com/vmaleze/otel-issue-virtual-threads
If you need more from me, don't hesitate. I'm willing to help if I can contribute as well :)
@trask I have seen this issue in our production as well; if I remove the javaagent, everything works fine. I had earlier thought of enabling tracePinnedThreads
on one pod in production and keeping it running, so that I'd get an alert if some future code push led to thread pinning. But since that pod was also using the javaagent, it got stuck, so I had to drop that idea for now.
@JonasKunz since you've done some investigation of this topic previously in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/9534#issuecomment-1731088793, would you be able to look at this?
I'll have a look
First of all, thanks @vmaleze for providing the project to reproduce; this has made life much easier.
Unfortunately the issue is a pain to debug, because the entire JVM becomes unresponsive:

- `jcmd` ceases to work (the JVM doesn't respond anymore)
- `jstack` still works, but only provides the stack trace of the carrier thread, which isn't useful

I've been able to further simplify the provided project to a standalone test case verifying that this is actually a JVM bug:
```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.IllegalClassFormatException;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class AgentMain {
  public static void premain(String agentArgs, Instrumentation inst) {
    System.out.println("Registering No-Op transformer");
    inst.addTransformer(new ClassFileTransformer() {
      @Override
      public byte[] transform(Module module, ClassLoader loader, String className,
          Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
          byte[] classfileBuffer) throws IllegalClassFormatException {
        return null;
      }
    });
  }
}
```
```java
import java.util.concurrent.Executors;

public class Main {
  public static void main(String[] args) {
    try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
      Object lock = new Object();
      executor.submit(() -> {
        synchronized (lock) {
          try {
            System.out.println("Entering sleep");
            Thread.sleep(100);
            System.out.println("Done sleep");
          } catch (InterruptedException e) {
            throw new RuntimeException(e);
          }
        }
      });
    }
  }
}
```
Running the provided `Main` with an agent containing just the `AgentMain` (without Otel!) and `-Djdk.tracePinnedThreads=short` causes the JVM to freeze. The virtual thread never wakes up from its sleep. I'll open a JVM bug and link it here.
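For contrast, the repro pins because `Thread.sleep` happens inside a `synchronized` block. On JDK 21 the same structure with a `java.util.concurrent` lock does not pin: a virtual thread sleeping while holding a `ReentrantLock` can unmount from its carrier. Here's a minimal sketch of that variant (the class name `UnpinnedMain` is mine, not from the repro project):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class UnpinnedMain {
  public static void main(String[] args) {
    // Same shape as the repro above, but using a ReentrantLock instead of
    // `synchronized`. On JDK 21, blocking/sleeping while holding a j.u.c lock
    // lets the virtual thread unmount, so -Djdk.tracePinnedThreads reports
    // nothing for this code path.
    ReentrantLock lock = new ReentrantLock();
    try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
      executor.submit(() -> {
        lock.lock();
        try {
          System.out.println("Entering sleep");
          Thread.sleep(100);
          System.out.println("Done sleep");
        } catch (InterruptedException e) {
          throw new RuntimeException(e);
        } finally {
          lock.unlock();
        }
      });
    }
  }
}
```

Requires JDK 21+ for `newVirtualThreadPerTaskExecutor()`.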
@JonasKunz I have also been seeing the application consume more CPU with the javaagent and virtual threads, reducing application throughput, although I am still investigating. Is there an existing issue that already addresses this?
For now I have a few observations in an environment with 16 virtual cores (an 8-physical-core VM):
I understand these are very abstract observations, but I am just trying to find out whether this is expected, so I can channel my investigation in a focused direction.
@anuragagarwal561994 I don't think this is related to this issue; still, it's best to wait for a JVM fix for this issue to ship before trying again. If you are then still seeing abnormal CPU usage behaviour, it's best to attach a CPU profiler to find out where time is actually spent.
How would I track this via a CPU profiler? I think the profiler doesn't report javaagent methods?
> I think the profiler doesn't report javaagent methods?

Are you referring to the fact that we inline Advices when doing instrumentation? That means the Advice code itself won't appear in profiling stack traces, yes. However, the code called from Advices (e.g. SDK methods or helpers) will appear in them. I would guess that the Advice code itself is not the hotspot; I'd rather expect the code called by it to be responsible.
Btw, this will be another benefit of #8999: the Advices won't be inlined anymore, so they will appear as normal method calls in profiles.
Here's the JDK bug; apparently it has been closed as a duplicate. @trask do you want to close this issue, since the root cause is actually not related to the Otel agent?
Hi @trask, I have a Java 21 application that makes intensive use of virtual threads. After a stress test it ended up stuck. After an investigation I verified that the problem came from the OpenTelemetry java agent, even in its latest version, 2.1.0. The problem arises because the agent leaves the carrier thread in an inconsistent state.
1 - `jstack-otel.txt` shows the problem. For example:

```
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
   Carrying virtual thread #134    <-- Bug: same TID as carrier
```
2 - `jcmd-otel.txt` shows the stack with the problem. For example (bug: the virtual thread should park via `java.lang.VirtualThread.park`, not `jdk.internal.misc.Unsafe.park`):

```
"tid": "159185",
"name": "virtual-Tn3EventsConsumerService-158213",
"stack": [
  "java.base/jdk.internal.misc.Unsafe.park(Native Method)",    <-- Bug
  "java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)",
  "java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)",
  "java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)",
  "java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)",
  "java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)",
  "java.base/java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:182)",
  "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)",    <-- OpenTelemetry java agent
  "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
  "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)",
  "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
  "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
  "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
  "io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",
  "java.base/java.util.concurrent.FutureTask.cancel(FutureTask.java:181)",
  "java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.cancel(ScheduledThreadPoolExecutor.java:291)",
  "java.base/java.lang.VirtualThread.cancel(VirtualThread.java:705)",
  "java.base/java.lang.VirtualThread.parkNanos(VirtualThread.java:628)",
  ...
```
3 - `VirtualThread.java`:
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L681
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L767
```java
@Override
void parkNanos(long nanos) {
    ...
    if (nanos > 0) {
        ...
        try {
            yielded = yieldContinuation(); // may throw
        } finally {
            ...
            cancel(unparker);
        }
        ...
    }
}
```
```java
@ChangesCurrentThread
private void cancel(Future<?> future) {
    ...
    if (!future.isDone()) {               // <-- when this condition occurs, the problem is triggered
        // need to switch to current carrier thread to avoid nested parking
        switchToCarrierThread();          // <-- here the carrier TID and the carrying TID are the same, as shown in point 1 (jstack-otel.txt)
        try {
            future.cancel(false);         // <-- here the OpenTelemetry java agent begins to act (the bug)
        } finally {
            switchToVirtualThread(this);  // <-- it never gets here
        }
    }
}
```
4 - `AbstractWeakConcurrentMap.java`: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L232

```java
public static void expungeStaleEntries() {
    Reference<?> reference;
    while ((reference = REFERENCE_QUEUE.poll()) != null) {
        removeWeakKey((WeakKey<?>) reference);
    }
}
```
`REFERENCE_QUEUE.poll()` is the `poll()` that generates the `jdk.internal.misc.Unsafe.park`, leaving the application stuck, since, as explained in point 1, the carrier and the virtual thread have the same TID:

```
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
   Carrying virtual thread #134    <-- Bug: same TID as carrier
```
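To illustrate the pattern involved, here is a minimal, self-contained sketch of the same expunge loop over a `ReferenceQueue` (class and field names here are mine; the real map lives in the shaded `weaklockfree` package). `ReferenceQueue.poll()` acquires an internal `ReentrantLock`, which is exactly the lock visible in the frozen stack trace above:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class ExpungeDemo {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    static final Map<Reference<?>, String> MAP = new HashMap<>();

    public static void main(String[] args) {
        Object key = new Object();
        // Weak keys registered with a queue get enqueued once collectible.
        WeakReference<Object> ref = new WeakReference<>(key, QUEUE);
        MAP.put(ref, "value");

        // Normally the GC enqueues the reference when `key` becomes
        // unreachable; we enqueue manually so the demo is deterministic.
        ref.enqueue();

        expungeStaleEntries();
        System.out.println("entries after expunge: " + MAP.size());
    }

    // Same shape as AbstractWeakConcurrentMap.expungeStaleEntries():
    // drain the queue and drop the corresponding map entries. poll()
    // internally takes a ReentrantLock, which is where the stuck carrier
    // thread was observed parking.
    static void expungeStaleEntries() {
        Reference<?> reference;
        while ((reference = QUEUE.poll()) != null) {
            MAP.remove(reference);
        }
    }
}
```

The point is not that this loop is wrong in isolation, but that it runs on a carrier thread left in an inconsistent state by `switchToCarrierThread()`, so the park inside `poll()` never returns.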
Note that the bug JDK-8322846 has been fixed
closing as the jdk bug has been fixed
Describe the bug
The java agent locks any application using virtual threads upon an external call if you add `-Djdk.tracePinnedThreads=short` to the VM options.

Steps to reproduce

Create a fresh application using either Spring or Micronaut. Enable virtual threads and add `-Djdk.tracePinnedThreads=short` to the VM options. Add an external call to a database through a REST controller. (I'm guessing that any external call will do, but I only have a Redis or a MySQL DB to test with.)

Expected behavior

The application should not lock itself up when `-Djdk.tracePinnedThreads=short` is added to the VM options.

Actual behavior

The HTTP call is stuck forever, and the application is also stuck, with no other option than to restart it.
Javaagent or library instrumentation version
1.32.0
Environment
JDK: 21
OS: Ubuntu 22.04 / macOS 14.2.1
Spring Boot: 3.2.1 / Micronaut: 4.2.3
Additional context
I have a fresh Micronaut application (version 4.2.3) that runs on Java 21 with virtual threads enabled. This application also uses the OpenTelemetry java agent (version 1.32.0).
To debug pinned threads, I added `-Djdk.tracePinnedThreads=short`. This works fine on a vanilla application with a hello-world controller. However, upon adding a call to an external system, the application locks itself in a state where it no longer responds. I tried calling a Redis cache and a MySQL database, but the result ends up being the same. This also happens on a Spring application. If I remove the java agent, or disable virtual threads, I don't have any problem.
I first noticed this on another application that I migrated to Java 21. Under a stress test, the service ended up stuck. I suspected virtual threads, so I wanted to monitor pinned threads, and upon adding the JVM param the application locked itself on the first call. I am unable to find the actual root cause, but it seems to come from the java agent upon calling an external resource. If the JVM param is not added, the call succeeds, but it will fail over time during stress tests. However, this is hard to reproduce with stress tests alone, as it can take 5 or 6 minutes, whereas with this VM option it fails right away.