open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

Issue with virtual-threads #10181

Closed vmaleze closed 6 months ago

vmaleze commented 11 months ago

Describe the bug

The java-agent locks up any application using virtual threads upon an external call if you add -Djdk.tracePinnedThreads=short to the VM options.

Steps to reproduce

Create a fresh application using either Spring or Micronaut. Enable virtual threads and add -Djdk.tracePinnedThreads=short to the VM options. Add an external call to a database through a REST controller. (I'm guessing that any external call will do, but I only have a Redis and a MySQL DB to test with.)
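
A minimal sketch of the kind of setup described above (names here are just examples; this assumes Spring Boot 3.2 with spring.threads.virtual.enabled=true, the OpenTelemetry javaagent attached, and -Djdk.tracePinnedThreads=short on the command line):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical controller: any handler that calls out to an external system
// (MySQL via JDBC here, Redis behaves the same) reproduces the hang.
@RestController
class HelloController {

    private final JdbcTemplate jdbcTemplate;

    HelloController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @GetMapping("/hello")
    String hello() {
        return jdbcTemplate.queryForObject("SELECT 'hello'", String.class);
    }
}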

Expected behavior

The application should not lock up when -Djdk.tracePinnedThreads=short is added to the VM options.

Actual behavior

The HTTP call is stuck forever and the application hangs as well, with no option other than restarting it.

Javaagent or library instrumentation version

1.32.0

Environment

JDK: 21
OS: Ubuntu 22.04 / macOS 14.2.1
Spring Boot: 3.2.1 / Micronaut: 4.2.3

Additional context

I have a fresh Micronaut application (version 4.2.3) that runs on Java 21 with virtual threads enabled. This application also uses the OpenTelemetry Java agent (version 1.32.0).

To debug pinned threads, I added -Djdk.tracePinnedThreads=short. This works fine on a vanilla application with a hello-world controller. However, upon adding a call to an external system, the application locks itself into a state where it no longer responds. I tried calling a Redis cache and a MySQL database, but the result ends up being the same. This also happens on a Spring application.

If I remove the java-agent, or disable virtual-threads, I don't have any problem.

I first noticed this on another application that I migrated to Java 21. Under a stress test, the service ended up stuck. I suspected virtual threads, so I wanted to monitor pinned threads, and upon adding the JVM param the application locked itself on the first call. I am unable to find the actual root cause, but it seems to come from the java-agent when calling an external resource. If the JVM param is not added, the call succeeds, but the service still fails over time during stress tests. However, this is hard to reproduce with stress tests alone, as it can take 5 or 6 minutes, whereas with this VM option it fails right away.

trask commented 11 months ago

hi @vmaleze! to speed up investigation, can you create a repository with steps on how to build and run to reproduce the issue?

vmaleze commented 10 months ago

Hi @trask, here it is => https://github.com/vmaleze/otel-issue-virtual-threads

If you need more from me, don't hesitate. I'm willing to help if I can contribute as well :)

anuragagarwal561994 commented 10 months ago

@trask even I have seen this issue in our production; if I remove the javaagent everything works fine. I had earlier thought to enable tracePinnedThreads on one pod in production and keep it running, so that I would get an update if some future code push leads to thread pinning. But since the pod was using the javaagent as well, it got stuck, so I had to drop that idea for now.

trask commented 10 months ago

@JonasKunz since you've done some investigation of this topic previously in https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/9534#issuecomment-1731088793, would you be able to look at this?

JonasKunz commented 10 months ago

I'll have a look

JonasKunz commented 10 months ago

First of all, thanks @vmaleze for providing the project to reproduce; this has made life much easier.

Unfortunately the issue is a pain to debug, because the entire JVM becomes unresponsive.

I've been able to further simplify the provided project into a standalone test case verifying that this is actually a JVM bug:


// AgentMain.java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.IllegalClassFormatException;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class AgentMain {
    public static void premain(String agentArgs, Instrumentation inst) {
        System.out.println("Registering No-Op transformer");
        // A transformer that never modifies any class; its mere presence is enough
        // to trigger the freeze together with -Djdk.tracePinnedThreads=short.
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(Module module, ClassLoader loader, String className, Class<?> classBeingRedefined, ProtectionDomain protectionDomain, byte[] classfileBuffer) throws IllegalClassFormatException {
                return null; // null means "leave the class unchanged"
            }
        });
    }
}

// Main.java
import java.util.concurrent.Executors;

public class Main {

    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            Object lock = new Object();
            executor.submit(() -> {
                // Sleeping while holding a monitor pins the virtual thread to its carrier,
                // which is exactly the situation jdk.tracePinnedThreads is meant to report.
                synchronized (lock) {
                    try {
                        System.out.println("Entering sleep ");
                        Thread.sleep(100);
                        System.out.println("Done sleep ");
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
    }
}

Running the provided Main with an agent containing just the AgentMain (without Otel!) and -Djdk.tracePinnedThreads=short causes the JVM to freeze. The virtual thread never wakes up from its sleep. I'll open a JVM bug and link it here.
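
For completeness, one way to package and run this reproducer (a sketch; the jar and file names below are just placeholders): compile both classes, put AgentMain into a jar whose MANIFEST.MF contains the single attribute Premain-Class: AgentMain, then start Main with that jar attached:

javac AgentMain.java Main.java
jar cfm noop-agent.jar manifest.txt AgentMain*.class
java -javaagent:noop-agent.jar -Djdk.tracePinnedThreads=short -cp . Main

Here manifest.txt contains one line, Premain-Class: AgentMain, followed by a newline (the jar tool ignores a last line without one).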

anuragagarwal561994 commented 10 months ago

@JonasKunz I have also been seeing the application consume more CPU with the javaagent and virtual threads, reducing application throughput with virtual threads, although I am still investigating. Is there an existing issue that already addresses this?

For now I have a few observations from an environment with 16 virtual cores (an 8-physical-core VM):

I understand these are very abstract observations, but I am just trying to understand whether this is expected, so I can channel my investigation in a focused direction.

JonasKunz commented 10 months ago

@anuragagarwal561994 I don't think this is related to this issue, though it's best to wait for the JVM fix to ship and then try again. If you are then still seeing abnormal CPU usage, it's best to attach a CPU profiler to see where the time is actually spent.
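
For example, a quick way to capture a profile with the JDK's built-in tooling (a sketch; the pid, duration and file name are placeholders):

jcmd <pid> JFR.start duration=120s filename=profile.jfr settings=profile

The resulting profile.jfr can then be opened in JDK Mission Control to see where CPU time is spent.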

anuragagarwal561994 commented 10 months ago

How do I track it via a CPU profiler? I think the profiler doesn't report javaagent methods?

JonasKunz commented 10 months ago

I think the profiler doesn't report javaagent methods?

Are you referring to the fact that we inline Advices when doing instrumentation? This means that the Advice code itself won't appear in profiling stack traces, yes. However, the code called from Advices (e.g. SDK methods or helpers) will appear in them. I would guess that the Advice code itself is not the hotspot, I'd rather expect the code called by them to be responsible.

Btw this will be another benefit of #8999: the Advices won't be inlined anymore, so they will appear as normal method calls in profiles.

JonasKunz commented 9 months ago

Here's the JDK bug, apparently it has been closed as a duplicate. @trask do you want to close this issue, because the root cause is actually not related to the Otel agent?

Moscagus commented 9 months ago

Hi @trask, I have an application on Java 21 that makes intensive use of virtual threads. After a stress test, it ended up stuck. After an investigation I verified that the problem came from the OpenTelemetry Java agent, even in its latest version, 2.1.0. The problem arises because the agent leaves the carrier thread in an inconsistent state.

jcmd-otel.txt jstack-otel.txt

1 - "jstack-otel.txt" show the problem. For example: ... "ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000] Carrying virtual thread #134 --> Bug: same TID as carrier ..

2 - "jcmd-otel.txt" show the stack with problem. For example: .. "tid": "159185", "name": "virtual-Tn3EventsConsumerService-158213", "stack": [ "java.base\/jdk.internal.misc.Unsafe.park(Native Method)", --> Bug: VT use java.lang.VirtualThread.park, no jdk.internal.misc.Unsafe.park "java.base\/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)", "java.base\/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)", "java.base\/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)", "java.base\/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)", "java.base\/java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:182)", **"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)", --> opentelemetry java agent "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)", "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)", "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)", "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)", "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)", "io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)", "java.base\/java.util.concurrent.FutureTask.cancel(FutureTask.java:181)", "java.base\/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.cancel(ScheduledThreadPoolExecutor.java:291)", "java.base\/java.lang.VirtualThread.cancel(VirtualThread.java:705)", "java.base\/java.lang.VirtualThread.parkNanos(VirtualThread.java:628)", ...

3 - VirtualThread.java:
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L681
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L767

@Override
void parkNanos(long nanos) {
    ...
    if (nanos > 0) {
        ...
        try {
            yielded = yieldContinuation();  // may throw
        } finally {
            ...
            cancel(unparker);
        }
        ...
    }
}

@ChangesCurrentThread
private void cancel(Future<?> future) {
    ...
    if (!future.isDone()) {               // --------> when this condition occurs, the problem is triggered
        // need to switch to current carrier thread to avoid nested parking
        switchToCarrierThread();          // --------> here the carrier TID and the carrying-thread TID become the same, as shown in point 1 (jstack-otel.txt)
        try {
            future.cancel(false);         // --------> here the OpenTelemetry Java agent's instrumentation kicks in (the bug)
        } finally {
            switchToVirtualThread(this);  // --------> it never gets here
        }
    }
}

4 - AbstractWeakConcurrentMap.java https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L232

public static void expungeStaleEntries() {
  Reference<?> reference;
  while ((reference = REFERENCE_QUEUE.poll()) != null) {
     removeWeakKey((WeakKey<?>) reference);
  }
}

REFERENCE_QUEUE.poll() is the call that ends up in jdk.internal.misc.Unsafe.park, which leaves the application stuck, because, as explained in point 1, the carrier thread is reported as carrying a virtual thread with its own ID:

"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
   Carrying virtual thread #134   --> Bug: same TID as carrier
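
To illustrate the pattern involved (a rough sketch with hypothetical names, not the agent's actual classes): the agent attaches the propagated context to the Future through a weak-keyed map, and every lookup first drains the map's ReferenceQueue. As the stack above shows, ReferenceQueue.poll() acquires an internal ReentrantLock, so the lookup can park the calling thread, which at this point is the carrier thread that switchToCarrierThread() switched to.

import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a weak-keyed cache; the agent's real classes are
// AbstractWeakConcurrentMap / WeakConcurrentMap$WithInlinedExpunction.
final class WeakKeyedCache<K, V> {

    // Weak key that remembers the referent's identity hash so the entry can
    // still be located (and removed) after the referent has been collected.
    static final class WeakKey<K> extends WeakReference<K> {
        private final int hash;

        WeakKey(K key, ReferenceQueue<? super K> queue) {
            super(key, queue);
            this.hash = System.identityHashCode(key);
        }

        @Override public int hashCode() { return hash; }

        @Override public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof WeakKey<?> w)) return false;
            Object mine = get();
            return mine != null && mine == w.get();
        }
    }

    private final ReferenceQueue<K> queue = new ReferenceQueue<>();
    private final Map<WeakKey<K>, V> map = new ConcurrentHashMap<>();

    void put(K key, V value) {
        expungeStaleEntries();
        map.put(new WeakKey<>(key, queue), value);
    }

    V getIfPresent(K key) {
        // Every read drains the queue first; ReferenceQueue.poll() locks a
        // ReentrantLock internally, so this call may park the current thread.
        expungeStaleEntries();
        return map.get(new WeakKey<>(key, null));
    }

    private void expungeStaleEntries() {
        Reference<? extends K> ref;
        while ((ref = queue.poll()) != null) {
            map.remove(ref); // the enqueued reference is the stored WeakKey itself
        }
    }
}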

AlexElin commented 7 months ago

Here's the JDK bug, apparently it has been closed as a duplicate. @trask do you want to close this issue, because the root cause is actually not related to the Otel agent?

Note that the bug JDK-8322846 has been fixed

laurit commented 6 months ago

Closing as the JDK bug has been fixed.