trask commented 10 months ago

Virtual threads are out of preview and included in new Java 21 (LTS) release (https://openjdk.org/jeps/444).

Notes/references from weekly Java meeting:

https://quarkus.io/blog/virtual-thread-1/#five-things-you-need-to-know-before-using-virtual-threads-for-everything
https://github.com/elastic/apm-agent-java/issues/3059
- ThreadLocals are still supported, but can be more memory intensive for things like caches of non-thread-safe objects (e.g. SimpleDateFormat) now that there can be huge numbers of virtual threads
There is a new executor service which needs to be instrumented

JonasKunz commented 10 months ago

Here's my understanding on this topic, I've implemented the virtual thread support in the elastic apm agent.

Thread Locals

The usage of thread locals prior to virtual threads can be roughly divided into two categories:

A ThreadLocalis used as a way of implicitly passing values up or down the callstack. E.g. the OpenTelemetry Context is a prime example for this usecase.
A ThreadLocal is used to pool an expensive, non-thread safe resource.

The first use case is still fine with virtual threads: It is just a necessary tool and the lifetime of the thread local is usually limited (e.g. via the closeable Scope).

The second use case is the one which can be problematic. A good example is the TemporaryBuffers class from the SDK. The buffer is allocated on first usage in a Thread and from then on lives until the thread dies.

With non-virtual threads this was super nice, especially because your typical web application would use a thread-pool of worker threads. After every thread in the pool has been used, TemporaryBuffers would be allocation free because the ThreadLocal has been initialized for every pooled thread.

With virtual threads the intended usage would be to not pool threads, but create a new virtual thread for each request. This means that for each request a new TemporaryBuffer is allocated and first usage. It could only be reused on further usage within the same request (=the same virtual thread). If your requests are short lived, this means that the pooling is somewhat ineffective but the memory consumption would still be okay.

Where things get really bad is if the ThreadLocal unnecessarily prolongs the lifetime of the pooled object. The best example here imo is a Java service uses as HTTP Gateway/Proxy. Previously you would typically use reactive programming here, whereas with virtual threads you can now easily go back to the thread-per-request model.

Imagine an HTTP Gateway which just forwards the HTTP requests to other services which take 10s on avergage. With 10k requests per second, this means that at any given time 100k requests are served concurrently, equating to 100k virtual threads mostly waiting on the response of the downstream HTTP service. If every virtual thread holds on to a 4kb TemporaryBuffer, this gives you about 400mb wasted heap memory.

Alternatives

Give up on pooling and live with the "expensive" resource creation (e.g. higher allocation rate)
- In the elastic agent we did this for some char[] pools which were only used once per handled HTTP request IIRC
Use a true Object-pool instead of shoehorning ThreadLocals to do this. This is not free though, as the thread-safe management usually requires atomic operations when acquiring / releasing Objects.
- In the elastic agent we use an object pool based on the MpmcAtomicArrayQueue from JCTools
Stay with ThreadLocal pooling if the pooled resource is very lightweight AND used very frequently, even during a single request
- In the elastic apm agent we use a special WeakConcurrentMap (GH). That map requires a special key object for lookups with overidden hashcode() and equals(). In order to not always having to allocate those, this special key object is stored in a ThreadLocal and reused for every get() on the map. Because that object is both very small and used very heavily, we stayed with a ThreadLocal in this case.
Caution, hack: The JVM seems to have an internal "carrier thread local" (see this discussion on the jackson GH repo). This would be super dangerous to use as you would have to guarantee that your virtual thread is not unmounted while you are holding onto the pooled resource. In the future this might be even impossible if they add preemption to virtual threads.

Suggestion for next steps

In my opinion it would make sense to go through both the Otel Agent and the SDK and to look at all usages of ThreadLocals.

If they are used for passing implict parameter and have a short lifetime: great, nothing todo!
If they are used for pooling decide which of the proposed alternatives would be the best fit

We (elastic) can definitely contribute the object pooling implementation and it's JMH benchmark suite to Otel if desired. We've been using this pooling implementation for a few years now to minimize allocation rates prior to virtual threads, so it is battle tested.

Executors

The newly added Executors are [Executors.newVirtualThreadPerTaskExecutor()](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/concurrent/Executors.html#newVirtualThreadPerTaskExecutor()) and Executors.newThreadPerTaskExecutor(ThreadFactory). I think the former builds on the latter. Chances are high that our instrumentation already covers these, so maybe just adding tests for those should be good enough. This is at least how it was for the elastic agent.

JonasKunz commented 10 months ago

I looked for all usages of ThreadLocal in the API, SDK and Agent and tried to classify for which usages action might be required for handling virtual threads nicely.

Usages might needing work

API: TemporaryBuffers

All except one usages acquire a very small buffer (<64 chars, so <128 byte)
- SpanId parsing/generation (16 chars)
- TraceId parsing/generation (32 chars)
- W3CTraceContextPropagator (55 chars)
The only "large" buffer is allocated by PercentEscaper (1000 chars, so 2000 bytes)
Proposed Solution:
- Stay with thread locals for <64 chars for performance
- Use a char[] object pool for PercentEscaper. The two atomic operations for pool management shouldn't have too much overhead in relation to the percent escaping logic. It is only on a slow path anyway
- Not too urgent for now, as this code path is only entered when actually requiring encoding (e.g. special characters in propagated baggage)

SDK: JdkHttpSender

ThreadLocal<ByteBufferPool> is very heavy-weight
Should be okay though, because this code path is only executed for a private thread-pool with five threads
We would only need to adapt this if we plan to replace that thread-pool with virtual threads at some time, because then the pooling will become ineffective (but no memory leak due to prolonged lifetime)

SDK: ProtoSerializer

Uses ThreadLocal<OutputStreamEncoder> with 50kb buffer
ThreadLocal<Map<String, byte[]>> THREAD_LOCAL_ID_CACHE is fine because it is cleared on ProtoSerializer.close()
Assumption: ProtoSerializer is only called from very few exporter threads, correct? Then the ThreadLocals are fine.
- If we plan on ever replacing those Threads with virtual ones the pooling would become ineffective (but no memory leak due to prolonged lifetime)

For the agent I think no changes are needed, all usages seem fine to me!

Usages looking fine to me

I checked the following usages across API, SDK and Agent but classified them as non-problematic for virtual threads:

API:

ThreadLocalContextStorage: Cleared when no context is active

SDK:

WeakConcurrentMap: ThreadLocal<LookupKey<?>>
- Expected to be used very frequently, so an object pool would be bad for performance imo
- Object is very small, so okay to have as "deadweight" even for virtual threads

Agent:

Usages of ThreadLocal with a short lifespan (e.g. ThreadLocal is cleared after an instrumented method ends):
- Kafka Streams StateHolder
- JedisRequestContext
- Kubernetes client CurrentState
- Liberty ThreadLocalContext
- RestEasy OtelRequestContext
- RabbitMQ ThreadLocal<Context>
- RMI ThreadLocalContext
- AgentBuilderUtil.TransformContext
Usages of ThreadLocal which are not cleared before thread death, but the objects are very small:
- KafkaClientsConsumerProcessTracing
- Spring TracingChannelInterceptor ThreadLocal<Map<MessageChannel, ContextAndScope>>
- The map remains, but it is cleared
- SpringSchedulingTaskTracing
- Vert.X ThreadLocal<SqlConnectOptions>
- Not cleared after usage, therefore the lifetime of SqlConnectOptions is prolonged. However, I would assume that the number of actual SqlConnectOptions instances is low enough so that this doesn't matter.
- InClassLoaderMatcher ThreadLocal<boolean[]>
- DefineClassHandler ThreadLocal<DefineClassContextImpl>

TL;DR

All ThreadLocal usages in the agent are fine
Under the assumption that exporters are not using virtual threads we are good in the SDK
For the API we only need to look at the TemporaryBuffers usage by PercentEscaper.
- Imo it would be worth replacing this usage with an object-pool instead
- Not too urgent, as excessive memory usage would be an edge case even with lots of virtual threads

Some Feedback on whether you agree / disagree would be great!

open-telemetry / opentelemetry-java-instrumentation

Support for virtual threads (Java 21) #9534