SmallRyeThreadContext's close can use custom cleanup strategies

franz1981 commented 1 year ago

I've spotted this In few reactive benchmarks using quarkus which doesn't disable context propagation (fairly commons for users that don't change default I believe): from what i can see there (and the code confirm it) the behaviour of SmallRyeThreadContext on close is to call ThreadLocal's removal, in order to save leaks to happen.

Sadly ThreadLocal::remove have a huge impact perf wise due to https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L570

which end up calling Reference::clear via

https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L368

that can be seen in this flamegraph (in violet)

and at a deeper level

which show the interaction with the runtime to coordinate an async cleanup (it's a weak ref :"/ sob)

Given that such cost happen in the I/O thread it can be pretty severe and should be better saving it.

I order to improve this, would be great to be able to configure SmallRyeThreadContext via a lambda (once or more time) which can allow an external framework to perform some check to guide the close operation to perform a much cheaper set(null) (which doesn't remove the entry) or full fat remove as it is now: in the quarkus use case I think that if such lambda can query if the context of execution is within an I/O event loop thread (not just on a vertx ctx!), there won't be any need to remove the thread local entirely, because I/O threads will be alive for the whole application life-time.

franz1981 commented 1 year ago

Another (additional) option could be to let users provide their own Thread local like impl (with some public API) and quarkus can even decide to mix strategies:

by checking while running on I/O threads
replacing thread local with Netty's fast thread local ones (that would speedup A LOT usual thread local ops eg set/get)

FroMage commented 1 year ago

I already have support for custom thread locals, but nobody's ever dared to use them.

So, why don't we always call set(null) instead of remove() then if it's much cheaper?

franz1981 commented 1 year ago

Hi @FroMage this https://github.com/spring-cloud/spring-cloud-sleuth/issues/27#issue-101694560 explain in deep details what happen if remove isn't used

in short:

ThreadLocal instance put into the Thread's ThreadLocalMap an (weak reference) entry to itself
if the ctx propag ThreadLocal (which is a GC root in the class loader it belongs) won't be around anymore, such weak reference would drag for longer time (it's a weak ref!) the whole class loader too

In quarkus we don't care about it (please correct me if i'm wrong), because our class loader will be around for the whole application's life, meaning that we have no business into disconnect it's life duration from the one of the Thread referencing it (transitively, via the ThreadLocal).

The same reasoning can still hold to non I/O threads too, to be honest (so my previous statement seems inaccurate).

The sole difference of I/O threads vs blocking ones is that the former will last for the whole application duration as well (as the classloader and likely the context propagation usage too), hence the only impact is the presence of such ThreadLocal map entry ie a memory footprint negligible concern (few bytes).

What's dangerous, as mentioned in the issue I've referenced, is when classloaders life-time isn't the same of the application they belong, which can happen on wildfly, for example.

FroMage commented 1 year ago

OK, so it could be an option. But the problem is that most ThreadContextProvider implementations are third-party plugins, so they'd also have to be updated to stop clearing the ThreadLocal.

Frankly I think this all should go via the new "Storage" API, but I don't have the time to drive this :(

franz1981 commented 1 year ago

Why not using a sys property to decide which of the two behaviours? This will let advanced users which knows their own classloader to communicate their (static const) preference without affecting the default behaviour (using remove)

geoand commented 1 year ago

In quarkus we don't care about it (please correct me if i'm wrong), because our class loader will be around for the whole application's life, meaning that we have no business into disconnect it's life duration from the one of the Thread referencing it (transitively, via the ThreadLocal).

Actually, this is not the case in dev-mode - in dev-mode user code is loaded via a ClassLoader that is dropped when the application is reloaded. Same for tests actually.

franz1981 commented 1 year ago

That means that while compiled in dev/tests "mode" we have to disable such "feature", using TL::remove @geoand ?

user code is loaded via a ClassLoader that is dropped when the application is reloaded

Beware: it just means that the weak reference to the thread local map into the thread would still follow the weak reference GC rules and lifecycle ie they are just "temporary" leaks, meaning that will be around for more then necessary and would require both more heap and GC activity before cleaned up (likely full GCs).

Please let me know if you see other options in order to achieve it

geoand commented 1 year ago

That means that while compiled in dev/tests "mode" we have to disable such "feature", using TL::remove @geoand ?

Yes

franz1981 commented 1 year ago

One note @FroMage it seems that Netty's FastThreadLocal is not making use of weak ref on the remove path, I'm evaluating if it's "good enough" for our use case

@geoand too -> meaning we can just use the "custom thread local" feature of this repo in quarkus and use fast thread locals from there, on both dev/test/prod scenarios (they should just work)

geoand commented 1 year ago

This confuses me, don't we already use FastThreadLocal as a result of using Vert.x?

franz1981 commented 1 year ago

@geoand not for this specific one I can tell.

I've performed a micro benchmark to understand what are the others implications of NOT using fast thread locals for remove by using the benchmark at https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/concurrent/ThreadLocalScalability.java with -DthreadLocals=1 on JDK 19.

-t 1

Benchmark                                     Mode  Cnt          Score        Error  Units
ThreadLocalScalability.setGetNullout         thrpt   10  109481209.044 ± 464386.775  ops/s
ThreadLocalScalability.setGetRemove          thrpt   10   14307019.589 ±  49020.979  ops/s

-prof perfnorm of the remove case:

Benchmark                                                   Mode  Cnt         Score        Error      Units
ThreadLocalScalability.setGetRemove                        thrpt   10  14063430.793 ± 265518.974      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt    2         0.443               clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt    2         2.255               insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt    2         0.553                    #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt    2         0.005                    #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt    2         0.031                    #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt    2         0.020                    #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt    2         0.059                    #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt    2       184.114                    #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt    2         0.009                    #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt    2       415.255                    #/op

-t 4

Benchmark                                     Mode  Cnt          Score          Error  Units
ThreadLocalScalability.setGetNullout         thrpt   10  432425222.667 ±  7785949.059  ops/s
ThreadLocalScalability.setGetRemove          thrpt   10   38913911.834 ± 24150978.392  ops/s

-prof perfnorm of the remove case:

Benchmark                                                   Mode  Cnt         Score          Error      Units
ThreadLocalScalability.setGetRemove                        thrpt   10  38707328.433 ± 28134808.681      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt    2         0.808                 clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt    2         1.558                 insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt    2         1.603                      #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt    2         0.002                      #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt    2         0.734                      #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt    2         0.020                      #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt    2         0.655                      #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt    2       338.906                      #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt    2         0.453                      #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt    2       417.949                      #/op

what the numbers shows is that setNull have near linear scalability (and perfnorm agree to have the same IPC for both single/four threads) while remove doesn't: it is causing IPC to be halved and there's an increased activity of both L1 and LLC load-misses, likely the cause of such slowdown. This is likely due to the interaction with the reference queue; in short, remove is not just 1 order of magnitude slower, but it can cause some scalability issue.

NOTE In the bench I'm causing high contention over the same thread local instance: I'll try with many in a few to see if that change AND G1GC seems to be the most offended GC here, ParallelGC instead got a much reduced impact.

Going to check what happen with fast thread locals in a few.

geoand commented 1 year ago

That's a pretty big performance impact!

In Quarkus prod-mode, AFAIR, we use VertxThread for the worker pool as well, so we should always be using Netty threads, thus it makes complete sense to try and use FastThreadLocal

franz1981 commented 1 year ago

Ok @Sanne and @geoand so the proper fixes are:

use FastThreadLocal(s) on quarkus for this AND use remove: this would grant less performance but no scalability impact (still better perf than JDK ThreadLocal(s) and no weak refs!)
use a sys prop to control if use remove or not (set(null) is just faster)

The 2 things could happen separately but ideally we want both

geoand commented 1 year ago

That sounds reasonable to me.

Sanne commented 1 year ago

Awesome! Ok we could implement this as two different flags, but I do wonder if we're not better off to inject a custom Quarkus implementation of some SPI which applies both tricks. (I understand such an SPI doesn't exist yet, so just wondering).

geoand commented 1 year ago

I do wonder if we're not better off to inject a custom Quarkus implementation of some SPI which applies both tricks. (I understand such an SPI doesn't exist yet, so just wondering).

Yeah, I had the same thought, but I can't find where I wrote it :)

FroMage commented 1 year ago

@franz1981 The original idea behind StorageDeclaration is for ThreadContextProvider to simply declare their TL storages to SR-CP (see https://github.com/smallrye/smallrye-context-propagation/blob/main/testsuite/extra/src/main/java/io/smallrye/context/test/MyThreadContextProvider.java) and then CP, instead of calling the providers, will simply call set on all the ThreadLocals and make context switching super fast.

Then, there's another thing designed to make thread locals super fast, which is that instead of handing out ThreadLocal instances to the libs, we collect all declared StorageDeclaration and create one field for each in custom threads, and override the Vert.x thread factory so that all Vert.x threads have extra fields. See https://github.com/smallrye/smallrye-context-propagation/blob/main/storage/src/main/java/io/smallrye/context/storage/spi/StorageManager.java

So the idea is that, for example, you have a lib like narayana, which needs a ThreadLocal for the current transaction:

Instead of doing ThreadLocal<Transaction> tl = new ThreadLocal<>() you do ThreadLocal<Transaction> tl = StorageManager.threadLocal(Declaration.class);
You declare static class Declaration implements StorageDeclaration<Transaction>{}
Quarkus collects all the known StorageDeclaration and creates a subclass of VertxThread: public class QuarkusVertxThread extends VertxThread { Transaction transaction; }
We register a ThreadFactory so that Vert.x creates those threads only

We create a ThreadLocal subtype:

class TransactionThreadLocal extends ThreadLocal<Transaction> { 
public void set(Transaction tx) {
Thread t = Thread.currentThread();
if(t instanceof QuarkusVertxThread) {
((QuarkusVertxThread)t).transaction = tx;
} else {
super.set(tx);
}
}
// … etc
}

And then context switching and thread locals become fast.

All this was already implemented years ago in branches, but nobody had time to test and benchmark it. I still don't have time, but I'd be happy to help you along :)

See:

https://github.com/quarkusio/quarkus/compare/main...FroMage:quarkus:storage for the Vert.x+Quarkus side as well as ArC RESTEasy classic, and Narayana
https://github.com/jbosstm/narayana/compare/main...FroMage:narayana:storage for Narayana
https://github.com/resteasy/resteasy/commit/5bdc7a6892cb72099a01bcc557b32dcd6a50bae8 for RESTEasy Classic

Let me know if you're interested.

franz1981 commented 1 year ago

Thanks @FroMage I will go through your comment to check if can solve this (avoid remove and using the faster set(null) for us: I wasn't fully convinced by the noise in the micro-benchmark and I have involved the Jdk team to review what I believe to have noticed related remove. Anyway given that remove perform atomic operations and recreate the entry weak reference each time, it's still valid to move to a cheaper set (null) when possible ie prod class loader on Quarkus, regardless which thread local impl we choose.

franz1981 commented 1 year ago

I will likely experiment with a patch as you have suggested in the next weeks and it is indeed the best solution, perf-wise, although requires few code changes.

At the same time, Netty fast thread local is very similar but less invasive: differently from Java ThreadLocals, they would perform a simple array lookup in the FastThread instance. And, without introducing any "additional" dependency (ie Netty), but leveraging an already existing one, it doesn't involve many changes in Quarkus (a one liner? Maybe 2-3). Performing a micro-benchmark to compare it against a single field lookup solution should be trivial, will do it next week.

I will try schedule a call next week so we can decide together, given pro/cons of each solution.

smallrye / smallrye-context-propagation

SmallRyeThreadContext's close can use custom cleanup strategies #424