franz1981 opened 1 year ago
Another (additional) option could be to let users provide their own `ThreadLocal`-like impl (with some public API); Quarkus could even decide to mix strategies:
I already have support for custom thread locals, but nobody's ever dared to use them.
So, why don't we always call `set(null)` instead of `remove()`, then, if it's much cheaper?
Hi @FroMage, this https://github.com/spring-cloud/spring-cloud-sleuth/issues/27#issue-101694560 explains in deep detail what happens if `remove` isn't used.

In short: a `ThreadLocal` instance puts into the Thread's `ThreadLocalMap` a (weak-reference) entry keyed on the `ThreadLocal` itself. Once that `ThreadLocal` (which is a GC root in the class loader it belongs to) isn't around anymore, such a stale entry would drag the whole class loader along for a longer time (it's a weak ref!).

In Quarkus we don't care about it (please correct me if I'm wrong), because our class loader will be around for the whole application's life, meaning that we have no business in disconnecting its life duration from that of the Thread referencing it (transitively, via the `ThreadLocal`).
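The semantic difference between the two calls can be shown with a tiny stdlib-only demo: `set(null)` keeps the `ThreadLocalMap` entry alive (the value is merely nulled out), while `remove()` drops the entry, so a subsequent `get()` re-runs the initializer. A minimal sketch:

```java
// Demonstrates that set(null) keeps the map entry while remove() drops it,
// observable via how many times the initializer runs.
public class RemoveVsSetNull {
    static int initCalls = 0;
    static final ThreadLocal<String> TL = ThreadLocal.withInitial(() -> {
        initCalls++;
        return "init";
    });

    static String demo() {
        TL.get();            // first access creates the map entry (initCalls == 1)
        TL.set(null);        // entry survives, its value is just null
        String afterSetNull = String.valueOf(TL.get()); // "null": no re-initialization
        TL.remove();         // entry is gone
        String afterRemove = TL.get();                  // "init": initializer ran again
        return afterSetNull + "," + afterRemove + "," + initCalls;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "null,init,2"
    }
}
```

This is exactly why `set(null)` is only safe when keeping the (stale) entry around is acceptable, i.e. when the `ThreadLocal`'s class loader lives as long as the thread.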
The same reasoning can still hold for non-I/O threads too, to be honest (so my previous statement seems inaccurate). The sole difference between I/O threads and blocking ones is that the former will last for the whole application duration as well (like the class loader, and likely the context-propagation usage too), hence the only impact is the presence of such a `ThreadLocal` map entry, i.e. a negligible memory-footprint concern (a few bytes).
What's dangerous, as mentioned in the issue I've referenced, is when a class loader's lifetime isn't the same as that of the application it belongs to, which can happen on WildFly, for example.
OK, so it could be an option. But the problem is that most `ThreadContextProvider` implementations are third-party plugins, so they'd also have to be updated to stop clearing the `ThreadLocal`.
Frankly I think this all should go via the new "Storage" API, but I don't have the time to drive this :(
Why not use a system property to decide between the two behaviours? This would let advanced users who know their own class loader communicate their (static, constant) preference without affecting the default behaviour (using `remove`).
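The system-property idea above could look something like the following sketch; the property name `context.cheap-tl-clear` is purely hypothetical, not an existing flag in any library:

```java
// Sketch of a system-property-driven clear strategy. The property name is a
// made-up placeholder; the default (property unset) stays the safe remove().
public class TlClearStrategy {
    // Read once at class init: a static constant the JIT can fold away.
    static final boolean CHEAP_CLEAR = Boolean.getBoolean("context.cheap-tl-clear");

    static <T> void clear(ThreadLocal<T> tl) {
        if (CHEAP_CLEAR) {
            tl.set(null); // cheap: fine when the class loader lives as long as the app
        } else {
            tl.remove();  // safe default: required when class loaders can be reloaded
        }
    }

    public static void main(String[] args) {
        ThreadLocal<Integer> tl = ThreadLocal.withInitial(() -> 7);
        tl.set(42);
        clear(tl);
        // With the property unset, clear() falls back to remove(), so the
        // initializer runs again on the next get().
        System.out.println(tl.get()); // prints "7"
    }
}
```

Because the flag is a static final, the dead branch gets eliminated by the JIT, so the check itself costs nothing at runtime.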
> In Quarkus we don't care about it (please correct me if I'm wrong), because our class loader will be around for the whole application's life, meaning that we have no business in disconnecting its life duration from that of the Thread referencing it (transitively, via the `ThreadLocal`).
Actually, this is not the case in dev mode, where user code is loaded via a ClassLoader that is dropped when the application is reloaded. Same for tests, actually.
That means that when compiled in dev/test "mode" we have to disable such a "feature" and keep using `ThreadLocal::remove`, right @geoand?
> user code is loaded via a ClassLoader that is dropped when the application is reloaded
Beware: it just means that the weak references in the thread's thread-local map would still follow weak-reference GC rules and lifecycle, i.e. these are just "temporary" leaks, meaning they will be around longer than necessary and will require both more heap and more GC activity (likely full GCs) before being cleaned up.
Please let me know if you see other options to achieve this.
> That means that when compiled in dev/test "mode" we have to disable such a "feature" and keep using `ThreadLocal::remove`, right @geoand?
Yes
One note @FroMage: it seems that Netty's `FastThreadLocal` does not make use of weak refs on the remove path; I'm evaluating whether it's "good enough" for our use case.
@geoand too -> meaning we could just use the "custom thread local" feature of this repo in Quarkus and use fast thread locals from there, in dev/test/prod scenarios alike (they should just work).
This confuses me: don't we already use `FastThreadLocal` as a result of using Vert.x?
@geoand not for this specific one, as far as I can tell.
I've performed a micro-benchmark to understand the other implications of NOT using fast thread locals for `remove`, using the benchmark at https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/concurrent/ThreadLocalScalability.java with `-DthreadLocals=1` on JDK 19.

With `-t 1`:
```
Benchmark                             Mode  Cnt          Score        Error  Units
ThreadLocalScalability.setGetNullout  thrpt  10  109481209.044 ± 464386.775  ops/s
ThreadLocalScalability.setGetRemove   thrpt  10   14307019.589 ±  49020.979  ops/s
```
`-prof perfnorm` of the `remove` case:
```
Benchmark                                                  Mode  Cnt         Score        Error      Units
ThreadLocalScalability.setGetRemove                        thrpt  10  14063430.793 ± 265518.974      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt   2         0.443              clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt   2         2.255              insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt   2         0.553                   #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt   2         0.005                   #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt   2         0.031                   #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt   2         0.020                   #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt   2         0.059                   #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt   2       184.114                   #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt   2         0.009                   #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt   2       415.255                   #/op
```
With `-t 4`:
```
Benchmark                             Mode  Cnt          Score          Error  Units
ThreadLocalScalability.setGetNullout  thrpt  10  432425222.667 ±  7785949.059  ops/s
ThreadLocalScalability.setGetRemove   thrpt  10   38913911.834 ± 24150978.392  ops/s
```
`-prof perfnorm` of the `remove` case:
```
Benchmark                                                  Mode  Cnt         Score          Error      Units
ThreadLocalScalability.setGetRemove                        thrpt  10  38707328.433 ± 28134808.681      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt   2         0.808                clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt   2         1.558                insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt   2         1.603                     #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt   2         0.002                     #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt   2         0.734                     #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt   2         0.020                     #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt   2         0.655                     #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt   2       338.906                     #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt   2         0.453                     #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt   2       417.949                     #/op
```
What the numbers show is that `setGetNullout` has near-linear scalability (and perfnorm agrees: the IPC is the same for one and four threads), while `setGetRemove` doesn't: its IPC is halved and there's increased activity in both L1 and LLC load misses, likely the cause of the slowdown. This is likely due to the interaction with the reference queue; in short, `remove` is not just an order of magnitude slower, it can also cause scalability issues.
NOTE: in the bench I'm causing high contention over the same thread-local instance; I'll try with many instances shortly to see if that changes anything. Also, G1GC seems to be the most affected GC here, while ParallelGC showed a much smaller impact.
Going to check what happens with fast thread locals shortly.
That's a pretty big performance impact!
In Quarkus prod-mode, AFAIR, we use `VertxThread` for the worker pool as well, so we should always be using Netty threads; thus it makes complete sense to try and use `FastThreadLocal`.
Ok @Sanne and @geoand, so the proper fixes are:

- use Netty's `FastThreadLocal` and keep calling `remove`: this would grant less peak performance but no scalability impact (still better perf than JDK ThreadLocal(s), and no weak refs!)
- let the close operation decide whether to `remove` or not (`set(null)` is just faster)

The two things could happen separately, but ideally we want both.
That sounds reasonable to me.
Awesome! OK, we could implement this as two different flags, but I do wonder if we're not better off injecting a custom Quarkus implementation of some SPI which applies both tricks (I understand such an SPI doesn't exist yet, so just wondering).
> I do wonder if we're not better off to inject a custom Quarkus implementation of some SPI which applies both tricks. (I understand such an SPI doesn't exist yet, so just wondering).
Yeah, I had the same thought, but I can't find where I wrote it :)
@franz1981 The original idea behind `StorageDeclaration` is for `ThreadContextProvider` implementations to simply declare their thread-local storages to SR-CP (see https://github.com/smallrye/smallrye-context-propagation/blob/main/testsuite/extra/src/main/java/io/smallrye/context/test/MyThreadContextProvider.java); then CP, instead of calling the providers, will simply call `set` on all the ThreadLocals and make context switching super fast.
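The gain from declaring storages up front can be sketched in a few lines: once all the ThreadLocals are known, a context switch is just a bulk `get()` to capture and a bulk `set()` to restore, with no per-provider calls. The names below are illustrative, not the real SR-CP API:

```java
// Illustrative-only sketch: a registry of declared ThreadLocals, with
// capture/restore as plain bulk get()/set() calls.
import java.util.ArrayList;
import java.util.List;

public class DeclaredStorages {
    static final List<ThreadLocal<Object>> DECLARED = new ArrayList<>();

    static Object[] capture() {
        Object[] values = new Object[DECLARED.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = DECLARED.get(i).get(); // snapshot the current context
        }
        return values;
    }

    static void restore(Object[] values) {
        for (int i = 0; i < values.length; i++) {
            DECLARED.get(i).set(values[i]); // bulk set(): no provider round-trips
        }
    }

    public static void main(String[] args) {
        ThreadLocal<Object> tx = new ThreadLocal<>();
        DECLARED.add(tx);
        tx.set("tx-1");
        Object[] snapshot = capture(); // captured context
        tx.set("tx-2");                // context changed in the meantime
        restore(snapshot);             // old context comes back
        System.out.println(tx.get()); // prints "tx-1"
    }
}
```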
Then, there's another thing designed to make thread locals super fast: instead of handing out `ThreadLocal` instances to the libs, we collect all declared `StorageDeclaration`s, create one field for each in custom threads, and override the Vert.x thread factory so that all Vert.x threads have those extra fields. See https://github.com/smallrye/smallrye-context-propagation/blob/main/storage/src/main/java/io/smallrye/context/storage/spi/StorageManager.java
So the idea is that, for example, you have a lib like narayana, which needs a `ThreadLocal` for the current transaction:

- instead of `ThreadLocal<Transaction> tl = new ThreadLocal<>()` you do `ThreadLocal<Transaction> tl = StorageManager.threadLocal(Declaration.class);` with `static class Declaration implements StorageDeclaration<Transaction>{}`
- the framework collects every `StorageDeclaration` and creates a subclass of `VertxThread`: `public class QuarkusVertxThread extends VertxThread { Transaction transaction; }`
- it overrides the `ThreadFactory` so that Vert.x creates only those threads
- `StorageManager.threadLocal` then hands out a `ThreadLocal` subtype:
```java
class TransactionThreadLocal extends ThreadLocal<Transaction> {
    public void set(Transaction tx) {
        Thread t = Thread.currentThread();
        if (t instanceof QuarkusVertxThread) {
            ((QuarkusVertxThread) t).transaction = tx;
        } else {
            super.set(tx);
        }
    }
    // … etc
}
```
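For illustration, that pattern can be made runnable stdlib-only: `QuarkusThread` below stands in for `QuarkusVertxThread` (no Vert.x dependency here), a matching `get()` override is added, and foreign threads fall back to the regular `ThreadLocal` path:

```java
// Runnable sketch of the field-backed thread-local pattern: on a cooperating
// thread, set/get are plain field accesses instead of ThreadLocalMap lookups.
public class FieldBackedTl {
    static class QuarkusThread extends Thread {
        Object transaction; // one dedicated field per declared storage
        QuarkusThread(Runnable r) { super(r); }
    }

    static class TransactionThreadLocal extends ThreadLocal<Object> {
        @Override public void set(Object tx) {
            Thread t = Thread.currentThread();
            if (t instanceof QuarkusThread) {
                ((QuarkusThread) t).transaction = tx; // plain field write, no map
            } else {
                super.set(tx); // fallback for foreign threads
            }
        }
        @Override public Object get() {
            Thread t = Thread.currentThread();
            if (t instanceof QuarkusThread) {
                return ((QuarkusThread) t).transaction;
            }
            return super.get();
        }
    }

    static String demo() {
        TransactionThreadLocal tl = new TransactionThreadLocal();
        String[] seen = new String[1];
        Thread t = new QuarkusThread(() -> {
            tl.set("tx-42");
            seen[0] = (String) tl.get(); // served from the field, not the map
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "tx-42"
    }
}
```

Note how "clearing" such a storage is just nulling a field, which is exactly the cheap path this whole thread is after.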
All this was already implemented years ago in branches, but nobody had time to test and benchmark it. I still don't have time, but I'd be happy to help you along :)
See:
Let me know if you're interested.
Thanks @FroMage, I will go through your comment to check whether it can solve this (avoiding `remove` and using the faster `set(null)` for us). I wasn't fully convinced by the noise in the micro-benchmark, and I have involved the JDK team to review what I believe I've noticed about `remove`. Anyway, given that `remove` performs atomic operations and recreates the entry's weak reference each time, it's still valid to move to the cheaper `set(null)` when possible, i.e. with the prod class loader on Quarkus, regardless of which thread-local impl we choose.
I will likely experiment with a patch as you have suggested in the next few weeks; it is indeed the best solution perf-wise, although it requires a few code changes.
At the same time, Netty's fast thread local is very similar but less invasive: differently from Java ThreadLocals, it performs a simple array lookup in the `FastThread` instance. And, since it leverages an already existing dependency rather than introducing an "additional" one (Netty is already there), it doesn't involve many changes in Quarkus (a one-liner? Maybe 2-3). Performing a micro-benchmark to compare it against a single-field-lookup solution should be trivial; I will do it next week.
I will try to schedule a call next week so we can decide together, given the pros/cons of each solution.
I've spotted this in a few reactive benchmarks using Quarkus that don't disable context propagation (fairly common for users who don't change the defaults, I believe): from what I can see there (and the code confirms it), the behaviour of `SmallRyeThreadContext` on close is to call the ThreadLocal's `remove`, in order to prevent leaks.

Sadly, `ThreadLocal::remove` has a huge performance impact due to https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L570, which ends up calling `Reference::clear` via https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L368.

This can be seen in a flamegraph (in violet) and, at a deeper level, in the interaction with the runtime to coordinate an async cleanup (it's a weak ref :"/ sob).

Given that such cost happens on the I/O thread, it can be pretty severe, and it would be better to save it.

In order to improve this, it would be great to be able to configure `SmallRyeThreadContext` via a lambda (once or more times), allowing an external framework to perform some check to steer the close operation to either the much cheaper `set(null)` (which doesn't remove the entry) or the full-fat `remove`, as it is now. In the Quarkus use case, I think that if such a lambda can query whether the context of execution is within an I/O event-loop thread (not just on a Vert.x context!), there won't be any need to remove the thread local entirely, because I/O threads will be alive for the whole application lifetime.
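The lambda-configurable close proposed here could look roughly like the sketch below. None of these names exist in SmallRye today; this is only an illustration of the proposal, with the event-loop check faked by a plain predicate:

```java
// Hypothetical sketch of the proposed hook: a framework-provided predicate
// decides, at close time, whether the cheap set(null) is safe or a full
// remove() is needed.
import java.util.function.BooleanSupplier;

public class GuidedClose {
    private final BooleanSupplier cheapClearIsSafe;

    GuidedClose(BooleanSupplier cheapClearIsSafe) {
        this.cheapClearIsSafe = cheapClearIsSafe;
    }

    <T> void close(ThreadLocal<T> tl) {
        if (cheapClearIsSafe.getAsBoolean()) {
            tl.set(null); // e.g. on an I/O event-loop thread that lives forever
        } else {
            tl.remove();  // e.g. on a short-lived or reloadable-class-loader thread
        }
    }

    public static void main(String[] args) {
        ThreadLocal<Integer> tl = ThreadLocal.withInitial(() -> -1);
        // A real Quarkus check would be "is this an event-loop thread?"; here
        // a predicate that always answers "yes" stands in for it.
        GuidedClose onEventLoop = new GuidedClose(() -> true);
        tl.set(42);
        onEventLoop.close(tl);
        System.out.println(tl.get()); // prints "null": entry kept, value cleared
    }
}
```

The key property is that the library keeps its safe `remove` default, while a framework that controls the thread lifecycle can opt in to the cheap path per call site.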