Avoid ClassLoader identityHashCode and fix concurrency of ctx manager lookup table

franz1981 commented 5 months ago

It replaces https://github.com/smallrye/smallrye-context-propagation/pull/399

Quoting the @Sanne comments at https://github.com/smallrye/smallrye-context-propagation/pull/399#issuecomment-1370764330

we could optimize for very few ? e.g. I believe Quarkus in the most common production configuration has two - seems fit for a linear scan. I don't know about other runtimes using this library but I'd expect them to run also with very few classloaders; e.g. WildFly has many classloaders but I expect only a handful would be visible to this - theory that would need to be verified though.

If this approach is pushing too far the assumption about being very few ...

I could consider adding property to let the application decide if uses this one, or the previous implementation (which I need to correctly pack in a separate class)
I can make it super dynamic and expose instead, a sensible cutoff value (or none, by deciding ourselves for the users!), which can transform the array map into a proper one (likely need something thread-safe :/)

Sanne commented 5 months ago

Looks great!

franz1981 commented 5 months ago

@Sanne I've prepared a microbenchmark for this

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

@State(Scope.Thread)
@Warmup(iterations = 10, time = 400, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(2)
public class IndentityLookup {

    private SingleWriterCopyOnWriteArrayIdentityMap<ClassLoader, Object> arrayMap;
    private Map<ClassLoader, Object> hashMap;
    @Param({ "2", "10"})
    public int size;
    private ClassLoader missingClassLoader;
    private ClassLoader firstClassLoader;
    private ClassLoader lastClassLoader;

    @Setup
    public void init() {
        arrayMap = new SingleWriterCopyOnWriteArrayIdentityMap<>();
        hashMap = new HashMap<>();

        // create N different instances of ClassLoader
        missingClassLoader = new ClassLoader() { };

        for (int i = 0; i < size; i++) {
            var classLoader = new ClassLoader() { };
            if (i == 0) {
                firstClassLoader = classLoader;
            }
            if (i == size - 1) {
                lastClassLoader = classLoader;
            }
            arrayMap.put(classLoader, new Object());
            hashMap.put(classLoader, new Object());
        }
    }

    @Benchmark
    public Object arrayMapGetMissing() {
        return arrayMap.get(missingClassLoader);
    }

    @Benchmark
    public Object hashMapGetMissing() {
        return hashMap.get(missingClassLoader);
    }

    @Benchmark
    public Object arrayMapGetFirst() {
        return arrayMap.get(firstClassLoader);
    }

    @Benchmark
    public Object hashMapGetFirst() {
        return hashMap.get(firstClassLoader);
    }

    @Benchmark
    public Object arrayMapGetLast() {
        return arrayMap.get(lastClassLoader);
    }

    @Benchmark
    public Object hashMapGetLast() {
        return hashMap.get(lastClassLoader);
    }
}

These are the results

Benchmark                           (size)   Mode  Cnt     Score    Error   Units
IndentityLookup.arrayMapGetFirst         2  thrpt   10  1436.097 ± 33.765  ops/us
IndentityLookup.arrayMapGetFirst        10  thrpt   10  1440.117 ± 37.064  ops/us
IndentityLookup.arrayMapGetLast          2  thrpt   10   666.662 ±  3.000  ops/us
IndentityLookup.arrayMapGetLast         10  thrpt   10   332.156 ±  5.896  ops/us
IndentityLookup.arrayMapGetMissing       2  thrpt   10   640.518 ±  7.431  ops/us
IndentityLookup.arrayMapGetMissing      10  thrpt   10   301.235 ±  2.458  ops/us
IndentityLookup.hashMapGetFirst          2  thrpt   10   651.625 ±  5.688  ops/us
IndentityLookup.hashMapGetFirst         10  thrpt   10   647.957 ±  6.822  ops/us
IndentityLookup.hashMapGetLast           2  thrpt   10   652.398 ±  7.165  ops/us
IndentityLookup.hashMapGetLast          10  thrpt   10   652.113 ±  1.399  ops/us
IndentityLookup.hashMapGetMissing        2  thrpt   10   865.340 ±  8.298  ops/us
IndentityLookup.hashMapGetMissing       10  thrpt   10   870.814 ±  6.873  ops/us

The worst case scenario (if we ignore, the missing case, which shouldn't happen, right @FroMage ?) is *GetLast and for such, with 2 class loaders, the performance vs HashMap is actually the same. With 10 class loaders, instead, the array map is even slower (!!!). I've yet to analyze the results, but as it is now I'm putting this in draft

franz1981 commented 5 months ago

mmm I'm not very happy, still, because my benchmark seems to not trigger the same behavior as the real code

which is very different from the real one

that's hitting https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/hotspot/share/prims/jvm.cpp#L611

Still investigating what's going on. -prof perfasm seems to agree that the hash code used is different: it seems related the HashMaps hash method which maybe have its type polluted; let's try differently...

In short, it could:

be the original problem a quirk of the profiler which was attached before the identity hash code is using the intensified version: I can easily verify it by removing the warmup od JMH and attach the profiler
be a real problem due to HashMap call chain till hash code which is not inlined AND has observed too many types there (but for the type profile pollution I would have expected a vtable_stub presence, which is not..)

franz1981 commented 5 months ago

@Sanne This is FUN

look at the benchmark at: https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/concurrent/IndentityLookup.java

these are the numbers, now fixed to match the actual behavior in the benchmark (and MAYBE is what you can observe in hibernate too (!!!!) - eg )

Benchmark                                   (disableHashCodeIntrinsics)  (size)   Mode  Cnt     Score    Error   Units
IndentityLookup.arrayMapGetFirst                                   true       2  thrpt   10  1453.559 ± 15.492  ops/us
IndentityLookup.arrayMapGetFirst                                   true      10  thrpt   10  1485.773 ± 18.988  ops/us
IndentityLookup.arrayMapGetLast                                    true       2  thrpt   10   687.806 ± 29.120  ops/us
IndentityLookup.arrayMapGetLast                                    true      10  thrpt   10   337.043 ±  5.881  ops/us
IndentityLookup.arrayMapGetMissing                                 true       2  thrpt   10   652.220 ±  7.062  ops/us
IndentityLookup.arrayMapGetMissing                                 true      10  thrpt   10   342.773 ±  2.803  ops/us
IndentityLookup.classLoaderHashCode                                true       2  thrpt   10   109.970 ± 12.456  ops/us
IndentityLookup.classLoaderHashCode                                true      10  thrpt   10   108.434 ± 14.299  ops/us
IndentityLookup.hashMapGetFirst                                    true       2  thrpt   10   104.966 ±  7.004  ops/us
IndentityLookup.hashMapGetFirst                                    true      10  thrpt   10   107.667 ±  0.808  ops/us
IndentityLookup.hashMapGetLast                                     true       2  thrpt   10   108.373 ±  0.914  ops/us
IndentityLookup.hashMapGetLast                                     true      10  thrpt   10   108.948 ±  1.086  ops/us
IndentityLookup.hashMapGetMissing                                  true       2  thrpt   10   110.337 ±  0.735  ops/us
IndentityLookup.hashMapGetMissing                                  true      10  thrpt   10   109.889 ±  1.061  ops/us

for example now correctly produce

Screenshot from 2024-04-08 14-01-23

which matches what we observed in the real code.

I can undraft, but now ...

why we have synchronized vs ClassLoader(s) instances in Quarkus???

because if we won't, there is no need of this collection...

franz1981 commented 5 months ago

For hibernate too: Screenshot from 2024-04-08 14-17-47

given that identity hash map piggyback the same intrinsics

FroMage commented 5 months ago

why we have synchronized vs ClassLoader(s) instances in Quarkus???

I don't think this is needed in Quarkus, honestly. This is for WildFly and other application servers with one CL per application.

franz1981 commented 5 months ago

I don't understand your comment @FroMage : My point is that the performance issue I've discovered in Quarkus happen because some of the Classloader(s) stored in the map are used as objects to perform synchronization...and it seems weird to me.

FroMage commented 5 months ago

I'm saying that supporting one set-up per CL was added for Application Servers, not for Quarkus. In the case of Quarkus, we have a single ClassLoader and a single SmallRyeContextManager, so this entire map is not even useful for Quarkus

franz1981 commented 5 months ago

In the case of Quarkus, we have a single ClassLoader and a single SmallRyeContextManager, so this entire map is not even useful for Quarkus

I would say the opposite actually: look at the numbers, I can show what happen with a single one too.

This is with size = 1:

Benchmark                                   (disableHashCodeIntrinsics)  (size)   Mode  Cnt     Score    Error   Units
IndentityLookup.arrayMapGetFirst                                   true       1  thrpt   10  1474.012 ± 19.736  ops/us
IndentityLookup.arrayMapGetLast                                    true       1  thrpt   10  1477.383 ± 18.811  ops/us
IndentityLookup.arrayMapGetMissing                                 true       1  thrpt   10   636.047 ±  5.564  ops/us
IndentityLookup.hashMapGetFirst                                    true       1  thrpt   10   108.808 ±  1.781  ops/us
IndentityLookup.hashMapGetLast                                     true       1  thrpt   10   109.350 ±  0.794  ops/us
IndentityLookup.hashMapGetMissing                                  true       1  thrpt   10   110.651 ±  2.243  ops/us
IndentityLookup.classLoaderHashCode                                true       1  thrpt   10   112.808 ± 12.129  ops/us

The hashMap* variants are ~14X at worst slower throughput (we usually don't have the missing case), and fixing it will make such lookup a no cost.

FroMage commented 5 months ago

I am not questioning your finding that this is expensive. I'm saying in Quarkus we don't even need to do this expensive thing, so we should just avoid doing it.

Sanne commented 5 months ago

@FroMage we do actually have multiple classloaders - but yes since they are very simple, that's in fact why I had suggested in the linked discussion that we could do something "simpler"

FroMage commented 5 months ago

@FroMage we do actually have multiple classloaders

Yeah, but not in the way it's intended for Application Servers where each app has one CL and they have different config for CP because they're all different applications.

Sanne commented 5 months ago

@FroMage looks at the data.. there's evidence that this is a problem, I've seen these same flames sparkling myself as well :)

Lol aren't we all just divining shamans..

Sanne commented 5 months ago

I am not questioning your finding that this is expensive. I'm saying in Quarkus we don't even need to do this expensive thing, so we should just avoid doing it.

Avoiding the lookup would be even better of course. But do you have a solution?

franz1981 commented 5 months ago

I'm now checking if i can retrieve this information using some JDK event or by modifying the JDK myself...

Right now is not nice I cannot see MonitorExit...so maybe is not capturing what I need:

Screenshot from 2024-04-08 15-02-36

on https://chriswhocodes.com/jfr_jdk17.html

FroMage commented 5 months ago

Avoiding the lookup would be even better of course. But do you have a solution?

I might, let me reopen this project after several years… Also, not what I had planned for this afternoon.

FroMage commented 5 months ago

I am swapping back memory of this from tape storage, let me tell you :-/

FroMage commented 5 months ago

Can you try this PR, @franz1981 ? https://github.com/quarkusio/quarkus/compare/main...FroMage:quarkus:cp-perf-manager?expand=1

Note that I'm wondering what sort of application calls these methods, but whatever.

Sanne commented 5 months ago

@FroMage hooo that looks so nice.. but you could have given this hint a year ago when we spotted this problem! :dagger: :cupid:

franz1981 commented 5 months ago

@FroMage these are the best solutions to me :) (removing code win too) Soooo...I've sent a run in a benchmrk, will take some, and hope will compile and I let you know, but if it does what I think is doing...it's done :)

no it didn't work...https://github.com/FroMage/quarkus/actions/runs/8600808107/job/23566549448

FroMage commented 5 months ago

Yeah, silly me, forgetting formatting :) Fixed now.

franz1981 commented 5 months ago

It seems that it has run somehow - sadly i am using a CI job for performance testing. If it was on my laptop I would have take care of doing it myself, while on the job is a PITA, because I have to update the scripts just for this run back/forth (and I don't like Jenkins, nor it likes me)

franz1981 commented 5 months ago

I have to double check but I can see 2 different ways to make it happens https://github.com/openjdk/jdk/blob/747582484ce89e16661ef917a89adb52f5adc2e6/src/hotspot/share/opto/library_call.cpp#L4577-L4592

FroMage commented 5 months ago

Tests did not pass. I'm looking into why

FroMage commented 5 months ago

It appears some code might be using it before it's set up. Which might be revealing of a problem we've always had.

FroMage commented 5 months ago

Mutiny, and FT, specifically.

franz1981 commented 5 months ago

You mean your change @FroMage ? Because I like it so much that I cannot see anymore a reason to have this pr opened; it is just a doc why we end up in another place, no?

FroMage commented 5 months ago

Yeah, my Quarkus fix. It doesn't run CI. But I should be able to fix that.

franz1981 commented 5 months ago

Let me close this than, so can resume any conversation elsewhere (if needed).

smallrye / smallrye-context-propagation

Avoid ClassLoader identityHashCode and fix concurrency of ctx manager lookup table #443

why we have synchronized vs ClassLoader(s) instances in Quarkus???