Ladicek commented 2 years ago

Description

Currently, to implement InvocationContext.proceed() for interceptors, ArC generates for each intercepted method one forwarding method in the _Subclass and one anonymous class implementing Function<InvocationContext, Object>. That anonymous class obtains the argument values from InvocationContext.getParameters() and calls the forwarding method.
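As a rough illustration of the current strategy, here is a hand-written Java approximation of what is generated per intercepted method. This is a sketch, not ArC's actual output: `GreetingService`, `greetForward` and `greetForwarder` are hypothetical names, and `InvocationContext` is a one-method stand-in for the real interceptor API so that the snippet compiles on its own.

```java
import java.util.function.Function;

// Hypothetical stand-in for the real InvocationContext, reduced to the
// single method this sketch needs.
interface InvocationContext {
    Object[] getParameters();
}

// Hypothetical intercepted bean.
class GreetingService {
    String greet(String name) {
        return "Hello, " + name;
    }
}

// Hand-written approximation of a generated _Subclass.
class GreetingService_Subclass extends GreetingService {

    // Forwarding method: gives code outside this class a way to reach
    // the superclass implementation.
    String greetForward(String name) {
        return super.greet(name);
    }

    // Anonymous class (one extra class per intercepted method): reads the
    // arguments from the context and calls the forwarding method.
    Function<InvocationContext, Object> greetForwarder() {
        return new Function<InvocationContext, Object>() {
            @Override
            public Object apply(InvocationContext ctx) {
                Object[] params = ctx.getParameters();
                return greetForward((String) params[0]);
            }
        };
    }
}
```

The function returned by `greetForwarder()` plays the role of what `InvocationContext.proceed()` eventually calls at the end of the interceptor chain.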

It might be beneficial to use lambdas instead. That would allow getting rid of the forwarding method (a lambda can directly invoke the superclass method) and an extra class (the lambda would implement Function itself).
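A source-level sketch of the lambda variant, under the same caveats: `PingService` and `pingForwarder` are hypothetical names and `InvocationContext` is a one-method stand-in for the real interface. Note that in plain Java source this lambda implicitly captures `this` in order to call `super.ping(...)`; the bytecode that ArC would generate may be shaped differently.

```java
import java.util.function.Function;

// Hypothetical stand-in for the real InvocationContext.
interface InvocationContext {
    Object[] getParameters();
}

// Hypothetical intercepted bean.
class PingService {
    String ping(String message) {
        return "pong: " + message;
    }
}

// Hand-written approximation of a _Subclass that uses a lambda.
class PingService_Subclass extends PingService {

    // No forwarding method and no anonymous class: the lambda implements
    // Function itself and invokes the superclass method directly.
    Function<InvocationContext, Object> pingForwarder() {
        return ctx -> super.ping((String) ctx.getParameters()[0]);
    }
}
```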

Implementation ideas

This requires adding support for creating lambdas to Gizmo. That's relatively straightforward when support for capturing variables is not required, which is the case here.
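For context, on the JVM a lambda is created through an `invokedynamic` call site bootstrapped by `java.lang.invoke.LambdaMetafactory`, so a bytecode generator has to emit that call site. The sketch below drives the same bootstrap method by hand for a non-capturing lambda. It only illustrates the JDK mechanism and says nothing about how Gizmo exposes it; `LambdaFactorySketch`, `shout` and `create` are made-up names.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.Function;

class LambdaFactorySketch {

    // Implementation method the lambda will delegate to (hypothetical).
    static Object shout(Object input) {
        return String.valueOf(input).toUpperCase();
    }

    @SuppressWarnings("unchecked")
    static Function<Object, Object> create() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType applyType = MethodType.methodType(Object.class, Object.class);
            MethodHandle impl = lookup.findStatic(LambdaFactorySketch.class, "shout", applyType);

            CallSite site = LambdaMetafactory.metafactory(
                    lookup,
                    "apply",                                // method of the functional interface
                    MethodType.methodType(Function.class),  // factory type: no captured arguments
                    applyType,                              // erased signature of Function.apply
                    impl,                                   // method the lambda body calls
                    applyType);                             // signature enforced at invocation time

            // The factory takes no arguments because nothing is captured.
            return (Function<Object, Object>) site.getTarget().invoke();
        } catch (Throwable t) {
            throw new IllegalStateException("Could not create lambda", t);
        }
    }
}
```

When variables are captured, they become parameters of the factory type (the third argument), which is the extra bookkeeping the non-capturing case avoids.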

quarkus-bot[bot] commented 2 years ago

/cc @manovotn, @matejvasek, @mkouba, @patriot1burke

Ladicek commented 2 years ago

I implemented this locally and here are the results.

I measured performance impact using the JMH benchmarks from: https://github.com/mkouba/arc-benchmarks/

I measured RSS impact using this one-off tool: https://github.com/Ladicek/arc-crazybeans

Benchmarks

Measuring

All measurements were done on my otherwise-idle desktop machine (running Ubuntu 22.10 with kernel Linux 5.19.0-23-generic; hardware-wise it's Ryzen 5950X with 64 GB of RAM) with the following tuning:

# select the "performance" CPU scaling governor
sudo cpupower frequency-set -g performance

# disable hyperthreading
echo off | sudo tee /sys/devices/system/cpu/smt/control

# disable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

(I know more tuning could/should be done, but I'm no expert...)

Performance (OpenJDK 11.0.16)

main branch (821984a884c75775b95333a54359f3fd54702b96):

Benchmark                                        Mode  Cnt     Score     Error  Units
InterceptorBenchmark.complex                    thrpt   25  7900.025 ±  30.325  ops/s
InterceptorBenchmark.simple                     thrpt   25  7686.898 ± 272.104  ops/s
SubclassInstantiationBenchmark.complexSubclass  thrpt   25  2174.085 ±  25.565  ops/s
SubclassInstantiationBenchmark.simpleSubclass   thrpt   25  9375.078 ± 538.066  ops/s

my branch (3708d490a96550231f91d26e19f1dd4a5f8a71c9):

Benchmark                                        Mode  Cnt     Score     Error  Units
InterceptorBenchmark.complex                    thrpt   25  7798.061 ±  35.600  ops/s
InterceptorBenchmark.simple                     thrpt   25  7646.186 ± 192.110  ops/s
SubclassInstantiationBenchmark.complexSubclass  thrpt   25  2134.797 ±  32.218  ops/s
SubclassInstantiationBenchmark.simpleSubclass   thrpt   25  8948.386 ± 634.214  ops/s
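
One quick way to read the two tables is to treat each Score ± Error as an interval and check whether the intervals for the two branches overlap. The helper below is a small sketch of that check using the numbers above; `IntervalOverlap` is an illustrative name, and an overlap check is only a rough heuristic, not a proper significance test.

```java
class IntervalOverlap {

    // True when [a - errA, a + errA] and [b - errB, b + errB] share a point.
    static boolean overlaps(double a, double errA, double b, double errB) {
        return Math.abs(a - b) <= errA + errB;
    }

    public static void main(String[] args) {
        String[] names = {
            "InterceptorBenchmark.complex",
            "InterceptorBenchmark.simple",
            "SubclassInstantiationBenchmark.complexSubclass",
            "SubclassInstantiationBenchmark.simpleSubclass"
        };
        // Columns: main score, main error, branch score, branch error (ops/s).
        double[][] rows = {
            {7900.025,  30.325, 7798.061,  35.600},
            {7686.898, 272.104, 7646.186, 192.110},
            {2174.085,  25.565, 2134.797,  32.218},
            {9375.078, 538.066, 8948.386, 634.214}
        };
        for (int i = 0; i < rows.length; i++) {
            double[] r = rows[i];
            System.out.printf("%-48s overlap=%b%n", names[i], overlaps(r[0], r[1], r[2], r[3]));
        }
    }
}
```

By this check only `InterceptorBenchmark.complex` has non-overlapping intervals; the other three differences are within the combined error.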

JVM RSS (OpenJDK 11.0.16)

main branch: 106333.200 ± 683.611 kB (median 106148 kB, p99 108232 kB)

my branch: 115900.200 ± 760.293 kB (median 115772 kB, p99 117680 kB)

Native RSS (GraalVM 22.3.0 Java 11 CE)

main branch: 33580.440 ± 10.635 kB (median 33576 kB, p99 33624 kB)

my branch: 33727.600 ± 12.668 kB (median 33724 kB, p99 33772 kB)

Native binary size (GraalVM 22.3.0 Java 11 CE)

main branch: 35710712 B

my branch: 35739384 B
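
To put the memory and size numbers in proportion, the relative change can be computed from the reported means. A minimal sketch (`RelativeChange` and `percentChange` are illustrative names):

```java
class RelativeChange {

    // Relative change from 'before' to 'after', in percent.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Means reported above: main branch first, lambda branch second.
        System.out.printf("JVM RSS:     %+.2f%%%n", percentChange(106333.200, 115900.200));
        System.out.printf("Native RSS:  %+.2f%%%n", percentChange(33580.440, 33727.600));
        System.out.printf("Binary size: %+.2f%%%n", percentChange(35710712, 35739384));
    }
}
```

This works out to roughly +9% for JVM RSS, and well under +1% for both native RSS and native binary size.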

Conclusion

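There's an unwritten rule in Quarkus that runtime code should not contain lambdas because they are memory-hungry. This experiment confirms that, especially on a regular JVM, where RSS grew by about 9% (native RSS and native binary size grew by well under 1%). Throughput did not improve either: InterceptorBenchmark.complex dropped by about 1.3%, and the Score ± Error intervals of the other three benchmarks overlap. Overall, the existing strategy is better, and moving to lambdas makes things slightly worse.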

mkouba commented 2 years ago

Thanks for this interesting experiment! I know the results are a bit frustrating but in the words of a great (yet imaginary) Cimrman: "Somebody had to probe this dead end of human knowledge and announce to the world: Not this way, friends!"

ArC: consider using lambdas to call intercepted methods #28956