Open fwbrasil opened 5 years ago
@tkrodriguez what do you think about this?
@dougxc could you also share your thoughts on this proposal?
I think trying to improve interface dispatch is a good idea. It would be very valuable to have a collection of JMH microbenchmarks that illustrate the Scala patterns we think are interesting. At a rough glance your suggestions seem like reasonable places to start, but I think we also might want to investigate whether improvements on the HotSpot side are required or helpful in doing a better job here. CHA, for instance, throws up its hands with default methods (https://bugs.openjdk.java.net/browse/JDK-8036580), though I don't know if that matters for Scala.
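A minimal sketch of the kind of microbenchmark that could capture the pattern, assuming sbt-jmh and a handful of distinct `Function1` implementations (the names and shapes here are illustrative only, not taken from any real workload):

```scala
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class MegamorphicFunction1Bench {
  // Several distinct Function1 classes so the apply call below stays
  // megamorphic and goes through the interface dispatch path.
  val fs: Array[Int => Int] = Array(_ + 1, _ * 2, _ - 3, x => x ^ 7)

  @Benchmark
  def dispatch(): Int = {
    var acc = 1
    var i = 0
    while (i < fs.length) {
      acc += fs(i)(acc) // invokeinterface on Function1.apply
      i += 1
    }
    acc
  }
}
```

Variants with 1, 2, 4 and 8 receiver classes would show where the dispatch cost starts to dominate.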
We might also want to improve JVMCI support for queries around the type hierarchy or in some way expand CHA to cover some of these cases. It might be the case that we can implement everything we need on the compiler side but I think we'll just have to see.
Regarding the typeswitch improvements, I agree they might be necessary if we end up with really complex typeswitches. We'll have to be careful in the backend code generation to track them all the way through as JavaConstants and not just as longs. It's also only appropriate for runtimes where the metadata is non-moving. That's true of HotSpot and I think SVM but we probably need to abstract that appropriately.
So basically it seems like a good piece of work to pursue. Having a nice corpus of microbenchmarks to drive it all and then proceeding from that seems like the best thing to me.
@tkrodriguez thank you for the feedback! :)
The optimizations I've been working on are based on profiling sessions of a few of our Scala services at Twitter. The current bottleneck is the itable stub, as shown in the issue description, so it seems like a good subject for optimization. I've also been using JMH benchmarks to reproduce the performance issues and validate the optimizations.
By the way, I wonder if the CHA issue you mentioned could be related to the performance regression we see with Scala 2.12, since it's the first Scala version that emits bytecode with Java 8 features like default methods. I'll look into it.
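As a tiny (hypothetical) illustration of the 2.12 change: a concrete method on a trait is now emitted as a Java 8 default method on the interface, whereas 2.11 emitted it as a static method on a synthetic helper class plus a forwarder in each implementor, so 2.12 is where JDK-8036580-style CHA limitations could start to apply:

```scala
// Under Scala 2.12 the body of `size` becomes a default method on the Batch
// interface; under 2.11 it lived in a synthetic Batch$class helper and each
// implementor got a forwarder. Whether this actually changes what CHA can
// prove for our call sites is what I want to verify.
trait Batch[A] {
  def items: List[A]
  def size: Int = items.length // concrete trait method -> default method in 2.12
}

final class SingleBatch(val items: List[Int]) extends Batch[Int]
```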
@tkrodriguez I'm looking into how to determine the code size of each switch strategy. I considered creating a new `masm` instance, emitting the code, and then discarding it, but that doesn't seem feasible because a few methods mutate the LIR builder and the labels. I think I'll change `EffortClosure` to include an estimate of the code size, but it might not be precise for all platforms and might become wrong if someone changes the emitted code but forgets to update the estimates. Any suggestions?
Sorry this was sitting unsent in a browser window
Maybe the ideal is to have the platform-dependent emitter support an estimation mode, so the estimation and the code generation are co-located. I'm not sure we have to get too fancy, but I'm also having trouble imagining what the effects of the current policy are. I'd expect/hope the hashing strategy is competitive with a binary search strategy.
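Roughly what I mean by co-locating them, as a purely illustrative Scala sketch (the names are made up, this is not the existing SwitchStrategy/EffortClosure API): each strategy's emitter carries its own size estimate, so changing the emitted code without updating the estimate is at least visible in one place.

```scala
// Hypothetical shape only: the size estimate lives next to the emission code
// for each strategy, so the two can't silently drift apart.
trait SwitchEmitter {
  def emit(keys: Array[Long]): Unit
  def estimatedSizeInBytes(keys: Array[Long]): Int
}

final class SequentialSwitchEmitter(bytesPerCompareAndBranch: Int) extends SwitchEmitter {
  def emit(keys: Array[Long]): Unit = () // platform-specific codegen elided
  def estimatedSizeInBytes(keys: Array[Long]): Int =
    keys.length * bytesPerCompareAndBranch
}

object StrategySelection {
  // Pick whichever strategy claims the smallest footprint for these keys.
  def cheapest(emitters: Seq[SwitchEmitter], keys: Array[Long]): SwitchEmitter =
    emitters.minBy(_.estimatedSizeInBytes(keys))
}
```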
I've been working on prototypes to optimize interface dispatches in Graal. I'm planning to start working on pull requests but would like to get your feedback on the overall approach before proceeding.
Problem
Scala codebases usually define computations using composable APIs that build a graph to be later executed by an interpreter. These APIs include abstractions like Scala's and Twitter's `Future`s for asynchronous computations, `Stitch` and `Clump` for batching of remote asynchronous calls, Monix's `Task` and ZIO's `IO` for defining pure computations, and many other monad-like APIs.

These abstractions introduce an execution behavior usually not present in traditional Java applications. Given that the definition of the computation graph involves several method invocations, the interpreter typically doesn't get inlined with the definition of the computation, which produces a high number of `invokeinterface` calls. At interpretation time, there isn't relevant profiling information to determine the concrete implementations of the transformations, so they don't get optimized.

For instance, this is a screenshot of a VTune profiling session of an application that uses `Stitch`:
The itable stub is the mechanism that HotSpot uses to dispatch `invokeinterface` calls and, as we can observe, most of the CPU usage goes into executing these stubs. Better inlining might be able to reduce the number of `invokeinterface` calls but, given the nature of these codebases, they'll always produce a high number of interface dispatches.
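To make the pattern concrete, here is a small, self-contained Scala sketch (not taken from the profiled service) of the style of code involved: each combinator registers a `Function1` that the `Future` machinery later invokes through `invokeinterface`, far from the call site where it was created.

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

object GraphExample {
  // Each map/flatMap below only records a Function1 in the computation graph;
  // the callbacks run later inside the Future interpreter, so the calls to
  // Function1.apply are not inlined with this definition and their receivers
  // look megamorphic to the JIT.
  def pipeline(n: Int): Future[Int] =
    Future(n)
      .map(_ + 1)
      .flatMap(x => Future(x * 2))
      .map(_.toString)
      .map(_.length)
}
```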
Solution

I've been exploring alternatives to optimize interface dispatches in Graal. The idea is to have a set of strategies that leverage profiling information and common patterns to inline `invokeinterface` stubs. Graal would evaluate these strategies and choose the cheapest one.

Strategies
I've identified a few strategies so far:
Common offset
If the observed implementations of an interface resolve to the same method offset, it's possible to inline the stub at the common offset and introduce a guard to deoptimize if a new implementation is observed.
Common superclass
If the observed implementations extend a common superclass that contains the method, it's possible to dispatch through the superclass method and introduce a guard to deoptimize if the object doesn't extend the superclass. For instance, all `Function1` implementations generated by the Scala compiler have an `AbstractFunction1` superclass. The same happens for a number of interfaces in the Scala collections.
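For reference, `scala.runtime.AbstractFunction1` is the shared superclass in question; a hand-written equivalent of what the compiler generates looks roughly like this (illustrative only):

```scala
import scala.runtime.AbstractFunction1

object CommonSuperclassExample {
  // A call site whose profile only ever sees receivers extending
  // AbstractFunction1 could dispatch through the superclass's virtual method,
  // guarded by a class check, instead of going through the Function1 itable.
  final class Increment extends AbstractFunction1[Int, Int] {
    def apply(x: Int): Int = x + 1
  }

  val inc: Int => Int = new Increment
}
```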
Single method

If the observed implementations have a single method (besides the ones from `Object`), it's possible to dispatch directly at the offset and introduce a guard to deoptimize if the itable is larger than expected.
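A hypothetical example of the shape this targets: implementations that define exactly one method beyond `Object`'s, such as the function-like classes below, so the dispatch table for the interface has a single interesting entry.

```scala
// Inc and Halve define nothing besides apply, so the interface portion of
// their dispatch tables is as small as it gets; the guard would deoptimize
// if a receiver with a larger itable ever shows up.
trait Transform {
  def apply(x: Int): Int
}
final class Inc   extends Transform { def apply(x: Int): Int = x + 1 }
final class Halve extends Transform { def apply(x: Int): Int = x / 2 }
```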
Caching

At JIT compilation time, it's possible to determine the offsets of each observed implementation and create a `TypeSwitch` that does the dispatch directly at the specific offsets. This is a simple strategy but might be cheaper in some cases, especially since HotSpot doesn't share itable stubs on most architectures and thus doesn't leverage branch prediction well.
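In Scala terms, the emitted code would behave roughly like the (hypothetical) sketch below, where the receiver classes come from the profile and the default branch deoptimizes instead of falling back to the generic stub:

```scala
object CachingSketch {
  final class MapFn extends (Int => Int) { def apply(x: Int): Int = x * 2 }
  final class AddFn extends (Int => Int) { def apply(x: Int): Int = x + 1 }

  // Stand-in for the deoptimization the real guard would trigger.
  private def deoptimize(): Nothing = sys.error("unexpected receiver class")

  // Type switch over the classes seen in the profile, each arm calling the
  // already-resolved target directly instead of going through the itable.
  def applyCached(f: Int => Int, x: Int): Int = f match {
    case g: MapFn => g.apply(x)
    case g: AddFn => g.apply(x)
    case _        => deoptimize()
  }
}
```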
Better TypeSwitch

Graal currently compiles `TypeSwitch` nodes using linear search (the sequential switch strategy). I've been exploring an approach to compile `TypeSwitch`es based on the addresses of the classes. That would allow Graal to evaluate other switch strategies and choose the cheapest one. Given that some of the `invokeinterface` strategies would use `TypeSwitch` nodes, optimizing it is important to make them effective.

Additionally, I've been working on a new switch compilation strategy based on hashing that would work well for sparse keys (which is the case for the class addresses).
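To give an idea of the hashing approach, here is a small, self-contained sketch (plain Scala, not the actual LIR implementation): find a multiply-and-shift hash that maps every key to a distinct slot of a small power-of-two table, so a lookup costs one multiply, one shift, one load and one compare instead of a linear or binary search. The brute-force multiplier search and the assumption that keys are non-zero (true for class addresses) are simplifications of this sketch.

```scala
import scala.util.Random

final class HashedSwitchSketch(keys: Array[Long], targets: Array[Int], default: Int) {
  require(keys.nonEmpty && keys.length == targets.length)

  private def slot(key: Long, mult: Long, bits: Int): Int =
    ((key * mult) >>> (64 - bits)).toInt

  // Find a (multiplier, table-size) pair with no collisions; a real
  // implementation would do this once at compile time and fall back to
  // another strategy if nothing cheap is found.
  private val (mult, bits) = (for {
    bits <- Iterator.range(1, 31) if (1 << bits) >= keys.length
    mult <- Iterator.fill(64)(Random.nextLong() | 1L)
    if keys.map(k => slot(k, mult, bits)).distinct.length == keys.length
  } yield (mult, bits)).next()

  private val keyTable    = new Array[Long](1 << bits) // 0 marks an empty slot
  private val targetTable = Array.fill(1 << bits)(default)
  for ((k, t) <- keys.zip(targets)) {
    keyTable(slot(k, mult, bits)) = k
    targetTable(slot(k, mult, bits)) = t
  }

  // The emitted code would be the moral equivalent of this method.
  def lookup(key: Long): Int = {
    val i = slot(key, mult, bits)
    if (keyTable(i) == key) targetTable(i) else default
  }
}
```

For a `TypeSwitch` the targets would be the branch indexes and the default would lead to the fallback or deoptimization path.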
Development phases
If you're ok with the overall approach, I'm planning on submitting multiple pull requests:
1. `TypeSwitch` using class addresses so it can leverage the different switch strategies
2. `invokeinterface` strategies, initially with two simple strategies: Caching and Fallback (which falls back to the itable stubs)