[Proposal] Invoke interface optimizations

fwbrasil commented 5 years ago

I've been working on prototypes to optimize interface dispatches in Graal. I'm planning to start working on pull requests but would like to get your feedback on the overall approach before proceeding.

Problem

Scala codebases usually define computations using composable APIs that build a graph to be later executed by an interpreter. These APIs include abstractions like Scala's and Twitter's Futures for asynchronous computations, Stitch and Clump for batching of remote asynchronous calls, Monix's Task and ZIO's IO for defining pure computations, and many other monad-like APIs.

These abstractions introduce an execution behavior usually not present in traditional Java applications. Given that the definition of the computation graph involves several method invocations, the interpreter typically doesn't get inlined with the definition of the computation, which produces a high number of invokeinterface calls. At interpretation time, there isn't relevant profiling information to determine the concrete implementations of the transformations, so they don't get optimized.

For instance, this is a screenshot of a VTune profiling session of an application that uses Stitch:

The itable stub is the mechanism that Hotspot uses to dispatch invokeinterface calls, and, as we can observe, most of the CPU usage goes into executing these stubs.

Better inlining might be able to reduce the number of invokeinterface calls but, given the nature of these codebases, they'll always produce a high number of interface dispatches.

Solution

I've been exploring alternatives to optimize interface dispatches in Graal. The idea is to have a set of strategies that leverage profiling information and common patterns to inline invokeinterface stubs. Graal would evaluate these strategies and choose the cheapest one.

Strategies

I've identified a few strategies so far:

Common offset

If the observed implementations of an interface resolve to the same method offset, it's possible to inline the stub at the common offset and introduce a guard to deoptimize if a new implementation is observed.

Common superclass

If the observed implementations extend a common super class that contains the method, it's possible to dispatch through the super class method and introduce a guard to deoptimize if the object doesn't extend the super class. For instance, all Function1 generated by the Scala compiler have an AbstractFunction1 superclass. The same happens for a number of interfaces in Scala collections.

Single method

If the observed implementations have a single method (besides the ones from Object), it's possible to dispatch directly at the offset and introduce a guard to deoptimize if the itable is larger than expected.

Caching

At JIT compilation time, it's possible to determine the offsets of each observed implementation and create a TypeSwitch that does the dispatch directly at the specific offsets. This is a simple strategy but might be cheaper for some cases, especially since Hotspost doesn't share itable stubs in most architectures and thus doesn't leverage branch prediction well.

Better `TypeSwitch`

Graal currently compiles TypeSwitch nodes using linear search (sequential switch strategy). I've been exploring an approach to compile TypeSwitches based on the addresses of the classes. That would allow Graal to evaluate other switch strategies and choose the cheapest one. Given that some of the invokeinterface strategies would use TypeSwitch nodes, optimizing it is important to make them effective.

Additionally, I've been working on a new switch compilation strategy based on hashing that would work well for sparse keys (which is the case for the class addresses).

Development phases

If you're ok with the overall approach, I'm planning on submitting multiple pull requests:

Introduce the new hashing strategy to compile switches
Implement a better heuristic to decide between size of the switch tables and cost of the strategies.
Make the switch strategies accept long values (they currently work only for integer) and compile TypeSwitch using class addresses so it can leverage the different switch strategies
Introduce the mechanism to choose between different invokeinterface strategies initially with two simple strategies: Caching and Fallback (that falls back to the itable stubs)
Create individual pull requests for each strategy

christhalinger commented 5 years ago

@tkrodriguez what do you think about this?

fwbrasil commented 5 years ago

@dougxc could you also share your thoughts on this proposal?

tkrodriguez commented 5 years ago

I think trying to improve interface dispatch seems like a good idea. I think it would be very valuable to have a collection of JMH microbenchmarks that illustrates the Scala patterns that we think are interesting. At rough glance your suggestions seem like reasonable places to start but I think we also might want to investigate whether improvements on the HotSpot side are required/helpful in doing a better job here. CHA for instance throws up its hands with default methods, https://bugs.openjdk.java.net/browse/JDK-8036580, though I don't know if that matters for Scala.

We might also want to improve JVMCI support for queries around the type hierarchy or in some way expand CHA to cover some of these cases. It might be the case that we can implement everything we need on the compiler side but I think we'll just have to see.

Regarding the typeswitch improvements, I agree they might be necessary if we end up with really complex typeswitches. We'll have to be careful in the backend code generation to track them all the way through as JavaConstants and not just as longs. It's also only appropriate for runtimes where the metadata is non-moving. That's true of HotSpot and I think SVM but we probably need to abstract that appropriately.

So basically it seems like a good piece of work to pursue. Having a nice corpus of microbencharks to drive it all and then proceeding from that seems like the best thing to me.

fwbrasil commented 5 years ago

@tkrodriguez thank you for the feedback! :)

The optimizations I've been working on are based on profiling sessions of a few of our Scala services at Twitter. The current bottleneck is the itable stub as shown in the issue description, so it seems a good subject for optimization. I've also been using jmh benchmarks to reproduce the performance issues and validate the optimizations.

By the way, I wonder if the CHA issue you mentioned could be related to the performance regression we see with Scala 2.12 since it's the first scala version that emits bytecode with java 8 features like default methods. I'll look into it.

fwbrasil commented 5 years ago

@tkrodriguez I'm looking into how to determine the code size of each switch strategy. I considered creating a new masm instance, emitting the code, and then discarding it, but it doesn't seem feasible because of a few methods that mutate the lir builder and the labels. I think I'll change EffortClosure to include an estimate of the code size, but it might not be precise for all platforms and might be wrong if someone changes the emitted code but forgets to update the estimates. Any suggestions?

tkrodriguez commented 5 years ago

Sorry this was sitting unsent in a browser window

Maybe the ideal is to have the platform dependent emitter have an estimation mode so the estimation and code generation are co-located. I'm not sure we have to get too fancy but I'm having trouble imagining what the effects of the current policy are either. I'd expect/hope the hashing strategy is competitive with a binary search strategy.

oracle / graal