jvm-profiling-tools / perf-map-agent

A java agent to generate method mappings to use with the linux `perf` tool
GNU General Public License v2.0
1.65k stars 260 forks

High CPU time in vtable chunks #13

Closed kotaravi closed 9 years ago

kotaravi commented 9 years ago

During our code profiling tests on 16-core systems running 8-16 worker threads, we are observing significant CPU time (~25%) attributed to "vtable chunks" in the generated symbol map. This symbol also appears to be listed multiple times in the symbol map. The overhead also increases as the number of running threads is gradually raised from 8 to 16.

Is this an overhead of perf-map-agent or something else? We would also like to confirm whether it's related to virtual methods in our code.

Another, unrelated issue is that perf top always displays "unable to open vsyscalls". Is there a way to resolve this? Does it mean that perf is unable to display system calls and their CPU time? All processes are being executed by the "root" user.

jrudolph commented 9 years ago

Yes, "vtable chunks" is related to dynamic dispatch. Unfortunately, I don't know of any good tools that would help explain where these calls originate (you would need to walk the stack to find out). Since you mention worker threads, a blind guess would be that you have very short-lived tasks, and that calling the tasks themselves from the worker threads produces part of this overhead.
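To illustrate that guess (a hypothetical sketch with made-up task classes, not anyone's actual code): when worker threads drain a queue holding many different Runnable implementations, the single `task.run()` call site sees many receiver types, HotSpot can no longer devirtualize it, and the dispatch cost shows up in vtable/itable stubs:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class MegamorphicTasks {
    // Three distinct Runnable classes: the call site in drain() sees all of
    // them, so HotSpot cannot devirtualize it and each call goes through a stub.
    static final class ParseTask  implements Runnable { public void run() { counter += 1; } }
    static final class EncodeTask implements Runnable { public void run() { counter += 2; } }
    static final class FlushTask  implements Runnable { public void run() { counter += 3; } }

    static long counter = 0;

    static long drain(Queue<Runnable> queue) {
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run(); // megamorphic interface call: this is where stub time accrues
        }
        return counter;
    }

    public static void main(String[] args) {
        Queue<Runnable> q = new ArrayDeque<>();
        for (int i = 0; i < 3; i++) {
            q.add(new ParseTask());
            q.add(new EncodeTask());
            q.add(new FlushTask());
        }
        System.out.println(drain(q)); // 3 * (1 + 2 + 3) = 18
    }
}
```

With very short tasks, the per-call dispatch in the loop is a noticeable fraction of the total work.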

Another unrelated issue is that perf top always displays "unable to open vsyscalls".

I don't know about that, I guess that's an issue with perf top.

kotaravi commented 9 years ago

Thank you, Johannes. Based on perf top output, we initially suspected vtable lookups and dynamic dispatch as a key issue but could not really confirm it. perf-map-agent is a really great tool in that it separates this dynamic dispatch overhead from user code. Most Java profilers appear to either fold this overhead into method time or leave it unaccounted for.

We certainly have a lot of “light” virtual methods that are called frequently and are likely contributing to this issue. We will review our invokevirtual call sites, along with those showing high TypeProfile counts, to avoid dynamic dispatch on such code paths. Hopefully, we can also benefit from inlining as a result.
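For illustration, one pattern for such call sites (a hypothetical sketch, not our actual code) is to peel the dominant concrete type off a hot call site, so the JIT sees a monomorphic, inlinable call on the common path and only rare types take the virtual dispatch:

```java
public class PeeledDispatch {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    static final class Square implements Shape {
        final double s;
        Square(double s) { this.s = s; }
        public double area() { return s * s; }
    }

    // Hot loop: the instanceof check turns the common case into a call on a
    // final, statically known receiver type, which the JIT can inline;
    // only uncommon types fall back to normal virtual dispatch.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape sh : shapes) {
            if (sh instanceof Circle) {
                sum += ((Circle) sh).area(); // monomorphic, inlinable fast path
            } else {
                sum += sh.area();            // fallback: virtual dispatch
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1), new Square(2) };
        System.out.println(totalArea(shapes)); // Math.PI + 4.0
    }
}
```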

I have a few follow-up questions that I am hoping @jrudolph or other experts on this forum can comment on.

(1) Previously (last year) this overhead showed up against the symbol "vtable stub". Are these the same or similar, and are they defined in the Java runtime or in the perf-map-agent code?

(2) Is the reported CPU time for vtable chunks primarily due to dynamic dispatch, or does it also include other work? How accurate is this timing?

(3) Any thoughts on why dynamic dispatch does not scale well? We’ve observed that it is also CPU-architecture-dependent to a large degree. I am also curious whether this has been an issue for anyone else.

Thanks, Ravi

jrudolph commented 9 years ago

Wrt (1):

Now that you mention it, I can only remember seeing vtable stub and not vtable chunks. They are both reported through the JVMTI DynamicCodeGenerated callback, so perf-map-agent will just output whatever your JVM tells it.

I had a quick look, and there's a slight code-structure change going from JDK 8 to JDK 9 which could explain an (intermittent?) name change for the same thing:

http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/c25bfaaed7f2/src/share/vm/code/vtableStubs.cpp#l62

(2) TBH I don't know how accurate the reported times are. My current assumption is that the vtable stub runs once for each virtual call from a compiled method that could not be optimized (i.e. a call site that is not monomorphic). E.g. see this description: https://wikis.oracle.com/display/HotSpotInternals/VirtualCalls

On the other hand, as long as I haven't seen actual numbers accompanied by the code positions these calls originate from, I cannot say from experience that these numbers correspond to anything you can act on. (This alludes to a paper that evaluated Java profilers by measuring whether a profiler produces "actionable" results which you could use to actually optimize the profiled program.)

(3) Wrt "scaling", I think it's known that dynamic program flow with unpredictable branching has some direct and indirect costs. The question is whether the JVM amplifies some of these costs in certain situations. If you look at http://shipilev.net/blog/2015/black-magic-method-dispatch/ it seems that, yes, some calls through invokevirtual and invokeinterface are more expensive than a simulation with other branching constructs like if/else or switch.
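For illustration, that kind of "simulation" replaces the virtual call with an explicit type tag and a switch; a minimal sketch with made-up node types (hypothetical, not from the benchmark itself):

```java
public class TagDispatch {
    static final int ADD = 0, MUL = 1;

    // Instead of a virtual eval() on a class hierarchy, carry an explicit
    // int tag and branch on it: the JIT sees a plain switch rather than a
    // vtable/itable load and indirect call.
    static final class Node {
        final int tag;
        final int left, right;
        Node(int tag, int left, int right) { this.tag = tag; this.left = left; this.right = right; }
    }

    static int eval(Node n) {
        switch (n.tag) {
            case ADD: return n.left + n.right;
            case MUL: return n.left * n.right;
            default:  throw new IllegalArgumentException("unknown tag " + n.tag);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(new Node(ADD, 2, 3))); // 5
        System.out.println(eval(new Node(MUL, 2, 3))); // 6
    }
}
```

Whether this wins in practice depends on the shape of the type distribution, which is exactly what the linked post measures.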

IMO, in most situations the solution is still not to micro-optimize these dynamic parts but to increase the size of code blocks that have a statically predictable execution flow. E.g. if the problem really is the dispatching of tasks, you'd want to increase the amount of work done during each task. Other kinds of work aren't as easily optimized this way (e.g. data structure processing with many different node types, like ASTs).
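As a sketch of that idea (hypothetical names throughout, assuming a simple summing workload): instead of submitting one tiny task per item, submit one task per chunk, so a single dispatch amortizes over a straight-line, statically predictable inner loop:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedTasks {
    // Straight-line work over a whole chunk: one task dispatch is amortized
    // over `batch` items instead of paying a dispatch per item.
    static long sumBatch(List<Integer> items) {
        long sum = 0;
        for (int v : items) sum += v;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 100; i++) data.add(i);

        final int batch = 25;
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < data.size(); i += batch) {
            List<Integer> chunk = data.subList(i, Math.min(i + batch, data.size()));
            futures.add(pool.submit(() -> sumBatch(chunk))); // one submit per chunk
        }
        long total = 0;
        for (Future<Long> f : futures) total += f.get();
        pool.shutdown();
        System.out.println(total); // 5050 = 1 + 2 + ... + 100
    }
}
```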

Wrt CPU-architecture dependence: Out of interest: What kind of differences did you see and on which architectures did you test?

The architecture is quite a big factor: apart from the hardware itself, you also change to a different JIT compiler backend and a different performance-event implementation, so it's hard to say whether it makes sense to compare results (though, I agree, it would be nice if you could optimize for a larger range of architectures).

jrudolph commented 9 years ago

Thanks for starting this discussion, @kotaravi. I'm going to close this for now. Please comment or reopen if there are any questions left open.

kotaravi commented 9 years ago

Thank you @jrudolph. Your insights and perf-map-agent itself have helped set a new direction in our approach to solving our scalability issues. Appreciate all your support.

We compared Intel Westmere processors vs. Sandy Bridge and observed that the Sandy Bridge-based systems were impacted significantly more, by ~20%. We've also used Intel's emon to quantify CPI and LLC miss latencies to understand how they contribute to the overall slowdown. We have also tested for and eliminated NUMA as an issue by controlling processor affinity with taskset and memory locality with numactl.

Finally, we have some work in progress to make our code vtable- and inlining-friendly. I am not sure we will have test results and comparisons anytime soon, so we can leave this issue closed. Thanks.

jrudolph commented 9 years ago

@kotaravi very interesting, thanks for sharing your insights.