Various thread-related improvements

jvm-profiling-tools / honest-profiler

A sampling JVM profiler without the safepoint sample bias

https://github.com/RichardWarburton/honest-profiler/wiki

MIT License

1.25k stars 146 forks source link

Various thread-related improvements #104

Closed cinquin closed 8 years ago

cinquin commented 8 years ago

This is a set of changes I made while trying to understand why using short sampling intervals leads to either a deadlock or extremely low performance (in one instance a sampling interval of 15ms is fine, with the program completing in a few seconds, but with 5ms the program does not complete in a useful time frame). Haven't got to the bottom of it yet, but for what it's worth according to GDB backtraces it may be that the profiler does force the JVM to stop at a safepoint when inspecting methods and determining line numbers (haven't checked the detail of the corresponding JVM code). Perhaps sampling events in short succession just effectively stop progression of the Java program being executed by the JVM.

nitsanw commented 8 years ago

I find it hard to believe the JVM is stopped at a SP for line of code reporting. This is not what I see when running with PrintGCApplicationStoppedTime. Please run your profiling session with the following flags: "-XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1" This will let you know the reason for each global safepoint. In my testing there are very few SP as the run progresses and none are related to the profiler. If you can log this as an issue with some reproduction steps it would help.

cinquin commented 8 years ago

I find it hard to believe the JVM is stopped at a SP for line of code reporting. This is not what I see when running with PrintGCApplicationStoppedTime. Please run your profiling session with the following flags: "-XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1" This will let you know the reason for each global safepoint. In my testing there are very few SP as the run progresses and none are related to the profiler. If you can log this as an issue with some reproduction steps it would help.

I tried those options and indeed the results suggest SP are not an issue.

cinquin commented 8 years ago

This should not be an issue we're trying to avoid. This is valid behaviour to sample 2 threads concurrently and both the signal handler code and the JVM should be capable of doing so.

For what it's worth, the set of changes in this pull request reduces the frequency at which the problem discussed in issue #80 occurs (it does not eliminate the problem though). I have not observed a strong impact on the effective sampling rate (although the impact may depend on use cases, and I have not performed detailed benchmarking). I have added a stack trace that will perhaps help get to the bottom of issue #80.

cinquin commented 8 years ago

Please see this more detailed analysis of the problem. Suggestions welcome on how to best address it.

RichardWarburton commented 8 years ago

Sorry @cinquin and @nitsanw - I've completely lost track of what is happening in this pull request ;) And my apologies for the delay in getting back to things.

@cinquin did you address the issues that @nitsanw raised about safepoints not being an issue, because I'm sure that he's correct on that front? If so then I can probably look at merging this PR.

cinquin commented 8 years ago

@RichardWarburton Yes I agree that safepoints are not an issue. I also agree that it is not desirable to merge a2043c1 : now that I've narrowed down the problem in issue #80 it is apparent that it is kind of accidental that this change reduces crash frequency, and that it does not address the underlying issue. I think the other commits are fine to merge (I did not completely follow the comment by @nitsanw on 56bd686 ; I'm sure this could be implemented in a different way, but I think it does do the job as is; I'm happy for the specific implementation to be changed.)

RichardWarburton commented 8 years ago

Hi @cinquin,

What I've decided to do is to cherry-pick the commits c0c1d2377bc0bb1538b9c9e77986396abd35d38e and 3a498e02adbeff20cc3cc21ece122bde3c50e6ee from your pull request and merge them in. Thanks for the contribution, its much appreciated.

I understand what you're going for with 56bd686fdc03bfe33e5f376727016a92fd00b5f9 but it seems like locks are a bit overkill as @nitsanw says. I think I need to re-evaluate the log file format in general, see issue #106 which I've just filed to take note of.

I think we all agree that a2043c193c0a9fe5997efee66e6c8c98f48cf9c8 is unnecessary and so shouldn't be merged.

RichardWarburton commented 8 years ago

Side-note: its a bit of a shame that the git UI won't let me cherry pick the things and I need to do it manually, but even though this pull request is nominally closed, it was still a much appreciated contribution.

cinquin commented 8 years ago

That all sounds good to me; thanks for accepting those contributions. I think I may have a fix for the hangs under OS X. I'll create a separate pull request once I've tested it a bit more. I'll also add a separate comment later on 56bd686 (I'm not sure file corruption is the problem).