stagemonitor / stagemonitor-mailinglist

GitHub issues abused as a mailing list

Instrumentation overhead high but only when in Docker container? #71

Open ryanrupp opened 6 years ago

ryanrupp commented 6 years ago

I'm trying to troubleshoot a performance issue with Stagemonitor that only shows up when running in a container. To isolate it, I changed the Spring Pet Clinic demo setup for Stagemonitor to introduce a method that gets called many times within a single request, e.g. 1 million times (just to amplify the issue). You can find the change here. The change is to the "Veterinarian" tab. Basically, what I'm seeing is:

1) Local Java + stagemonitor OFF = < 200ms
2) Local Java + stagemonitor ON = < 200ms
3) Containerized Docker + stagemonitor OFF = < 200ms
4) Containerized Docker + stagemonitor ON = ~7 seconds

So for some reason this overhead is only really noticeable with the combination of container + stagemonitor. Running outside of a container with stagemonitor works fine, and running in a container without stagemonitor works fine; only the combination produces the large overhead. The Docker image I'm using is openjdk:8u162-jdk; more details are on its Docker Hub page here. The Java versions were similar, although locally I was running 8u161.
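For anyone who can't open the linked change, here is a minimal sketch of the kind of amplification described above. It is not the actual Pet Clinic change: the class and method names (`HotMethodDemo`, `tinyStep`, `handleRequest`) are hypothetical, and it only illustrates a very cheap method being called 1 million times inside one request, so that any fixed per-call cost added by instrumentation is multiplied accordingly.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the modified "Veterinarian" request handler.
// None of these names come from Stagemonitor or the Pet Clinic code base.
class HotMethodDemo {

    // A deliberately tiny method: cheap on its own, so any per-call
    // instrumentation cost dominates once it is invoked many times.
    static long tinyStep(long acc, int i) {
        return acc + (i & 1);
    }

    // Simulates one request that calls the tiny method `calls` times.
    static long handleRequest(int calls) {
        long acc = 0;
        for (int i = 0; i < calls; i++) {
            acc = tinyStep(acc, i);
        }
        return acc;
    }

    public static void main(String[] args) {
        int calls = 1_000_000;

        // Warm up so the JIT has a chance to compile the hot loop first.
        for (int i = 0; i < 20; i++) {
            handleRequest(calls);
        }

        long start = System.nanoTime();
        long result = handleRequest(calls);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        // Print the result so the work cannot be optimized away entirely.
        System.out.println("result=" + result + " elapsedMs=" + elapsedMs);
    }
}
```

Running the same class locally and in the container, with and without the agent, would give the four combinations listed above without needing the full web application.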

Any ideas? I'm trying to think of other variables to narrow this down. I took a look at the JIT logs via -XX:+PrintCompilation, but nothing immediately stood out as different between Docker + stagemonitor ON and local + stagemonitor ON. I've seen this issue in production on longer-lived JVMs, so I don't think it's one of the common benchmarking pitfalls like not waiting long enough for JIT compilation to occur. Our stagemonitor setup basically includes instrumentation only for our own packages, e.g. "com.mycompany". Because of this issue we've now added some excludes to try to weed out those granular/low-level methods, but that's not really ideal, and given that this works fine locally I'm not sure we should even have to do this.

I haven't had a chance to try other images yet. I've also reproduced this on EC2 instances running Linux, although for the testing above I was using Docker for Mac (so probably not a true apples-to-apples comparison, but you can see that local/stagemonitor off vs. container/stagemonitor off performance is pretty much the same).
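To compare more variables between the environments, a small probe like the one below could be run both locally and inside the container. It is only a sketch: the class name `EnvProbe` is hypothetical, and including the cost of `System.nanoTime()` rests on an assumption that is not confirmed anywhere in this report, namely that per-call instrumentation takes timestamps around each invocation. Whether any of these values actually differs between the two setups is unknown here.

```java
// Hypothetical diagnostic; prints what the JVM sees in its current environment.
class EnvProbe {

    // Average cost in nanoseconds of one System.nanoTime() call,
    // measured over `iterations` consecutive calls.
    static double averageNanoTimeCost(int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += System.nanoTime();
        }
        long elapsed = System.nanoTime() - start;
        // Reference sink so the loop is not removed as dead code.
        if (sink == 42) {
            System.out.println(sink);
        }
        return (double) elapsed / iterations;
    }

    // Summarizes the JVM version, OS, CPU count, and max heap as seen by this JVM.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        return "java.version=" + System.getProperty("java.version")
                + " os.name=" + System.getProperty("os.name")
                + " availableProcessors=" + rt.availableProcessors()
                + " maxMemoryMiB=" + (rt.maxMemory() / (1024 * 1024));
    }

    public static void main(String[] args) {
        System.out.println(describe());

        // Warm up before measuring so the loop is likely JIT-compiled.
        averageNanoTimeCost(1_000_000);
        System.out.println("avg System.nanoTime() cost (ns): "
                + averageNanoTimeCost(1_000_000));
    }
}
```

If the two outputs look alike, these variables can probably be ruled out; if one of them differs by a wide margin, that would be a candidate to dig into further.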