meiao closed this issue 2 years ago
@Stephan202 I believe someone on your team / at your company reached out for support regarding this. If you are interested in testing it, we have this PR, which gave us a small performance improvement on a small test application. Alternatively, you can get a custom jar from our action (needs to be logged in on GitHub) at: https://github.com/newrelic/newrelic-java-agent/actions/runs/2556404809
@meiao great! Indeed, @Ptijohn filed the internal ticket. I'll try to get us to test #895 ~some time this week; stay tuned :)
@Ptijohn and I tested #895 by applying it on top of version 7.8.0. The following graph shows the response times for the key transaction that was most impacted:
So we'd say the code has the desired impact. :heavy_check_mark:
@Stephan202 @Ptijohn We've conducted some more testing on that PR and found that it was not ready for GA. There were some thread hops that were not being captured by our instrumentation, which resulted in a loss of visibility for anything that happened after them. One specific case was HTTP requests made with Spring WebClient on top of Reactor Netty. For now, work on this issue will be stopped due to other commitments.
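To make the failure mode concrete, here is a rough sketch (class and endpoint names are made up) of where such a hop occurs: the handlers of a WebClient call run on a Reactor Netty event-loop thread rather than the thread that started the transaction.

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class WebClientExample {
    private final WebClient webClient = WebClient.create("https://example.com");

    public Mono<String> fetch() {
        // Subscription starts on the caller's thread, but the response body is
        // typically delivered on a Reactor Netty event-loop thread
        // (e.g. "reactor-http-nio-*"), i.e. a thread hop the agent must follow.
        return webClient.get()
                .uri("/resource")
                .retrieve()
                .bodyToMono(String.class)
                .doOnNext(this::process); // runs on the event-loop thread
    }

    private void process(String body) {
        // Any segment recorded here is lost if the agent did not link this
        // thread back to the originating transaction.
    }
}
```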
We have determined that we will not be able to address this further without breaking the instrumentation fix for thread hops.
@kford-newrelic this is very unfortunate. It basically means that we'll forever (well, for as long as we use New Relic...) need to deploy a custom fork of the agent with #538 reverted. To be honest, as a customer it's hard to see how that's considered acceptable. We're talking about Spring here, not some obscure framework.
It would be great to understand (in as much technical detail as is necessary) why this issue can't be fixed in the New Relic Agent. We also have a few deployments running the OTEL Agent, and it provides vastly superior support for thread-switching (as in: "it just works"), without any customization and with far less overhead than what can be seen in the graph above. (Can't give exact numbers; it's been a while since I tested.)
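For comparison, here is a minimal sketch of how trace context follows a thread hop with the OpenTelemetry API. The OTEL agent does this wrapping automatically via bytecode instrumentation; the executor and task below are placeholders.

```java
import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OtelContextExample {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    public void handOff(Runnable work) {
        // Context.current().wrap(...) captures the active trace context so the
        // task stays attached to the same trace after the thread hop.
        executor.submit(Context.current().wrap(work));

        // Alternatively, Context.taskWrapping(executor) returns an executor
        // that wraps every submitted task automatically.
    }
}
```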
@Stephan202 we appreciate your feedback, and frankly we'd rather fix this issue too, but at the moment, with the current design of the agent, we don't know how to reduce the extra cycles needed to properly look up and link tokens while still keeping the instrumentation needed for it to be correct - and that was after ~3 weeks of dedicated, heads-down, deep focus.
We first noticed an issue with some of our instrumentation because, when applications used the Spring WebClient, there were extra thread hops that went unaccounted for, which resulted in transactions not capturing all the recorded segments. That was just one trigger we noticed; we assume there could be others that we haven't yet run across. To fix this issue, we had to include code that properly handles the token linking, but as your team noticed, those extra compute cycles come with a cost that we just don't know how to mitigate at the moment.
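As a rough illustration of the lookup/link step in question, this is approximately the pattern the public agent API exposes for manual async linking (the class and executor here are illustrative; the built-in instrumentation has to do the equivalent internally on every hop):

```java
import com.newrelic.api.agent.NewRelic;
import com.newrelic.api.agent.Token;
import com.newrelic.api.agent.Trace;
import java.util.concurrent.Executor;

public class TokenLinkingExample {

    @Trace(dispatcher = true)
    public void startWork(Executor executor) {
        // Obtain a token on the thread where the transaction is active...
        Token token = NewRelic.getAgent().getTransaction().getToken();
        executor.execute(() -> continueWork(token));
    }

    @Trace(async = true)
    private void continueWork(Token token) {
        // ...and link it on the new thread so segments recorded here are
        // attached to the original transaction. This lookup/link work is the
        // per-hop cost discussed above.
        token.linkAndExpire();
        // do the actual work
    }
}
```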
Description
A customer experienced decreased performance in their application after upgrading from agent 7.4.3 to 7.5.0.
After investigation, they were able to pinpoint the cause to PR #538.
They built a custom version of the 7.5.0 agent without that PR, and performance was similar to 7.4.3.