meiao closed this issue 2 years ago
@Stephan202 I believe someone on your team / at your company reached out for support regarding this. If you are interested in testing it, we have this PR, which gave us a small performance improvement on a small test application. Alternatively, you can get a custom jar from our action (needs to be logged in on GitHub) at: https://github.com/newrelic/newrelic-java-agent/actions/runs/2556404809
@meiao great! Indeed, @Ptijohn filed the internal ticket. I'll try to get us to test #895 ~some time this week; stay tuned :)
@Ptijohn and I tested #895 by applying it on top of version 7.8.0. The following graph shows the response times for the key transaction that was most impacted:
So we'd say the code has the desired impact. :heavy_check_mark:
@Stephan202 @Ptijohn We've conducted some more testing on that PR and found that it was not ready for GA. There were some thread hops that were not being captured by our instrumentation, which resulted in a loss of visibility for anything that happened after them. One specific case was HTTP requests made with Spring WebClient on top of Reactor Netty. For now, work on this issue will be stopped due to other commitments.
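To make the failure mode concrete, here is a rough sketch (class and endpoint names are made up) of where such a hop occurs: the handlers of a WebClient call run on a Reactor Netty event-loop thread rather than the thread that started the transaction.

```java
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

public class WebClientExample {
    private final WebClient webClient = WebClient.create("https://example.com");

    public Mono<String> fetch() {
        // Subscription starts on the caller's thread, but the response body is
        // typically delivered on a Reactor Netty event-loop thread
        // (e.g. "reactor-http-nio-*"), i.e. a thread hop the agent must follow.
        return webClient.get()
                .uri("/resource")
                .retrieve()
                .bodyToMono(String.class)
                .doOnNext(this::process); // runs on the event-loop thread
    }

    private void process(String body) {
        // Any segment recorded here is lost if the agent did not link this
        // thread back to the originating transaction.
    }
}
```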
We have determined that we will not be able to address this further without breaking the instrumentation fix for thread hops.
@kford-newrelic this is very unfortunate. It basically means that we'll forever (well, for as long as we use New Relic...) need to deploy a custom fork of the agent with #538 reverted. To be honest, as a customer it's hard to see how that's considered acceptable. We're talking about Spring here, not some obscure framework.
It would be great to understand (in as much technical detail as is necessary) why this issue can't be fixed in the New Relic Agent. We also have a few deployments running the OTEL Agent, and it provides vastly superior support for thread-switching (as in: "it just works"), without any customization and with far less overhead than what can be seen in the graph above. (Can't give exact numbers; it's been a while since I tested.)
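For comparison, here is a minimal sketch of how trace context follows a thread hop with the OpenTelemetry API. The OTEL agent does this wrapping automatically via bytecode instrumentation; the executor and task below are placeholders.

```java
import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OtelContextExample {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    public void handOff(Runnable work) {
        // Context.current().wrap(...) captures the active trace context so the
        // task stays attached to the same trace after the thread hop.
        executor.submit(Context.current().wrap(work));

        // Alternatively, Context.taskWrapping(executor) returns an executor
        // that wraps every submitted task automatically.
    }
}
```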
@Stephan202 we appreciate your feedback, and frankly we'd rather fix this issue too, but at the moment, with the current design of the agent, we don't know how to reduce the extra cycles needed to properly look up and link tokens while still keeping the instrumentation needed for it to be correct - and that was after ~3 weeks of dedicated, heads-down, deep focus.
We first noticed an issue with some of our instrumentation because, when applications used the Spring WebClient, there were extra thread hops that went unaccounted for, which resulted in transactions not capturing all the recorded segments. That was just one trigger we noticed; we assume there could be others that we haven't yet run across. To fix this issue, we had to include code that properly handles the token linking, but as your team noticed, those extra compute cycles come with a cost that we just don't know how to mitigate at the moment.
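As a rough illustration of the lookup/link step in question, this is approximately the pattern the public agent API exposes for manual async linking (the class and executor here are illustrative; the built-in instrumentation has to do the equivalent internally on every hop):

```java
import com.newrelic.api.agent.NewRelic;
import com.newrelic.api.agent.Token;
import com.newrelic.api.agent.Trace;
import java.util.concurrent.Executor;

public class TokenLinkingExample {

    @Trace(dispatcher = true)
    public void startWork(Executor executor) {
        // Obtain a token on the thread where the transaction is active...
        Token token = NewRelic.getAgent().getTransaction().getToken();
        executor.execute(() -> continueWork(token));
    }

    @Trace(async = true)
    private void continueWork(Token token) {
        // ...and link it on the new thread so segments recorded here are
        // attached to the original transaction. This lookup/link work is the
        // per-hop cost discussed above.
        token.linkAndExpire();
        // do the actual work
    }
}
```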
Description
A customer experienced decreased performance in their application after upgrading from agent 7.4.3 to 7.5.0.
After investigation, they were able to pinpoint the cause to PR #538.
They built a custom version of the 7.5.0 agent without that PR, and performance was similar to 7.4.3.