newrelic / newrelic-java-agent

The New Relic Java agent
Apache License 2.0
202 stars 143 forks source link

Instrumentation is bugged for Slick 3.5.0 #1996

Closed jmreardon-yoppworks closed 2 months ago

jmreardon-yoppworks commented 3 months ago

Description

I am trying to upgrade a service to Slick 3.5.0. When I run this service with the updated Slick & New Relic Java agent 8.13.0, some Slick queries just…give no response. I believe the instrumentation needs an update.

Expected Behavior

Previously, the service depended on Slick 3.3.3 and worked with the agent enabled.

Troubleshooting or NR Diag results

Steps to Reproduce

So far I have only observed the issue in a large Akka service. Adding

  class_transformer:

    excludes:
      slick/.*

to my configuration results in the service behaving as expected.

Your Environment

JVM 11, Akka 2.9.3, Slick 3.5.0, MySQL or H2 database.

workato-integration[bot] commented 3 months ago

https://new-relic.atlassian.net/browse/NR-297102

jmreardon-yoppworks commented 3 months ago

Looking a bit further, excluding specifically slick/util/AsyncExecutor.* is enough. This class was rewritten in Slick 3.5.0

kanderson250 commented 3 months ago

@jmreardon-yoppworks Thanks for the report. To confirm- were you running successfully with agent 8.13 and Slick 3.3.x, and it was the upgrade to slick 3.5.0 that broke things for you?

jmreardon-yoppworks commented 3 months ago

@kanderson250 , that's correct.

kanderson250 commented 3 months ago

@jmreardon-yoppworks okay, thank you.

To help us repro and debug this, it would be great to get a little more information about the slick queries that are problematic. Are the stalling queries consistent, ie are they stalling every time? Or are they running sometimes... and then getting hung up? Is the stall affecting some queries, and not others?

If you're able to provide a pared-down code sample, that would be very helpful.

jmreardon-yoppworks commented 3 months ago

@kanderson250 I tried testing New Relic + some Slick demo queries and am not seeing any failures. The actually queries are being made by the Akka Persistence JDBC library, so it's a bit beneath the level of code I actually control.

kanderson250 commented 3 months ago

@jmreardon-yoppworks Okay, thanks for the update. I also wasn't seeing failures with straight slick queries. Will continue to investigate.

jmreardon-yoppworks commented 2 months ago

@kanderson250 , I was able to get a reproducer working! The issue happens when there are more calls being made than database connections: https://github.com/jmreardon-yoppworks/new-relic-bug-1996

jmreardon-yoppworks commented 2 months ago

I suspect this is related to the PrioritizedRunnable added in 3.5.0 (https://github.com/slick/slick/blob/main/slick/src/main/scala/slick/util/AsyncExecutor.scala#L29), conflicting with the New Relic Runnable wrapper. The implementation does a type check which I think the NR wrapper is defeating becasue it wraps that runnable in a NewRelicRunnable (https://github.com/slick/slick/blob/main/slick/src/main/scala/slick/util/AsyncExecutor.scala#L280)

kanderson250 commented 2 months ago

@jmreardon-yoppworks Awesome!! Thank you for taking the time to continue sleuthing and figure out this repro! Will take a look and let you know how it goes.

kanderson250 commented 2 months ago

We found the bug - leaving some comments here in case this resurfaces at a later time.

The cause of the stall was our NewRelicAsyncExecutor wrapper. Internally (as of 3.5.0), Slick always implements AsyncExecutor as AsyncExecutorImpl. Our wrapper was effectively upcasting AsyncExecutorImpl back to AsyncExecutor, which led to numerous concurrency issues.

In more detail, this problem manifested because of ManagedArrayBlockingQueue. The queue relies on the AsyncExecutorImpl implementation of PrioritizedRunnable to correctly increment and decrement its number of in-progress tasks. Our NewRelicAsyncExecutor wrapper ignored this implementation, instead falling through to the default PrioritizedRunnables that incremented the queue's in-progress counter without decrementing the counter (though the runnables themselves did complete). Eventually, the in-progress counter exceeded the number of max connections and locked the queue into a paused state.

The previously mentioned NewRelicRunnable wrapper was also a problem (just not, it turns out, the biggest problem).

jmreardon-yoppworks commented 2 months ago

@kanderson250 , any idea when a fix for these issues will land?

kanderson250 commented 2 months ago

@jmreardon-yoppworks Yep, we have a fix developed. We weren't able to get it in to our next release 8.14.0 (slated for this week) so that means it will be going into 8.15.0 several weeks from now.

If you're interested, here is a snapshot build with the fix (scroll to Artifacts at the bottom of the Summary page for the custom-newrelic-jar download link): https://github.com/newrelic/newrelic-java-agent/actions/runs/10494664569

You're welcome to try out this snapshot in a test env and see if it works for you.

jmreardon-yoppworks commented 2 months ago

@kanderson250 snapshot jar works in my test env! Will keep an eye out for when the fix lands. Thanks!