newrelic / newrelic-java-agent

The New Relic Java agent
Apache License 2.0

Investigate whether Caffeine BoundedLocalCache can be made non-blocking #1711

Closed jasonjkeller closed 2 days ago

jasonjkeller commented 5 months ago

About

Investigate whether a Caffeine BoundedLocalCache can be made non-blocking, or whether an UnboundedLocalCache already has non-blocking behavior, so that ExtensionHolderImpl does not lead to thread contention under heavy usage.

https://github.com/ben-manes/caffeine/blob/2bcb7ca5260cfc266ed81bc31939cf3cadaadc67/caffeine/src/main/java/com/github/benmanes/caffeine/cache/Caffeine.java#L1016-L1024
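For reference, the linked `build()` method is where Caffeine decides between the bounded and unbounded implementations. A minimal illustration of that distinction (a sketch assuming Caffeine 3.x APIs, not agent code):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class BoundedVsUnbounded {
    public static void main(String[] args) {
        // No size bound, expiry, or reference strength configured:
        // build() returns a cache backed by UnboundedLocalCache.
        Cache<String, String> unbounded = Caffeine.newBuilder().build();

        // Any bounding feature (maximumSize, expiry, weakKeys, etc.) switches the
        // implementation to BoundedLocalCache, which owns eviction and maintenance work.
        Cache<Object, String> bounded = Caffeine.newBuilder()
                .weakKeys()
                .build();

        unbounded.put("k", "v");
        bounded.put(new Object(), "v");
    }
}
```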

Problem

When instrumentation modules add an @NewField to a class, the new fields don't actually get woven into the class; instead, their values are stored in a cache keyed by the weaved object instance. These new fields are obtained by calling ExtensionHolderImpl.getExtension, which uses a com.github.benmanes.caffeine.cache.BoundedLocalCache under the hood. Periodically, BoundedLocalCache.performCleanUp is called to clean up cached entries whose key references have gone out of scope. When a large number of object instances are being created and cached, this cleanup can cause blocking.
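A minimal sketch of that mechanism (illustrative names and signatures, not the agent's actual ExtensionHolderImpl):

```java
import java.util.function.Supplier;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// @NewField values are held in a Caffeine cache keyed by the weaved object instance,
// with weak keys so entries become collectible once the instance is unreachable.
public class ExtensionHolderSketch<E> {
    private final Cache<Object, E> instanceCache = Caffeine.newBuilder()
            .weakKeys()
            .build();

    public E getExtension(Object instance, Supplier<E> valueLoader) {
        // The write path may trigger afterWrite -> scheduleAfterWrite -> performCleanUp,
        // which is where the blocking shows up in the stack trace below.
        return instanceCache.get(instance, k -> valueLoader.get());
    }
}
```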

The stack trace below shows one such scenario, where a large number of PreparedStatement instances are instrumented, each of which generates a preparedSql new field.

https://github.com/newrelic/newrelic-java-agent/blob/0eeef2826a7acd0144b19be22732cab430d43de7/instrumentation/jdbc-generic/src/main/java/java/sql/PreparedStatement_Weaved.java#L28-L29
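The referenced weave class has roughly the shape sketched below (illustrative, not a verbatim copy of the linked file):

```java
import com.newrelic.api.agent.weaver.MatchType;
import com.newrelic.api.agent.weaver.NewField;
import com.newrelic.api.agent.weaver.Weave;

// The @NewField reads like an ordinary field, but at runtime its value lives in the
// extension cache keyed by the PreparedStatement instance, not in the JDBC class itself.
@Weave(type = MatchType.Interface, originalName = "java.sql.PreparedStatement")
public abstract class PreparedStatement_Weaved {

    @NewField
    String preparedSql;   // one extension cache entry per PreparedStatement instance
}
```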

"FikenJetty-174" prio=5 Id=174 RUNNABLE
        at java.base@21.0.1/java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:940)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.drainKeyReferences(BoundedLocalCache.java:1510)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1488)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1460)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3359)
        at com.newrelic.agent.instrumentation.weaver.extension.ExtensionHolderFactoryImpl$ExtensionHolderImpl$$Lambda/0x00007ff4fc381428.execute(Unknown Source)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:1432)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleAfterWrite(BoundedLocalCache.java:1399)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1370)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2471)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
        at com.newrelic.agent.deps.com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
        at com.newrelic.agent.instrumentation.weaver.extension.ExtensionHolderFactoryImpl$ExtensionHolderImpl.getExtension(ExtensionHolderFactoryImpl.java:50)
        at app//com.newrelic.weave.java.sql.PreparedStatement_891077813_nr_ext.getExtension(CaffeineBackedExtensionClass.java:31)
        at app//org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:1379)
        at app//org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:1827)
        at app//org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:523)
        at app//com.zaxxer.hikari.pool.ProxyConnection.prepareStatement(ProxyConnection.java:327)
        at app//com.zaxxer.hikari.pool.HikariProxyConnection.prepareStatement(HikariProxyConnection.java)
        ...

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@559a7e85

Related

workato-integration[bot] commented 5 months ago

https://new-relic.atlassian.net/browse/NR-219742

ben-manes commented 5 months ago

I believe this project uses a caller-runs executor on its cache, whereas by default Caffeine uses the ForkJoinPool. I think this was due to a bug in some JDK versions that broke the executor when running in a single-CPU Docker container.

In some cases Caffeine has thresholds on its work to amortize the cost across threads. While we could do that more liberally, it seems odd to have such a high count that it’s a bottleneck. I’d expect either that should be fixed or a profiler should confirm the actual problem.
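For context, the executor choice under discussion looks roughly like this on the builder (a sketch assuming Caffeine 3.x, not the agent's exact configuration):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class ExecutorChoices {
    // Default: maintenance (eviction, reference cleanup) is scheduled asynchronously
    // on ForkJoinPool.commonPool().
    Cache<Object, Object> asyncMaintenance = Caffeine.newBuilder()
            .weakKeys()
            .build();

    // Caller-runs executor: the same maintenance work runs inline on whichever
    // application thread triggered it, as seen in the stack trace above.
    Cache<Object, Object> inlineMaintenance = Caffeine.newBuilder()
            .weakKeys()
            .executor(Runnable::run)
            .build();
}
```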

meiao commented 5 months ago

Yes, we added that workaround due to a bug in Java 17 where, when running on a machine with 1 CPU, the ForkJoinPool would have 0 threads. We should probably revert that, because as it currently stands the cleanup happens on a thread that is processing a request, making that request take longer than it should.

But I don't think that reverting to the ForkJoinPool would solve the problem in this case, because there is a lock during the cleanup that seems to prevent other threads from reading/writing.

We are still investigating.
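One illustrative way to revert while keeping the single-CPU safeguard would be to choose the executor from the processor count (a hypothetical sketch, not the agent's actual code):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class ExecutorGuard {
    // Fall back to caller-runs only where the common pool was unreliable
    // (the single-CPU case mentioned above); otherwise keep maintenance async.
    static final Executor MAINTENANCE_EXECUTOR =
            Runtime.getRuntime().availableProcessors() > 1
                    ? ForkJoinPool.commonPool()
                    : Runnable::run;

    static final Cache<Object, Object> CACHE = Caffeine.newBuilder()
            .weakKeys()
            .executor(MAINTENANCE_EXECUTOR)
            .build();
}
```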

ben-manes commented 5 months ago

While the lock won't block reads, it will eventually block writers. A write buffer is used to absorb bursts of writes and avoid serializing those threads, and a writer will acquire the eviction lock to assist if the write buffer is full. You should see this error in the logs if an eviction appears stuck:

The cache is experiencing excessive wait times for acquiring the eviction lock. This may indicate that a long-running computation has halted eviction when trying to remove the victim entry. Consider using AsyncCache to decouple the computation from the map operation.

The stack trace is odd because ConcurrentHashMap.get is very cheap. That loop dequeues from a ReferenceQueue, obtains the entry, and tries to evict it if present by performing a compute-based removal. If there are millions of pending weak references to evict then I suppose that stack trace might show up, but even then that is the fast part. Usually the compute is slow because it collided with an expensive load in the same hash bin. In those cases it is encouraged to have the value loader compute outside of the map's lock, e.g. by storing a memoizing supplier or future value instead (the FAQ might help).
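The future-value variant of that suggestion looks roughly like this (a sketch with hypothetical types and loader, not the agent's code):

```java
import java.util.concurrent.CompletableFuture;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class FutureValueSketch {
    private final Cache<String, CompletableFuture<String>> cache =
            Caffeine.newBuilder().build();

    String load(String key) {
        // The compute inside the cache only creates the future, so the per-bin lock
        // is held briefly; the expensive work and the wait happen outside the map operation.
        return cache.get(key, k -> CompletableFuture.supplyAsync(() -> expensiveLoad(k)))
                    .join();
    }

    private static String expensiveLoad(String key) {
        return key.toUpperCase();   // stand-in for an expensive value computation
    }
}
```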

kanderson250 commented 3 weeks ago

Original GTSE: https://new-relic.atlassian.net/browse/NR-219675

kanderson250 commented 2 days ago

We reproduced the issue with an app that added hundreds of millions of entries into the cache.

We looked extensively at tweaking various aspects of our cache, including using an async cache as suggested in the Caffeine FAQ, reverting the executor to the default, bumping the initial capacity to mitigate resizing issues, and trying manual instead of atomic writes. None of these tweaks made a significant impact on performance.
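For illustration, the builder-level tweaks look roughly like this (the values and names below are assumptions for the sketch, not our actual experiment code):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class TweaksTried {
    // Pre-sizing the table and leaving the default (async) executor in place.
    static final Cache<Object, Object> TUNED = Caffeine.newBuilder()
            .weakKeys()
            .initialCapacity(1 << 20)   // example value; reduces resize churn
            .build();

    // "Manual" write instead of the atomic compute-style get: avoids the per-bin
    // compute lock at the cost of occasionally creating a duplicate value.
    static Object getExtension(Object instance) {
        Object extension = TUNED.getIfPresent(instance);
        if (extension == null) {
            extension = new Object();
            TUNED.put(instance, extension);
        }
        return extension;
    }
}
```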

Using an unbounded cache is not possible because we require weak key references and an eviction policy to prevent massive memory overhead (we did try this and, as expected, the increased memory use overwhelmed the system).

Ultimately, we determined that the blocking behavior cannot be avoided by the agent in situations with an extremely high number of writes into the cache. Changing this behavior would entail a lower-level overhaul of the agent. Marking this as a won't fix.

ben-manes commented 2 days ago

If there are any details to feed back to the Caffeine project, either public or private, please feel welcome to share.

There was a hint in the thread that millions of objects might become weakly evictable at once and that this was causing the blocking behavior, since they are cleaned up in a single pass. If that is the case and the work needs to be amortized over multiple cache calls, as we do in other cases, then that could be added to the library. Presently I haven't heard that identified as the problem, and reverting to the default executor implies it's not the case, since the eviction would be async. Thus, I'm not familiar enough with the performance issue observed to provide any further support. If it does become something worth revisiting, please feel welcome to reach out.