google / guava

Google core libraries for Java
Apache License 2.0

remove `synchronized` from `LocalCache::get(key, loader)` to allow for `VirtualThread`-friendly value-loading #6845

Closed cortlepp closed 11 months ago

cortlepp commented 11 months ago

API(s)

com.google.common.cache.LocalCache$Segment.lockedGetOrLoad

How do you want it to be improved?

This load should be possible without using synchronized.

Why do we need it to be improved?

Using synchronized here is not performant when the current Thread is a Java 21 VirtualThread and the loader does some blocking operation, because the synchronized block prevents the JVM from unmounting the Thread. The performance could be drastically improved by using e.g. a ReentrantLock.

Example

Thread.ofVirtual().start(() -> {
    Database database = connectToDatabase(); // hypothetical: some already-configured database client
    var databaseLoader = new CacheLoader<String, Object>() {
        @Override
        public Object load(String key) {
            return database.load(key); // loads an object by querying a database over the network
        }
    };
    var cache = CacheBuilder.newBuilder().build(databaseLoader);
    var fooObject = cache.getUnchecked("foo"); // cache miss, invokes the loader
});

Current Behavior

In the above example, the loader is invoked. This invocation happens inside a synchronized block in com.google.common.cache.LocalCache$Segment.lockedGetOrLoad. Because the loader does I/O inside that synchronized block, the VirtualThread cannot be unmounted and therefore blocks its carrier, which is bad for the throughput of the application. This behavior can be observed with any loader that performs a blocking operation when it is invoked from a VirtualThread (just start your JVM with -Djdk.tracePinnedThreads=full). The performance impact depends on how long the loader takes, but since one generally only caches things that are expensive to compute, I believe this is a problem for pretty much anyone using LocalCache in combination with VirtualThreads.

Desired Behavior

com.google.common.cache.LocalCache$Segment.lockedGetOrLoad should work gracefully with VirtualThreads and not block the carrier Thread if a loader does I/O or some other blocking operation.

The method should therefore be refactored to lock using some other mechanism, e.g. a ReentrantLock.
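
For illustration, a rough sketch of what that could look like (the names LockBasedLoader, loadLock and loader are made up for this example and are not Guava's actual internals; the real fix would live inside LocalCache):

    import java.util.concurrent.locks.ReentrantLock;
    import java.util.function.Function;

    // Illustrative only: a per-entry ReentrantLock instead of synchronizing on the entry.
    final class LockBasedLoader<K, V> {
      private final ReentrantLock loadLock = new ReentrantLock();
      private final Function<K, V> loader;

      LockBasedLoader(Function<K, V> loader) {
        this.loader = loader;
      }

      V load(K key) {
        // Fail fast on recursive loads, analogous to the existing Thread.holdsLock(entry) check.
        if (loadLock.isHeldByCurrentThread()) {
          throw new IllegalStateException("Recursive load of key: " + key);
        }
        loadLock.lock();
        try {
          // Blocking I/O here no longer pins a virtual thread to its carrier,
          // because ReentrantLock parks the thread instead of holding a monitor.
          return loader.apply(key);
        } finally {
          loadLock.unlock();
        }
      }
    }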

Concrete Use Cases

We use a LocalCache to cache frequently used database objects. Our loader therefore executes a SQL query over the network, which is a blocking operation and takes a comparatively long time. We are currently experimenting with VirtualThreads in order to improve the throughput of I/O-heavy workloads.


cortlepp commented 11 months ago

I would be happy to work on a PR for this myself, if someone can explain to me what exactly the synchronized is currently trying to prevent. The comment suggests that otherwise something could load "recursively". How can one recursively call this method? And how does a synchronized prevent this; can't it at best only delay it?

I have been trying to wrap my head around this for some time now (without success), and I can't find a test case that covers this.

cpovirk commented 11 months ago

The fix was made before common.cache was open-sourced, so I can't link to a specific commit for it.

The fix cited https://github.com/google/guava/issues/369. It added the following test, which was removed as part of a switch to "white-box" tests, again before the code was open-sourced:

    public void testRecursiveDuplicateComputation() throws InterruptedException {
      final class RecursiveFunction implements Function<String, String> {
        Map<String, String> map;
        @Override public String apply(String key) {
          return map.get(key);
        }
      }
      final RecursiveFunction computeFunction = new RecursiveFunction();
      final ConcurrentMap<String, String> map = new MapMaker()
          .makeComputingMap(computeFunction);
      computeFunction.map = map;

      // tells the test when the computation has completed
      final CountDownLatch doneSignal = new CountDownLatch(1);

      Thread thread = new Thread() {
        @Override public void run() {
          try {
            map.get("foo");
          } finally {
            doneSignal.countDown();
          }
        }
      };
      thread.setUncaughtExceptionHandler(new UncaughtExceptionHandler() {
        @Override public void uncaughtException(Thread t, Throwable e) {}
      });
      thread.start();

      boolean done = doneSignal.await(1, TimeUnit.SECONDS);
      if (!done) {
        StringBuilder builder = new StringBuilder();
        for (StackTraceElement trace : thread.getStackTrace()) {
          builder.append("\tat ").append(trace).append('\n');
        }
        fail(builder.toString());
      }
    }

The diff to the prod code was:

@@ -17,6 +17,7 @@
 package com.google.common.collect;

 import static com.google.common.base.Preconditions.checkNotNull;
+import static com.google.common.base.Preconditions.checkState;

 import com.google.common.base.Equivalence;
 import com.google.common.base.Function;
@@ -119,7 +120,13 @@
             // This thread solely created the entry.
             boolean success = false;
             try {
-              V value = computingValueReference.compute(key, hash);
+              V value = null;
+              // Synchronizes on the entry to allow failing fast when a
+              // recursive computation is detected. This is not full-proof
+              // since the entry may be copied when the segment is written to.
+              synchronized (entry) {
+                value = computingValueReference.compute(key, hash);
+              }
               checkNotNull(value, "compute() returned null unexpectedly");
               success = true;
               return value;
@@ -136,6 +143,7 @@
         try {
           while (true) {
             try {
+              checkState(!Thread.holdsLock(entry), "Recursive computation");
               ValueReference<K, V> valueReference = entry.getValueReference();
               V value = valueReference.waitForValue();
               if (value == null) {

The holdsLock check still exists. I assume that that's what the lock is for.

Perhaps we should be using a ThreadLocal instead? (ThreadLocal also comes up in discussions of "things to avoid when using virtual threads." But I think the concern there is with things like ThreadLocal<byte[]> that can hold lots of memory. So if all we did was set a boolean and then remove it afterward, we'd likely be fine.)
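
Roughly something like this sketch, under the assumption that all we need is a per-thread marker that is removed again in a finally (RecursionGuard and LOADING are illustrative names, not actual LocalCache fields):

    import java.util.function.Function;

    final class RecursionGuard {
      // Marks the current thread while it is running a loader.
      private static final ThreadLocal<Boolean> LOADING = new ThreadLocal<>();

      static <K, V> V loadWithRecursionCheck(K key, Function<K, V> loader) {
        if (Boolean.TRUE.equals(LOADING.get())) {
          // Fail fast instead of deadlocking, like the current holdsLock check.
          throw new IllegalStateException("Recursive load of key: " + key);
        }
        LOADING.set(Boolean.TRUE);
        try {
          // No monitor is held here, so a blocked virtual thread can unmount.
          return loader.apply(key);
        } finally {
          LOADING.remove(); // cleaned up immediately, so no lasting per-thread state
        }
      }
    }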

All that said: Our usual recommendation is to use Caffeine instead of our caches. Caffeine has so far taken the position that it's not worth restructuring the code to support virtual threads because eventually synchronized won't force threads to be pinned (https://github.com/ben-manes/caffeine/issues/1018, https://github.com/ben-manes/caffeine/pull/860).

I don't know whether we'd act on a PR here or not. On the one hand, I'm shocked that this is the only use of synchronized in common.cache, so eliminating it might actually provide meaningful improvements. (I gather from your comment about -Djdk.tracePinnedThreads=full that you haven't seen other pinning, such as inside JDK libraries that common.cache uses?) On the other hand, we really don't want to encourage people to move away from Caffeine, nor give the impression that Guava is virtual-thread-friendly on the whole. (Notably, we have one of those ThreadLocal<char[]> instances in escape.Platform.)

ben-manes commented 11 months ago

My recollection is that this was added in one of my first bug fixes when I was learning that code. The original author was on sabbatical and no one else knew the code very well, but a commonly reported footgun was deadlocks in production due to recursive loads. I didn't want to use a ThreadLocal because those had a reputation for performance issues (multiple rewrites). The alternative was to stash the thread id in the loading entry as extra bookkeeping, but that needed care for proper clean-up. I advocated for the holdsLock trick because an uncontended lock is free, it used no extra memory (just a header bit), there were no observed degradations, and it was a very low-risk change to unfamiliar code. We then unofficially supported recursive loads despite discouraging them, and the same problem came up in later discussions about ConcurrentHashMap. (My name might not be on the internal commit due to my manager having preferred that our team ship CLs to others, to avoid perceptions that 20% projects might distract from the core project's timeline.)

Virtual thread support in mutexes is ongoing work (latest status). It may have been premature to release the feature in Java 21 without it, as it is very easy to deadlock in JDK-only code due to the ample use of synchronization throughout the standard library. The answer isn't to remove all of that; switching to ReentrantLock was only meant as advice for those wanting to experiment for feedback purposes. I wouldn't use VTs in production until these footguns are resolved, which will likely happen before most are ready to upgrade to 21 anyway.

I still stand by my position in Caffeine that it is not worth the effort and instead the JDK team will solve it soonish. It is good to be aware of, but primarily as a reason to avoid adoption of VTs until they are more robust.

cortlepp commented 11 months ago

Ok a couple of things first just so I understand correctly:

Regarding your question about other pinning, @cpovirk: so far I haven't discovered anything else, but I'm still working my way through it.

Regarding the possible fix using a ThreadLocal: I also think this would work, and I would also agree that it should not be a problem performance- or memory-wise using either platform or virtual threads (and since we are on the slow path anyway, I at least would not worry too much about it). Even though it pains me a bit to use a ThreadLocal, I think in this case, since we can just clean everything up inside the finally, this should really be manageable (ThreadLocal creation and destruction would only be ~10 LOC apart). If we just put a static object inside the ThreadLocal (and then check whether it is there or not), the memory footprint should also be fairly minimal. For what it's worth, I think this would be a fairly small fix that would bring a lot of value to everyone who already wants to adopt VirtualThreads, all of their problems aside.

The only question that remains for me is: would you consider a PR for this? I will probably end up either implementing the ThreadLocal solution or just throwing out the synchronized in my fork (I am doing this for a thesis, so I don't absolutely need a real, released, permanent solution, but I would definitely prefer one).

But I also absolutely understand your point, @ben-manes. I just happen to need a solution now, so I kind of just have to deal with all the quirks that VirtualThreads currently have ^^.

cortlepp commented 11 months ago

I went ahead and implemented the ThreadLocal solution in my fork. Unlike what I said yesterday, though, I didn't put a static object inside the ThreadLocal but the entry itself, because otherwise we would also fail on the (maybe a bit weird) legal case where someone tries to load a different key in their loader.

An edge case that this implementation would not detect is if one tries to recursively load the same entry, but with another "normal" loader in the middle. In that case the ThreadLocal would be reset by the second loader, and the third one could no longer determine that loader no. 1 already tried to load this entry. But this really seems very niche to me.

I also wrote a test for it. Please do let me know if you would like me to submit a PR, or if you have other suggestions or comments.

ben-manes commented 11 months ago

Unfortunately, circular recursive calls (a -> b -> a) are not unheard of for longer dependency chains. The recursion is rarely intentional but rather a side effect of calling an application API that circles back eventually. A fast fail is much better than a deadlock, since the developer won't understand at first what went wrong.

Thus, the ThreadLocal would need to capture the list of loading entries. That’s still not horrible or difficult to implement.

Alternatively the loading reference entry would need to capture the thread. This entry type is temporary and swapped out once complete. That would avoid the entry copying concerns that the doc comment mentions since the copy could capture the thread as needed. This is probably the ideal approach, just a bit more work and think time to avoid regressions.
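
A very rough sketch of that idea (LoadingHolder and its members are hypothetical names, not Guava's actual loading value reference):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutionException;

    // Hypothetical stand-in for the temporary "loading" value reference: it
    // remembers which thread started the load so waiters can fail fast on recursion.
    final class LoadingHolder<V> {
      private final Thread loadingThread = Thread.currentThread();
      private final CompletableFuture<V> future = new CompletableFuture<>();

      V waitForValue() throws InterruptedException, ExecutionException {
        if (Thread.currentThread() == loadingThread) {
          // Same detection the synchronized/holdsLock trick provides,
          // but without holding a monitor while the loader runs.
          throw new IllegalStateException("Recursive load");
        }
        return future.get();
      }

      void set(V value) {
        future.complete(value);
      }
    }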

If you can use Caffeine, then AsyncCache is compatible, as it defers the compute to a future, so the monitor is held only briefly to establish the mapping. The executor could be set to a VT executor and the synchronous view used for convenience.
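
Something along these lines, assuming Caffeine 3.x on Java 21 (slowLookup stands in for the blocking database query):

    import java.util.concurrent.Executors;

    import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.LoadingCache;

    class CaffeineVirtualThreadExample {
      public static void main(String[] args) {
        // The load is deferred to a future running on a virtual-thread executor,
        // so no monitor is held across the blocking call inside the loader.
        AsyncLoadingCache<String, Object> asyncCache = Caffeine.newBuilder()
            .executor(Executors.newVirtualThreadPerTaskExecutor())
            .buildAsync(key -> slowLookup(key));

        // The synchronous view gives back the familiar blocking LoadingCache API.
        LoadingCache<String, Object> cache = asyncCache.synchronous();
        Object foo = cache.get("foo"); // cache miss: the loader runs on a virtual thread
        System.out.println(foo);
      }

      // Stand-in for a blocking database query.
      private static Object slowLookup(String key) throws InterruptedException {
        Thread.sleep(100);
        return "value-for-" + key;
      }
    }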

cortlepp commented 11 months ago

@ben-manes I updated my implementation to remember the loading Thread like you suggested and added a test for the "proxied recursive load" scenario that this is supposed to fix (and it seems to be working for me locally). I didn't find any copying of the loading entry though, so I didn't adjust anything there. Also @cpovirk, what do you think about this now?

Thanks for the hint, I'm going to have a look at it. But if possible I would still prefer to fix the issue in guava. We use guava caches quite extensively, and I'm guessing that switching to caffeine would require a pretty major rework.

ben-manes commented 11 months ago

Those changes look good. From a quick review it doesn't seem this entry type needs to be copied (copyEntry / copyFor return itself), so presumably the original doc comment was an incorrect concern.

Caffeine is mostly a drop-in replacement for Guava's cache, with minor differences. The performance difference can be quite stark; see below for a zipfian workload on my MacBook with 16 threads and varying Guava's concurrencyLevel (which defaults to 4). Whether that translates to realizable gains is application-specific, so you might consider capturing a JFR profile to see whether the caches remain a bottleneck after your fix.

[Screenshot: throughput benchmark comparing Caffeine and Guava's cache, 2023-11-24]
cortlepp commented 11 months ago

Ok great. Can I open a PR for this now?

Cool. A migration to Caffeine is definitely on our radar and will happen at some point.

cpovirk commented 11 months ago

No guarantees, but I'd say that opening a PR is likely to be worth your time. I can have a look at it and run some tests. Thanks.

cpovirk commented 10 months ago

The fix is now released as part of 33.0.0.

cpovirk commented 7 months ago

...and, as https://github.com/google/guava/pull/6851#issuecomment-2009725201 notes, the fix was rolled back for 33.1.0.

Given that the fix was really more of a workaround for a temporary issue with virtual threads (on which there has been progress), and given that we continue to recommend Caffeine over Guava's cache, and given that our previous fix caused problems, I don't anticipate that we'll work on this further.