linkedin / venice

Venice, Derived Data Platform for Planet-Scale Workloads.
https://venicedb.org
BSD 2-Clause "Simplified" License

[BUG] Thin-client retry can cause deadlock/starvation #1236

Open xunyin8 opened 3 weeks ago

xunyin8 commented 3 weeks ago

Willingness to contribute

Yes. I can contribute a fix for this bug independently.

Venice version

Open source tag >= 0.4.111

System information

Describe the problem

Thin-client retry is vulnerable to deadlock starting with tag 0.4.111. When retry is enabled via the client config retryOnRouterError or retryOnAllErrors, the retries are performed on the deserialization threads. This is problematic because the deserialization threads now make remote calls and block waiting on the returned future. Once all the deserialization threads are blocked this way, no thread is left to handle transport responses and unblock the deserialization threads that are waiting on a remote call. If the request was made without a timeout, it hangs forever.
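To make the failure mode concrete, here is a minimal standalone sketch of the same pattern using only the JDK. It is not Venice code: the class name, the single-thread pool standing in for the deserialization pool, and the fake "router error" are all assumptions made for illustration. A callback running on the pool thread issues a retry whose response must also be handled on that pool, then blocks on the retry future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Venice.
class BlockingRetrySketch {
  // Returns true when the caller times out because the pool starved itself.
  static boolean blockingRetryStarves(long timeoutMs) {
    // Stand-in for the deserialization pool, sized 1 so one request exhausts it.
    ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      CompletableFuture<String> firstAttempt = new CompletableFuture<>();
      CompletableFuture<String> result = new CompletableFuture<>();

      firstAttempt.whenComplete((value, error) -> {
        if (error == null) {
          result.complete(value);
          return;
        }
        // Retry: its response is queued for handling on the same pool...
        CompletableFuture<String> retry = new CompletableFuture<>();
        pool.execute(() -> retry.complete("value"));
        try {
          // ...but this callback occupies the only pool thread and blocks here.
          result.complete(retry.get());
        } catch (InterruptedException | ExecutionException e) {
          result.completeExceptionally(e);
        }
      });

      // The failed first response arrives on the pool thread,
      // so the callback registered above runs on that thread.
      pool.execute(() ->
          firstAttempt.completeExceptionally(new RuntimeException("router error")));

      try {
        result.get(timeoutMs, TimeUnit.MILLISECONDS);
        return false;
      } catch (TimeoutException e) {
        return true; // without this timeout the caller would hang forever
      } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException(e);
      }
    } finally {
      pool.shutdownNow(); // interrupts the parked thread so the JVM can exit
    }
  }

  public static void main(String[] args) {
    System.out.println("starved: " + blockingRetryStarves(500));
  }
}
```

With a larger pool the same thing happens once every thread is parked inside such a callback, which matches the exhaustion described above.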

Tracking information

Example trace:

"Venice-Store-Deserialization-t8" daemon prio=5 tid=440 WAITING
    at jdk.internal.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
    at java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3463)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3434)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
       Local Variable: java.util.concurrent.CompletableFuture$Signaller#41
       Local Variable: java.util.concurrent.CompletableFuture#275
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
    at com.linkedin.venice.client.store.RetriableStoreClient.lambda$executeWithRetry$2(RetriableStoreClient.java:79)
       Local Variable: com.linkedin.venice.client.store.RetriableStoreClient$$Lambda$1884+0x0000000801b7aa60#8
       Local Variable: com.linkedin.venice.client.store.SpecificRetriableStoreClient#11
    at com.linkedin.venice.client.store.RetriableStoreClient$$Lambda$1887+0x0000000801b7b2e0.accept(<unknown string>)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
       Local Variable: java.util.concurrent.CompletableFuture$AltResult#10
       Local Variable: com.linkedin.venice.client.exceptions.VeniceClientHttpException#8
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
       Local Variable: java.util.concurrent.CompletableFuture$UniWhenComplete#8
       Local Variable: java.util.concurrent.CompletableFuture#151
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
    at com.linkedin.venice.client.store.AbstractAvroStoreClient.lambda$get$3(AbstractAvroStoreClient.java:419)
       Local Variable: java.util.concurrent.CompletableFuture#153
    at com.linkedin.venice.client.store.AbstractAvroStoreClient$$Lambda$1886+0x0000000801b7b0b8.handle(<unknown string>)
    at com.linkedin.venice.client.store.AbstractAvroStoreClient.lambda$requestSubmissionWithStatsHandling$10(AbstractAvroStoreClient.java:589)
    at com.linkedin.venice.client.store.AbstractAvroStoreClient$$Lambda$1789+0x0000000801b2a960.apply(<unknown string>)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
       Local Variable: java.util.concurrent.CompletableFuture#154
       Local Variable: java.util.concurrent.CompletableFuture#155
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       Local Variable: java.util.concurrent.CompletableFuture$UniHandle#8
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#137
    at java.lang.Thread.run(Thread.java:833)
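The trace shows CompletableFuture.get being called from inside a whenComplete callback in RetriableStoreClient.executeWithRetry, on a Venice-Store-Deserialization thread. One possible way to avoid blocking in that position is to chain the retry future instead of waiting on it. The sketch below shows the idea with plain JDK classes. It is not the actual Venice fix, and the class name, the single-thread pool, and the fake transport are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Venice.
class NonBlockingRetrySketch {
  static String nonBlockingRetry(long timeoutMs) {
    // Same single-thread stand-in for the deserialization pool.
    ExecutorService pool = Executors.newFixedThreadPool(1);
    try {
      CompletableFuture<String> firstAttempt = new CompletableFuture<>();

      // On failure, return the retry future instead of calling get() on it.
      CompletableFuture<CompletableFuture<String>> nested =
          firstAttempt.handle((value, error) -> {
            if (error == null) {
              return CompletableFuture.completedFuture(value);
            }
            CompletableFuture<String> retry = new CompletableFuture<>();
            // The retry response is handled on the pool once the thread is free.
            pool.execute(() -> retry.complete("value"));
            return retry;
          });

      // Flatten: the result completes whenever the inner future completes.
      CompletableFuture<String> result = nested.thenCompose(inner -> inner);

      // The failed first response arrives on the pool thread; the callback
      // returns right away, so the thread is released for the retry response.
      pool.execute(() ->
          firstAttempt.completeExceptionally(new RuntimeException("router error")));

      return result.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("result: " + nonBlockingRetry(2000));
  }
}
```

Because the callback never parks, a pool of any size can keep handling transport responses while retries are in flight. Running the blocking wait on a separate executor would be another option, at the cost of an extra thread pool.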

Code to reproduce bug

No response

What component(s) does this bug affect?