Willingness to contribute
Yes. I can contribute a fix for this bug independently.
Venice version
Open source tag >= 0.4.111
System information
OS Platform and Distribution (e.g., Linux Ubuntu 20.0): N/A
JDK version: All
Describe the problem
Thin-client retry is vulnerable to deadlock as of tag 0.4.111. When retry is enabled via the client config retryOnRouterError or retryOnAllErrors, the retries are performed on the deserialization threads. This is problematic because the deserialization threads then make remote calls and block waiting on the returned future. Once every deserialization thread is blocked this way, no thread is left to handle the transport response and unblock the threads that are waiting on a remote call. If the request was made without a timeout, it hangs forever.
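The failure mode can be reproduced outside Venice with plain CompletableFuture. The sketch below is not the Venice code: RetryDeadlockSketch, flakyCall, blockingRetry, nonBlockingRetry and outcome are hypothetical names, and a single-threaded scheduled executor stands in for the deserialization pool. It contrasts a retry that blocks inside the completion callback (the pattern visible in the trace) with one that composes the retry future without blocking, which is one possible direction for a fix.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RetryDeadlockSketch {
  /**
   * Hypothetical stand-in for a transport call: the response is delivered on
   * `pool` after a short delay, so callers can register callbacks first.
   * The first attempt fails; every later attempt succeeds.
   */
  static Supplier<CompletableFuture<String>> flakyCall(ScheduledExecutorService pool) {
    AtomicInteger attempts = new AtomicInteger();
    return () -> {
      CompletableFuture<String> response = new CompletableFuture<>();
      int attempt = attempts.incrementAndGet();
      pool.schedule(() -> {
        if (attempt == 1) {
          response.completeExceptionally(new RuntimeException("router error"));
        } else {
          response.complete("value");
        }
      }, 100, TimeUnit.MILLISECONDS);
      return response;
    };
  }

  /** Retry that blocks the completing (pool) thread on the retry future. */
  static CompletableFuture<String> blockingRetry(
      Supplier<CompletableFuture<String>> call, long waitMs) {
    CompletableFuture<String> result = new CompletableFuture<>();
    call.get().whenComplete((value, error) -> {
      if (error == null) {
        result.complete(value);
        return;
      }
      try {
        // Runs on the pool thread; the retry response needs that same thread.
        // The timeout is only here so the demo terminates instead of hanging.
        result.complete(call.get().get(waitMs, TimeUnit.MILLISECONDS));
      } catch (Exception e) {
        result.completeExceptionally(e);
      }
    });
    return result;
  }

  /** Retry that chains the second attempt without blocking any thread. */
  static CompletableFuture<String> nonBlockingRetry(Supplier<CompletableFuture<String>> call) {
    return call.get()
        .handle((value, error) ->
            error == null ? CompletableFuture.completedFuture(value) : call.get())
        .thenCompose(retry -> retry);
  }

  /** Returns the value, or the simple class name of the failure cause. */
  static String outcome(CompletableFuture<String> future) {
    try {
      return future.join();
    } catch (CompletionException e) {
      return e.getCause().getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    try {
      System.out.println(outcome(blockingRetry(flakyCall(pool), 300))); // TimeoutException
      System.out.println(outcome(nonBlockingRetry(flakyCall(pool))));   // value
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With more threads in the pool the blocking variant only fails once enough concurrent requests hit the retry path to occupy every thread, which matches the intermittent nature described above. Without the timeout on the inner get, the blocking variant would never complete.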
Tracking information
Example trace:
"Venice-Store-Deserialization-t8" daemon prio=5 tid=440 WAITING
at jdk.internal.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3463)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3434)
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
Local Variable: java.util.concurrent.CompletableFuture$Signaller#41
Local Variable: java.util.concurrent.CompletableFuture#275
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
at com.linkedin.venice.client.store.RetriableStoreClient.lambda$executeWithRetry$2(RetriableStoreClient.java:79)
Local Variable: com.linkedin.venice.client.store.RetriableStoreClient$$Lambda$1884+0x0000000801b7aa60#8
Local Variable: com.linkedin.venice.client.store.SpecificRetriableStoreClient#11
at com.linkedin.venice.client.store.RetriableStoreClient$$Lambda$1887+0x0000000801b7b2e0.accept(<unknown string>)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
Local Variable: java.util.concurrent.CompletableFuture$AltResult#10
Local Variable: com.linkedin.venice.client.exceptions.VeniceClientHttpException#8
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
Local Variable: java.util.concurrent.CompletableFuture$UniWhenComplete#8
Local Variable: java.util.concurrent.CompletableFuture#151
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)
at com.linkedin.venice.client.store.AbstractAvroStoreClient.lambda$get$3(AbstractAvroStoreClient.java:419)
Local Variable: java.util.concurrent.CompletableFuture#153
at com.linkedin.venice.client.store.AbstractAvroStoreClient$$Lambda$1886+0x0000000801b7b0b8.handle(<unknown string>)
at com.linkedin.venice.client.store.AbstractAvroStoreClient.lambda$requestSubmissionWithStatsHandling$10(AbstractAvroStoreClient.java:589)
at com.linkedin.venice.client.store.AbstractAvroStoreClient$$Lambda$1789+0x0000000801b2a960.apply(<unknown string>)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
Local Variable: java.util.concurrent.CompletableFuture#154
Local Variable: java.util.concurrent.CompletableFuture#155
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Local Variable: java.util.concurrent.CompletableFuture$UniHandle#8
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#137
at java.lang.Thread.run(Thread.java:833)
Code to reproduce bug
No response
What component(s) does this bug affect?
[ ] Controller: This is the control-plane for Venice. Used to create/update/query stores and their metadata.
[ ] Router: This is the stateless query-routing layer for serving read requests.
[ ] Server: This is the component that persists all the store data.
[ ] VenicePushJob: This is the component that pushes derived data from Hadoop to Venice backend.
[ ] VenicePulsarSink: This is a Sink connector for Apache Pulsar that pushes data from Pulsar into Venice.
[X] Thin Client: This is a stateless client users use to query Venice Router for reading store data.
[ ] Fast Client: This is a stateful client users use to query Venice Server for reading store data.
[ ] Da Vinci Client: This is an embedded, stateful client that materializes store data locally.
[ ] Alpini: This is the framework that fast-client and routers use to route requests to the storage nodes that have the data.
[ ] Samza: This is the library users use to make nearline updates to store data.
[ ] Admin Tool: This is the stand-alone client used for ad-hoc operations on Venice.
[ ] Scripts: These are the various ops scripts in the repo.