googleapis / sdk-platform-java

Tooling and shared libraries for Cloud SDK for Java
https://cloud.google.com/java/docs/bom
Apache License 2.0

feat: Enforce RPC deadlines even when GRPC does not #1319

Open dpcollins-google opened 2 years ago

dpcollins-google commented 2 years ago

Environment details

Steps to reproduce

  1. Create a singleton RPC call for the Pub/Sub Lite cursor service using the blocking generated GAX surface, or using the futures surface followed by a call to get()
  2. The call blocks forever (or for at least 100 hours) in some edge cases, despite a timeout of 300 seconds in the service configuration.

chanseokoh commented 2 years ago

Is this a regression that only manifests with new library or dependency versions? And although it sounds like this is not easily reproducible, do you have a small sample or snippet that demonstrates it?

dpcollins-google commented 2 years ago

This is unclear, but I had not experienced it in the past, so it is likely a recent regression (on the order of months, though).

An example code snippet from the Apache Beam repo that triggered this is:

CursorServiceClient newCursorServiceClient() { ... }

newCursorServiceClient()
    .commitCursor(
        CommitCursorRequest.newBuilder()
            .setSubscription(options.subscriptionPath().toString())
            .setPartition(partition.value())
            .setCursor(Cursor.newBuilder().setOffset(offset.value()))
            .build());

chanseokoh commented 2 years ago

I see the transport of pubsublite v1 is gRPC. @vam-google any thoughts?

vam-google commented 2 years ago

@chanseokoh There are no other clients besides java-compute depending on the REST transport right now. So it is safe to assume that all reported issues, if they are not compute related, are gRPC.

dpcollins-google commented 2 years ago

I just created my own pipeline. I'm able to recreate this fairly frequently, where the future takes over a minute to finish. It has the following (truncated) stack trace:

java.util.concurrent.TimeoutException: Waited 1 minutes (plus 834188 nanoseconds delay) for com.google.api.gax.retrying.CallbackChainRetryingFuture@32bd4ca3[status=PENDING]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.SubscriberAssembler.lambda$getCommitter$0(SubscriberAssembler.java:106)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.lambda$processElement$0(PerSubscriptionPartitionSdf.java:88)
    at java.base/java.util.Optional.ifPresent(Optional.java:183)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf.processElement(PerSubscriptionPartitionSdf.java:84)
    at org.apache.beam.sdk.io.gcp.pubsublite.internal.PerSubscriptionPartitionSdf$DoFnInvoker.invokeProcessElement(Unknown Source)
    ...

AlanGasperini commented 2 years ago

P1 out of SLO, please take a look & triage

dpcollins-google commented 2 years ago

To provide more information, it appears that in this case the issue is executor exhaustion at the gRPC layer, which prevents the gRPC future from ever returning. However, it would be useful to enforce deadlines on the GAX future (i.e. complete it early) even if gRPC never completes the request.
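
To illustrate the requested behavior, here is a minimal sketch that uses only the Java standard library. It is not GAX or gRPC code, and the names `DeadlineEnforcer` and `withDeadline` are hypothetical. The idea is to wrap the transport's future in a second future that a dedicated timer thread fails once the deadline passes, so the caller is released even if the transport never responds. The timer runs on its own thread on the assumption that the transport's executor may be exhausted, as described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not part of GAX or gRPC.
public class DeadlineEnforcer {
  // Dedicated daemon timer thread, deliberately separate from any transport
  // executor so that deadlines still fire if that executor is exhausted.
  private static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor(
          r -> {
            Thread t = new Thread(r, "deadline-enforcer");
            t.setDaemon(true);
            return t;
          });

  /**
   * Returns a future that completes with the outcome of transportFuture, or
   * fails with a TimeoutException once the deadline passes, whichever is first.
   */
  public static <T> CompletableFuture<T> withDeadline(
      CompletableFuture<T> transportFuture, long timeout, TimeUnit unit) {
    CompletableFuture<T> result = new CompletableFuture<>();

    // Fail the caller-facing future at the deadline, then make a best-effort
    // attempt to cancel the underlying call.
    ScheduledFuture<?> timer =
        TIMER.schedule(
            () -> {
              TimeoutException e =
                  new TimeoutException("Deadline of " + timeout + " " + unit + " exceeded");
              if (result.completeExceptionally(e)) {
                transportFuture.cancel(true);
              }
            },
            timeout,
            unit);

    // Propagate a normal completion and stop the timer. If the deadline has
    // already fired, these completion attempts are no-ops.
    transportFuture.whenComplete(
        (value, error) -> {
          timer.cancel(false);
          if (error != null) {
            result.completeExceptionally(error);
          } else {
            result.complete(value);
          }
        });
    return result;
  }
}
```

With a wrapper along these lines, a blocking `get()` on the returned future would throw an `ExecutionException` caused by a `TimeoutException` after the deadline, rather than waiting indefinitely on the transport.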

meltsufin commented 2 years ago

@dpcollins-google Is there a corresponding issue filed against gRPC? Also, can we change this to a feature request and downgrade the priority? Thanks!

meltsufin commented 2 years ago

I checked with @dpcollins-google offline and we agreed to change this to a feature request and downgrade it to P2.