grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

grpc channel not honoring connection timeout #11268

Closed. nitinware closed this issue 3 months ago.

nitinware commented 3 months ago

What version of gRPC-Java are you using? io.grpc:grpc-api:jar:1.45.3

What is your environment? OS: linux (RHEL) and openjdk version "11.0.18" 2023-01-17 LTS

What did you expect to see? We expected the connection timeout setting to be honored for the gRPC call.

What did you see instead? The connection timeout setting was not honored for the gRPC call.

Steps to reproduce the bug

ejona86 commented 3 months ago

The stub deadline applies to the entire life of the RPC. The RPC must complete by 3 seconds in the future. It is completely separate from connection timeout.

Note that the stub deadline is a deadline (a point in time) not a timeout (a duration). You are saying "the call must be done by 3:00 PM" (a point in time), not "the call should complete within 3 seconds" (a duration).
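For reference, a minimal sketch of applying a fresh 3-second deadline per call on a blocking stub; the PredictRequest/PredictResponse type names are illustrative, not taken from this issue:

import java.util.concurrent.TimeUnit;

// withDeadlineAfter converts the duration into an absolute deadline at the
// moment it is called, so it should be applied per RPC. A stub decorated once
// at startup would carry a deadline that expires 3 seconds after startup.
PredictResponse predictWithDeadline(
    PredictionServiceGrpc.PredictionServiceBlockingStub stub, PredictRequest request) {
  return stub
      .withDeadlineAfter(3, TimeUnit.SECONDS) // deadline = now + 3s, for this call only
      .predict(request);
}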

ejona86 commented 3 months ago

Also, 1.45 is over two years old. We aren't going to investigate any bugs with that version. The issue could have been fixed already, or the fix could be completely different.

nitinware commented 3 months ago

@ejona86 appreciate your inputs. Can you please share the current stable version of gRPC so we can test with it and check whether this issue is addressed?

ejona86 commented 3 months ago

We maintain (provide new releases for) the two most recent minor releases. So those are ideal. At times we might still investigate something for a release a year old, if we think it'd still impact the current release.

In this case, what you described seemed to be the expected behavior.

nitinware commented 3 months ago

@ejona86 Thanks again for your inputs, appreciate it.

The connection timeout of 1000 ms is not honored when we attempt to connect to a server that is undergoing patching; the gRPC request waits for the full 3000 ms request timeout rather than exiting the call at 1000 ms. Is this expected behavior?

ejona86 commented 3 months ago

CONNECT_TIMEOUT_MILLIS isn't gRPC's API, really. You are poking into Netty, so as far as "expected behavior" it is mostly "whatever it does it does."
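For context, that option is usually passed through NettyChannelBuilder's withOption, which hands it straight to Netty; a minimal sketch assuming a plaintext channel and a placeholder target. It only bounds the TCP connect attempt, not an RPC on an already-established connection:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

// Bounds only how long a TCP connect attempt may take; it does not affect an
// RPC that is already running on an established (possibly dead) connection.
ManagedChannel channel = NettyChannelBuilder
    .forTarget("dns:///example-host:7443")                   // placeholder target
    .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1000)  // forwarded to Netty
    .usePlaintext()
    .build();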

What does "server which is undergoing patching" mean? Is the server process not accept()ing connections? Are existing connections preserved?

There are multiple easy ways the RPC could time out:

nitinware commented 3 months ago
ejona86 commented 3 months ago

You keep saying "10,000 ms" (in the last two comments). Both of those should have been 1000 ms, right?

FWIW, I've not dealt with StatefulSet before.

It sounds like you are using the ordinal name (e.g., myservice-2) to connect to the pod instead of the headless service. So gRPC only sees a single address and doesn't do any load balancing. It also sounds like you are using RollingUpdate.

In all this, I'm assuming RPCs were working on the channel and then they started hanging. There are additional cases possible if this happens on a new channel.

This is pretty pathological for gRPC. When the pod is shut down, gRPC will re-resolve. But it will probably get the old IP address, as the pod is still shutting down. So gRPC will continue using the old IP, and gRPC won't re-resolve from DNS for 30 seconds. When the pod is terminated, the IP no longer exists and (based on your report, I assume) is being black holed. But I'd expect CONNECT_TIMEOUT_MILLIS to improve things.

Definitely better would be to use the headless service, and balance over the various pods. gRPC could then try the various pods itself. Right now it'd still need CONNECT_TIMEOUT_MILLIS. Soon (we released it already but then turned it back off) gRPC will have Happy Eyeballs which can proactively try to connect to additional machines if connect() is taking a little while. Obviously, if your application needs to use the ordinal, then that isn't an option.
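A sketch of that headless-service setup, with a placeholder service name, namespace, and port; pointing the channel at the headless DNS name and enabling round_robin lets gRPC see every pod address instead of a single ordinal:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// The headless service resolves to all pod IPs, so round_robin can keep
// routing to healthy pods while one of them is restarting.
ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///myservice-headless.mynamespace.svc.cluster.local:7443") // placeholder
    .defaultLoadBalancingPolicy("round_robin")
    .usePlaintext()
    .build();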

Assuming you need to use the ordinal names, I suggest:

nitinware commented 3 months ago

@ejona86 Thanks for your detailed response, and I appreciate your persistent help on this issue:

message: GRPC exception thrown
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [remote_addr=intent-predictor-statefulset-2.intent-predictor-headless.predictor.svc.cluster.local/<Pod-IP>:7443]
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at com.sf.e.prediction.grpc.PredictionServiceGrpc$PredictionServiceBlockingStub.predict(PredictionServiceGrpc.java:168)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcClient.execute(ByocGrpcClient.java:25)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcClient.predict(ByocGrpcClient.java:29)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.callGrpcPredictionRequest(ByocGrpcBasedRouter.java:162)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.lambda$executeSingle$63430cae$1(ByocGrpcBasedRouter.java:111)
    at io.vavr.control.Try.of(Try.java:75)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.executeSingle(ByocGrpcBasedRouter.java:105)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.executeSingle(ByocGrpcBasedRouter.java:48)
    at com.sf.e.routing.service.BaseAppTypeRouter.executeSingleWithMetrics(BaseAppTypeRouter.java:366)
    at com.sf.e.routing.service.BaseAppTypeRouter.lambda$routeSingle$97747d4$1(BaseAppTypeRouter.java:221)
    at io.vavr.control.Try.of(Try.java:75)
    at com.sf.e.routing.service.BaseAppTypeRouter.routeSingle(BaseAppTypeRouter.java:212)
    at com.sf.e.routing.service.BaseAppTypeRouter.lambda$executionLogic$19(BaseAppTypeRouter.java:331)
    at com.sf.e.routing.utils.CancellableFuture.execute(CancellableFuture.java:38)
    at com.sf.e.routing.utils.CancellableFuture.lambda$new$0(CancellableFuture.java:28)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

ejona86 commented 3 months ago

@nitinware, for that DEADLINE_EXCEEDED, the client had a working connection to the server and no client-side buffering.

nitinware commented 3 months ago

@ejona86 Sorry for dragging this out. The majority of the time we see an UNAVAILABLE error when the StatefulSet pod is going down and a new one is spinning up. Is there a way to set a connection timeout for this error so that we don't exhaust our entire request timeout with no time left for retries?

ejona86 commented 3 months ago

@nitinware, there's no evidence that your theory of events is happening. And given you are contacting a single pod, CONNECT_TIMEOUT_MILLIS should work. If your theory was correct, that would show up as waiting_for_connection in DEADLINE_EXCEEDED errors, or very long buffered amounts.

Right now it seems more likely that the pod is just really slow after it is started. If it is written in Java, the first request sometimes taking over 3 seconds is very believable.

nitinware commented 3 months ago

@ejona86 Thanks for the inputs. We are not seeing waiting_for_connection in the logs. Sharing an update on the issue after enabling Istio debug logging.

ejona86 commented 3 months ago

> The gRPC call failed because the gRPC connection is stale: it has the stale IP of the StatefulSet pod that was restarted.

As I said earlier, I'd expect the IP to be stale, because gRPC will re-resolve the IP while the pod is still restarting, and so DNS will still return the old IP. No matter what there will be a period where it will use an IP that doesn't work.

If you think the connection is stale, then you aren't shutting down your server properly. The server should close connections (e.g., server.shutdown()) before disappearing.

> We made a fix to refresh the stale gRPC connection, and all subsequent calls pass.

What did you do?

nitinware commented 3 months ago
ejona86 commented 3 months ago

The question you need to investigate is "why is the connection not shut down?" You need to integrate with Kubernetes' shutdown processing and tell the client you are going away. That generally means at least calling server.shutdownNow() and server.awaitTermination() to wait a few seconds for things to get cleaned up.
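A minimal sketch of that kind of shutdown handling on the server side, with illustrative timeouts; the goal is that clients see the connection close rather than an IP that silently black-holes:

import io.grpc.Server;
import java.util.concurrent.TimeUnit;

// Register this before the pod can receive SIGTERM, so Kubernetes' shutdown
// sequence gives the server a chance to close its connections cleanly.
void registerShutdownHook(Server server) {
  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    server.shutdown(); // stop accepting new RPCs and notify connected clients
    try {
      if (!server.awaitTermination(5, TimeUnit.SECONDS)) {
        server.shutdownNow(); // force-cancel anything still in flight
        server.awaitTermination(5, TimeUnit.SECONDS);
      }
    } catch (InterruptedException e) {
      server.shutdownNow();
      Thread.currentThread().interrupt();
    }
  }));
}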

nitinware commented 3 months ago

Not sure if I fully understand. Is there any method in the gRPC Java library that can be used to check whether the channel's underlying connection is active or stale? I tried setting ChannelOption.SO_TIMEOUT but it is not honored either. Thanks, and I appreciate your inputs.

ejona86 commented 3 months ago

I'm saying you should be trying to prevent "stale" connections. Have the server shut down connections. SO_TIMEOUT is only for blocking operations, and we/Netty use nonblocking sockets.

The ways of detecting if a connection is broken need at least 10 seconds before confirming it is broken.
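One option in that direction (not something this thread settled on) is client-side keepalive, which pings the connection and marks it broken if no reply arrives within the timeout. Values below are illustrative, and the server must be configured to tolerate pings this frequent or it may close the connection with "too_many_pings":

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

// Worst case, a dead connection is noticed roughly keepAliveTime + keepAliveTimeout
// after it breaks.
ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///example-host:7443")   // placeholder target
    .keepAliveTime(30, TimeUnit.SECONDS)     // send a ping after 30s without activity
    .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the connection as dead if no ack in 10s
    .usePlaintext()
    .build();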

nitinware commented 3 months ago

@ejona86 How do we detect that a connection is broken; which method should we use? Can we use the notifyWhenStateChanged(..) method to get notified when the connection is broken and attempt to re-create the channel? One correction: we are using ManagedChannelBuilder to build the ManagedChannel as below:

ManagedChannelBuilder<?> builder =
    ManagedChannelBuilder.forTarget(finalTarget).intercept(clientInterceptors).usePlaintext();
ManagedChannel channel = builder.build();

ejona86 commented 3 months ago

notifyWhenStateChanged() will tell you when the connection goes away, but that can happen for reasons other than something breaking, and again, you should be avoiding the connection hanging. It will go TRANSIENT_FAILURE if the connect times out.
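For illustration, notifyWhenStateChanged() is a one-shot callback: it fires once when the state moves away from the value passed in and then has to be re-registered. A sketch that only logs transitions, with no reconnect logic:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

// One-shot notification: re-register after every callback to keep watching.
void watchChannelState(ManagedChannel channel) {
  ConnectivityState seen = channel.getState(false); // false = do not trigger a connect attempt
  channel.notifyWhenStateChanged(seen, () -> {
    ConnectivityState now = channel.getState(false);
    System.out.println("channel state changed: " + seen + " -> " + now);
    watchChannelState(channel); // observe subsequent transitions as well
  });
}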

nitinware commented 3 months ago

Thanks, and I appreciate your inputs. I tried logging the gRPC channel ConnectivityState before making the gRPC call. After running the test, I observed that even for failed requests the ConnectivityState is READY, not TRANSIENT_FAILURE:

ConnectivityState state = this.channel.getState(false);
logger.info("ByocGrpcClient::execute() state={}", state.toString());
ejona86 commented 3 months ago

That means either gRPC reconnected or it hasn't had enough time to determine the connection is dead (most likely the latter). And that means you haven't done as I recommended and shut down the server when the pod shuts down.

nitinware commented 3 months ago

We cannot shut down our server when downstream StatefulSet pods restart; our server/service has SLAs that might be impacted. Also, our server/service is a routing service that routes requests to multiple downstream services. If one of them restarts, we cannot restart the whole server, as it might impact traffic to other unrelated downstream services.

ejona86 commented 3 months ago

No, shut down the downstream sts server when the pod restarts. I'm saying the problem is in the server that your channel is connected to.

nitinware commented 3 months ago

@ejona86 Pardon my ignorance, I'm not sure I fully understood what you mean by shutting down the server. Sharing some background on the StatefulSet services.

ejona86 commented 3 months ago

> we don't have control over these nodes residing on the k8s cluster.

The server is misbehaving. I can't help if you have zero control and can't inform them that it is broken. You should accept the latency.

ejona86 commented 3 months ago

Seems like this is resolved as best as it can be. Detecting TCP connection failures takes seconds. The server should shut down its connections instead of going into a black hole. If it is not resolved and there's more we can help with, comment, and it can be reopened.