grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

grpc channel not honoring connection timeout #11268

Closed. nitinware closed this issue 3 months ago.

nitinware commented 3 months ago

What version of gRPC-Java are you using? io.grpc:grpc-api:jar:1.45.3

What is your environment? OS: linux (RHEL) and openjdk version "11.0.18" 2023-01-17 LTS

What did you expect to see? We expected the connection timeout setting to be honored for the gRPC call.

What did you see instead? The connection timeout setting was not honored for the gRPC call.

Steps to reproduce the bug

ejona86 commented 3 months ago

The stub deadline applies to the entire life of the RPC. The RPC must complete by 3 seconds in the future. It is completely separate from connection timeout.

Note that the stub deadline is a deadline (a point in time) not a timeout (a duration). You are saying "the call must be done by 3:00 PM" (a point in time), not "the call should complete within 3 seconds" (a duration).
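For reference, a minimal sketch of applying a fresh 3-second deadline per call on a blocking stub; the PredictRequest/PredictResponse type names are illustrative, not taken from this issue:

import java.util.concurrent.TimeUnit;

// withDeadlineAfter converts the duration into an absolute deadline at the
// moment it is called, so it should be applied per RPC. A stub decorated once
// at startup would carry a deadline that expires 3 seconds after startup.
PredictResponse predictWithDeadline(
    PredictionServiceGrpc.PredictionServiceBlockingStub stub, PredictRequest request) {
  return stub
      .withDeadlineAfter(3, TimeUnit.SECONDS) // deadline = now + 3s, for this call only
      .predict(request);
}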

ejona86 commented 3 months ago

Also, 1.45 is over two years old. We aren't going to investigate any bugs with that version. The issue could have been fixed already, or the fix could be completely different.

nitinware commented 3 months ago

@ejona86 appreciate your inputs. Can you please share the current stable version of gRPC so we can test with it and check whether this issue is addressed?

ejona86 commented 3 months ago

We maintain (provide new releases for) the two most recent minor releases. So those are ideal. At times we might still investigate something for a release a year old, if we think it'd still impact the current release.

In this case, what you described seemed to be the expected behavior.

nitinware commented 3 months ago

@ejona86 Thanks again for your inputs, appreciate it.

The connection timeout of 1000 ms is not honored when we attempt to connect to a server that is undergoing patching; the gRPC request waits for the full 3000 ms request timeout rather than exiting the call at 1000 ms. Is this expected behavior?

ejona86 commented 3 months ago

CONNECT_TIMEOUT_MILLIS isn't gRPC's API, really. You are poking into Netty, so as far as "expected behavior" it is mostly "whatever it does it does."
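For context, that option is usually passed through NettyChannelBuilder's withOption, which hands it straight to Netty; a minimal sketch assuming a plaintext channel and a placeholder target. It only bounds the TCP connect attempt, not an RPC on an already-established connection:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.ChannelOption;

// Bounds only how long a TCP connect attempt may take; it does not affect an
// RPC that is already running on an established (possibly dead) connection.
ManagedChannel channel = NettyChannelBuilder
    .forTarget("dns:///example-host:7443")                   // placeholder target
    .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 1000)  // forwarded to Netty
    .usePlaintext()
    .build();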

What does "server which is undergoing patching" mean? Is the server process not accept()ing connections? Are existing connections preserved?

There are multiple easy ways the RPC could time out:

nitinware commented 3 months ago
ejona86 commented 3 months ago

You keep saying "10,000 ms" (in the last two comments). Both of those should have been 1000 ms, right?

FWIW, I've not dealt with StatefulSet before.

It sounds like you are using the ordinal name (e.g., myservice-2) to connect to the pod instead of the headless service. So gRPC only sees a single address and doesn't do any load balancing. It also sounds like you are using RollingUpdate.

In all this, I'm assuming RPCs were working on the channel and then they started hanging. There are additional cases possible if this happens on a new channel.

This is pretty pathological for gRPC. When the pod is shut down, gRPC will re-resolve. But it will probably get the old IP address, as the pod is still shutting down. So gRPC will continue using the old IP, and gRPC won't re-resolve from DNS for 30 seconds. When the pod is terminated, the IP no longer exists and (based on your report, I assume) is being black holed. But I'd expect CONNECT_TIMEOUT_MILLIS to improve things.

Definitely better would be to use the headless service, and balance over the various pods. gRPC could then try the various pods itself. Right now it'd still need CONNECT_TIMEOUT_MILLIS. Soon (we released it already but then turned it back off) gRPC will have Happy Eyeballs which can proactively try to connect to additional machines if connect() is taking a little while. Obviously, if your application needs to use the ordinal, then that isn't an option.
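A sketch of that headless-service setup, with a placeholder service name, namespace, and port; pointing the channel at the headless DNS name and enabling round_robin lets gRPC see every pod address instead of a single ordinal:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// The headless service resolves to all pod IPs, so round_robin can keep
// routing to healthy pods while one of them is restarting.
ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///myservice-headless.mynamespace.svc.cluster.local:7443") // placeholder
    .defaultLoadBalancingPolicy("round_robin")
    .usePlaintext()
    .build();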

Assuming you need to use the ordinal names, I suggest:

nitinware commented 3 months ago

@ejona86 Thanks for your detailed response, and I appreciate your persistent help on this issue:

message: GRPC exception thrown
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: ClientCall was cancelled at or after deadline. [remote_addr=intent-predictor-statefulset-2.intent-predictor-headless.predictor.svc.cluster.local/<Pod-IP>:7443]
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at com.sf.e.prediction.grpc.PredictionServiceGrpc$PredictionServiceBlockingStub.predict(PredictionServiceGrpc.java:168)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcClient.execute(ByocGrpcClient.java:25)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcClient.predict(ByocGrpcClient.java:29)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.callGrpcPredictionRequest(ByocGrpcBasedRouter.java:162)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.lambda$executeSingle$63430cae$1(ByocGrpcBasedRouter.java:111)
    at io.vavr.control.Try.of(Try.java:75)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.executeSingle(ByocGrpcBasedRouter.java:105)
    at com.sf.e.routing.selfserve.grpc.ByocGrpcBasedRouter.executeSingle(ByocGrpcBasedRouter.java:48)
    at com.sf.e.routing.service.BaseAppTypeRouter.executeSingleWithMetrics(BaseAppTypeRouter.java:366)
    at com.sf.e.routing.service.BaseAppTypeRouter.lambda$routeSingle$97747d4$1(BaseAppTypeRouter.java:221)
    at io.vavr.control.Try.of(Try.java:75)
    at com.sf.e.routing.service.BaseAppTypeRouter.routeSingle(BaseAppTypeRouter.java:212)
    at com.sf.e.routing.service.BaseAppTypeRouter.lambda$executionLogic$19(BaseAppTypeRouter.java:331)
    at com.sf.e.routing.utils.CancellableFuture.execute(CancellableFuture.java:38)
    at com.sf.e.routing.utils.CancellableFuture.lambda$new$0(CancellableFuture.java:28)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

ejona86 commented 3 months ago

@nitinware, for that DEADLINE_EXCEEDED, the client had a working connection to the server and no client-side buffering.

nitinware commented 3 months ago

@ejona86 Sorry for dragging this out. The majority of the time we see an UNAVAILABLE error when the StatefulSet pod is going down and a new one is spinning up. Is there a way to set a connection timeout for this error so that we don't exhaust our entire request timeout with no time left for retries?

ejona86 commented 3 months ago

@nitinware, there's no evidence that your theory of events is happening. And given you are contacting a single pod, CONNECT_TIMEOUT_MILLIS should work. If your theory was correct, that would show up as waiting_for_connection in DEADLINE_EXCEEDED errors, or very long buffered amounts.

Right now it seems more likely that the pod is just really slow after it is started. If it is written in Java, the first request sometimes taking over 3 seconds is very believable.

nitinware commented 3 months ago

@ejona86 Thanks for the inputs. We are not seeing waiting_for_connection in the logs. Sharing an update on the issue after enabling Istio debug logging.

ejona86 commented 3 months ago

> The gRPC call failed because the gRPC connection is stale: it has the stale IP of the StatefulSet pod that was restarted.

As I said earlier, I'd expect the IP to be stale, because gRPC will re-resolve the IP while the pod is still restarting, and so DNS will still return the old IP. No matter what there will be a period where it will use an IP that doesn't work.

If you think the connection is stale, then you aren't shutting down your server properly. The server should close connections (e.g., server.shutdown()) before disappearing.

> We made a fix to refresh the stale gRPC connection, and all subsequent calls pass.

What did you do?

nitinware commented 3 months ago
ejona86 commented 3 months ago

The question you need to investigate is "why is the connection not shut down?" You need to integrate with Kubernetes' shutdown processing and tell the client you are going away. That generally means at least calling server.shutdownNow() and server.awaitTermination() to wait a few seconds for things to get cleaned up.
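A minimal sketch of that kind of shutdown handling on the server side, with illustrative timeouts; the goal is that clients see the connection close rather than an IP that silently black-holes:

import io.grpc.Server;
import java.util.concurrent.TimeUnit;

// Register this before the pod can receive SIGTERM, so Kubernetes' shutdown
// sequence gives the server a chance to close its connections cleanly.
void registerShutdownHook(Server server) {
  Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    server.shutdown(); // stop accepting new RPCs and notify connected clients
    try {
      if (!server.awaitTermination(5, TimeUnit.SECONDS)) {
        server.shutdownNow(); // force-cancel anything still in flight
        server.awaitTermination(5, TimeUnit.SECONDS);
      }
    } catch (InterruptedException e) {
      server.shutdownNow();
      Thread.currentThread().interrupt();
    }
  }));
}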

nitinware commented 3 months ago

Not sure if I fully understand. Is there any method in the gRPC Java library that can be used to check whether the channel's underlying connection is active or stale? I tried setting ChannelOption.SO_TIMEOUT but it is not honored either. Thanks, and I appreciate your inputs.

ejona86 commented 3 months ago

I'm saying you should be trying to prevent "stale" connections. Have the server shut down connections. SO_TIMEOUT is only for blocking operations, and we/Netty use nonblocking sockets.

The ways of detecting if a connection is broken need at least 10 seconds before confirming it is broken.
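One option in that direction (not something this thread settled on) is client-side keepalive, which pings the connection and marks it broken if no reply arrives within the timeout. Values below are illustrative, and the server must be configured to tolerate pings this frequent or it may close the connection with "too_many_pings":

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

// Worst case, a dead connection is noticed roughly keepAliveTime + keepAliveTimeout
// after it breaks.
ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///example-host:7443")   // placeholder target
    .keepAliveTime(30, TimeUnit.SECONDS)     // send a ping after 30s without activity
    .keepAliveTimeout(10, TimeUnit.SECONDS)  // treat the connection as dead if no ack in 10s
    .usePlaintext()
    .build();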

nitinware commented 3 months ago

@ejona86 How do we detect that a connection is broken; which method should we use? Can we use the notifyWhenStateChanged(..) method to get notified when the connection is broken and attempt to re-create the channel? One correction: we are using ManagedChannelBuilder to build the ManagedChannel as below:

ManagedChannelBuilder<?> builder =
    ManagedChannelBuilder.forTarget(finalTarget).intercept(clientInterceptors).usePlaintext();
ManagedChannel channel = builder.build();

ejona86 commented 3 months ago

notifyWhenStateChanged() will tell you when the connection goes away, but that can happen for reasons other than something breaking, and again, you should be avoiding the connection hanging. It will go TRANSIENT_FAILURE if the connect times out.
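For illustration, notifyWhenStateChanged() is a one-shot callback: it fires once when the state moves away from the value passed in and then has to be re-registered. A sketch that only logs transitions, with no reconnect logic:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

// One-shot notification: re-register after every callback to keep watching.
void watchChannelState(ManagedChannel channel) {
  ConnectivityState seen = channel.getState(false); // false = do not trigger a connect attempt
  channel.notifyWhenStateChanged(seen, () -> {
    ConnectivityState now = channel.getState(false);
    System.out.println("channel state changed: " + seen + " -> " + now);
    watchChannelState(channel); // observe subsequent transitions as well
  });
}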

nitinware commented 3 months ago

Thanks, and I appreciate your inputs. I tried logging the gRPC channel ConnectivityState before making the gRPC call. After running the test, I observed that even for failed requests the ConnectivityState is READY, not TRANSIENT_FAILURE:

ConnectivityState state = this.channel.getState(false);
logger.info("ByocGrpcClient::execute() state={}", state.toString());
ejona86 commented 3 months ago

That means either gRPC reconnected or it hasn't had enough time to determine the connection is dead (most likely the latter). And that means you haven't done as I recommended and shut down the server when the pod shuts down.

nitinware commented 3 months ago

We cannot shut down our server when downstream StatefulSet pods restart; our server/service has SLAs that might be impacted. Also, our server/service is a routing service that routes requests to multiple downstream services. If one of them restarts, we cannot restart the whole server, as it might impact traffic to other unrelated downstream services.

ejona86 commented 3 months ago

No, shut down the downstream sts server when the pod restarts. I'm saying the problem is in the server that your channel is connected to.

nitinware commented 3 months ago

@ejona86 Pardon my ignorance, I'm not sure I fully understood what you mean by shutting down the server. Sharing some background on the StatefulSet services.

ejona86 commented 3 months ago

> we don't have control over these nodes residing on the k8s cluster.

The server is misbehaving. I can't help if you have zero control and can't inform them that it is broken. You should accept the latency.

ejona86 commented 3 months ago

Seems like this is resolved as best as it can be. Detecting TCP connection failures takes seconds. The server should shut down its connections instead of going into a black hole. If it is not resolved and there's more we can help with, comment, and it can be reopened.