hyperium / tonic

A native gRPC client & server implementation with async/await support.
https://docs.rs/tonic
MIT License
10.02k stars 1.02k forks source link

Support GRPC Keepalive without calls #258

Closed dangerousplay closed 4 years ago

dangerousplay commented 4 years ago

Feature Request

Hello, the GRPC has it's own way of doing keepalive, besides you could use the SO_KEEPALIVE flag on the TCP socket. Currently, the GRPC HTTP 2 keepalive is not implemented. The ping part of the protocol can be found here. Here is the documentation of the GRPC part on this.

The keepalive ping is a way to check if a channel is currently working by sending HTTP2 pings over the transport. It is sent periodically, and if the ping is not acknowledged by the peer within a certain timeout period, the transport is disconnected.

Motivation

Without keepalive, long idle connections will be dropped, especially if you have, for instance, a network load balancer that doesn't know the protocol used.

Proposal

Implement the GRPC oficial keepalive solution using HTTP 2 ping frames. The server needs to schedule those ping commands and handle the ACK (and the client).

GRPC - https://github.com/grpc/grpc/blob/master/doc/keepalive.md HTTP 2 - https://http2.github.io/http2-spec/#PING

From the GRPC FAC:

FAQ

  • When is the keepalive timer started?
  • The keepalive timer is started when a transport is done connecting (after handshake).
  • What happens when the keepalive timer fires?
  • When the keepalive timer fires, gRPC Core will try to send a keepalive ping on the transport. This ping can be blocked if -
  • there is no active call on that transport and GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS is false.
  • the number of pings already sent on the transport without any data has already exceeded GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA.
  • the time elapsed since the previous ping is less than GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS.
  • If a keepalive ping is not blocked and is sent on the transport, then the keepalive watchdog timer is started which will close the transport if the ping is not acknowledged before it fires.
  • Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
  • A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings. For example -
  • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false and the client sends pings without there being any call in flight.
  • if the client's GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.
LucioFranco commented 4 years ago

Looking into this a bit more it looks like h2 doesn't support sending custom data on the keepalive. https://docs.rs/h2/0.2.1/h2/struct.Ping.html

The other problem here will be that hyper doens't support ping either. So I would think for this custom behavior we would want to have a custom transport. But I don't like the idea of maintaining something that is not hyper. It might be worth it to open an issue for custom pings in the hyper repo.

Ten0 commented 4 years ago

+1 This would currently be a blocker for us for moving from grpc-rs.

LucioFranco commented 4 years ago

I am a bit short on time today but the next steps would be to open the corresponding issues in h2 and hyper around custom pings. The problem I see here that will make this really hard is that we have not figured out how to send a keepalive ping via tower-service. Most likely we will have to offload some of this to hyper.

cc @seanmonstar

seanmonstar commented 4 years ago

Some more details:

LucioFranco commented 4 years ago

I did a little searching this is what I found:

https://github.com/grpc/grpc-go/blob/6b9bf4296edc5fae722a5dff887a954ffc599b12/internal/transport/http2_client.go#L1325

https://github.com/grpc/grpc-java/blob/a408d086bd12ea9049424de3ce325cc65420c231/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L103

https://github.com/grpc/grpc-java/blob/a408d086bd12ea9049424de3ce325cc65420c231/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L871

So very inconclusive.

Yeah, I think it makes sense for hyper to expose some sort of config option for this. H3 still seems far away but we can introduce it as an http2 option?

LucioFranco commented 4 years ago

Also looks like we may need to echo exactly what is pinged to us in the ping frame.

Reference: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md#ping-frame

seanmonstar commented 4 years ago

h2 already automatically responds to PING frames received over the wire.

seanmonstar commented 4 years ago

Relevant PR in hyper providing generic HTTP2 keep-alive support: https://github.com/hyperium/hyper/pull/2151

seanmonstar commented 4 years ago

hyper v0.13.4 now includes options to set some HTTP2 keep-alive options. So, next would be taking advantage of them in tonic.

LucioFranco commented 4 years ago

This has been done in #307

dfreese commented 3 years ago

I was testing this out, and it does seem like there is a slight mismatch from the spec in terms of error codes returned. In hyperium/hyper#2151, KeepAliveTimedOut from h2 seems to get mapped to h2::Reason::INTERNAL_ERROR, but from here:

An expired client initiated PING will cause all calls to be closed with an UNAVAILABLE status. Note that the frequency of PINGs is highly dependent on the network environment, implementations are free to adjust PING frequency based on network and application requirements.

The error I was seeing was:

[status: Internal, message: "h2 protocol error: protocol error: unexpected internal error encountered", details: [], metadata: MetadataMap { headers: {} }]

Does tonic get the information it needs to be able to map that to UNAVAILABLE, or would that require a change in h2?

LucioFranco commented 3 years ago

@seanmonstar ^ we likely want to forward that?

seanmonstar commented 3 years ago

We could maybe get hyper's error type to have is_timeout() return true in that case also, and then maybe tonic could look for that? Or maybe tonic doesn't have that tight coupling with hyper.

dfreese commented 3 years ago

I was able to take a closer look at hyper::Error and how it might integrate with tonic::Status. I'll follow up with that in https://github.com/hyperium/tonic/pull/629 once I've had a chance to clean up some of the tests. It seems like there is a reasonable way to handle both the is_timeout() case and connection errors that I was seeing in that PR.

fabianbergmark commented 1 year ago

How can the tonic server be configured to allow PERMIT_KEEPALIVE_WITHOUT_CALLS? Currently I get RESOURCE_EXHAUSTED: HTTP/2 error code: ENHANCE_YOUR_CALM (Bandwidth exhausted) when I configure my client with keepAliveWithoutCalls = true