ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.
ISC License
898 stars 231 forks source link

Connection keeps breaking with gun_down #158

Closed ospaarmann closed 6 years ago

ospaarmann commented 6 years ago

Hey,

I use the Elixir gRPC implementation https://github.com/tony612/grpc-elixir to build a Dgraph client written in Elixir (https://github.com/ospaarmann/exdgraph). grpc-elixir in turn uses gun (1.0.0-pre.5).

My connection keeps breaking every X seconds with gun_down and reason normal.

[error] ExDgraph.Protocol #PID<0.280.0> received unexpected message: {:gun_down, #PID<0.288.0>, :http2, :normal, [], []}

I don't see anything in the server Dgraph server logs. The connection isn't using ssl. Interestingly it happens very regularly every 5 or so seconds so.

I am a bit lost on how to debug this and I have little experience with Erlang, only Elixir so far. My apologies that I cannot provide more information right now but I can deliver whatever helps.

Any hint or help is greatly appreciated! ❤️

essen commented 6 years ago

Try setting a lower keepalive value.

ospaarmann commented 6 years ago

Hey @essen,

thank you for the suggestion. I tried lowering the keepalive value in the http2 options. This caused the error to appear more often. With the default value of keepalive (5000) it appeared roughly every 20 seconds. Increasing it to keepalive (10000) let to the error appearing every 40 seconds.

Now I set keepalive to :infinity. This seems to fix the issue. Are there any downsides to this that you are aware of?

Thanks

essen commented 6 years ago

Sounds like there's a bug either in Gun's ping message or in the server.

The downside is that this disables pings entirely and so you might not know when a server is gone until the write buffers are full.

essen commented 6 years ago

What's the server?

ospaarmann commented 6 years ago

True that @essen. I will contact the devs from Dgraph and report back if they can identify a bug so that we can narrow it down.

ospaarmann commented 6 years ago

Server is Dgraph / https://github.com/dgraph-io/dgraph .

ospaarmann commented 6 years ago

My guess is that this is rather a bug in gun then. The reason for my suspicion is that Dgraph has clients in Go, Java, Javascript, Python, etc. and they don't seem to have this issue.

essen commented 6 years ago

Feel free to double check the code:

Would be good to have server logs for why the connection was dropped, or to trace the modules gun and gun_http2 to get that same reason.

ospaarmann commented 6 years ago

The server doesn't write anything to the logs when the connection ends.

How would I go about tracing gun and gun_http2 in an Elixir application?

essen commented 6 years ago

In Erlang:

dbg:start().
dbg:tracer().
dbg:tpl(gun, []).
dbg:tpl(gun_http2, []).
dbg:p(all ,c).
ospaarmann commented 6 years ago

Thanks. Elixir was very similar since you can call Erlang modules with :module.fun(). So

:dbg.start()
:dbg.tracer()
:dbg.tpl(:gun, [])
:dbg.tpl(:gun_http2, [])
:dbg.p(:all ,:c)

The raw output of the trace. I in between are the gun_down messages from my client: https://gist.github.com/ospaarmann/c592fa984f775810b2698bf0b4d82228

essen commented 6 years ago

Server is dropping the connection with {goaway,0,enhance_your_calm,<<"too_many_pings">>}.

I suppose 5 seconds is too much? Not sure why, but I guess setting a larger value would fix your issue.

Cheers.

ospaarmann commented 6 years ago

Thank you for the fast support! I will try that.

ryanwinchester commented 3 years ago

@ospaarmann did you ever fix this without setting keepalive to infinity?

essen commented 3 years ago

For what it's worth the default in 2.0 will be infinity. See https://github.com/ninenines/gun/commit/556645d8d73b1de8ba453ee2fba2bc659a7962f6

ryanwinchester commented 3 years ago

@essen Yeah, I've been trying to find out more info all day and finally getting into some stuff that makes sense. As you probably expected, it seems it's likely just a mismatch in the keepalive settings between gun's default and grpc server defaults.

Example from grpc core there is lots of good info: https://grpc.github.io/grpc/cpp/md_doc_keepalive.html

I think it probably has to do with:

GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS
  • If there are no data/header frames being sent on the transport, this channel argument on the server side controls the minimum time (in milliseconds) that gRPC Core would expect between receiving successive pings. If the time between successive pings is less that than this time, then the ping will be considered a bad ping from the peer. Such a ping counts as a ‘ping strike’. On the client side, this does not have any effect.

And from the FAQ, the last bullet point in the last question:

FAQ

  • Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
    • A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings. For example:
      • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false and the client sends pings without there being any call in flight.
      • if the client's GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.

Defaults

Channel Argument Client Server
GRPC_ARG_KEEPALIVE_TIME_MS INT_MAX (disabled) 7200000 (2 hours)
GRPC_ARG_KEEPALIVE_TIMEOUT_MS 20000 (20 seconds) 20000 (20 seconds)
GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS 0 (false) 0 (false)
GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA 2 2
GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS 300000 (5 minutes) 300000 (5 minutes)
GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS N/A 300000 (5 minutes)
GRPC_ARG_HTTP2_MAX_PING_STRIKES N/A 2
essen commented 3 years ago

And if servers are going to have these kinds of defaults it's not worth doing pings without prior knowledge that they're going to work.