Closed ospaarmann closed 6 years ago
Try setting a lower keepalive value.
Hey @essen,
thank you for the suggestion. I tried lowering the keepalive
value in the http2
options. This caused the error to appear more often. With the default value of keepalive (5000)
it appeared roughly every 20 seconds. Increasing it to keepalive (10000)
let to the error appearing every 40 seconds.
Now I set keepalive to :infinity
. This seems to fix the issue. Are there any downsides to this that you are aware of?
Thanks
Sounds like there's a bug either in Gun's ping message or in the server.
The downside is that this disables pings entirely and so you might not know when a server is gone until the write buffers are full.
What's the server?
True that @essen. I will contact the devs from Dgraph and report back if they can identify a bug so that we can narrow it down.
Server is Dgraph / https://github.com/dgraph-io/dgraph .
My guess is that this is rather a bug in gun then. The reason for my suspicion is that Dgraph has clients in Go, Java, Javascript, Python, etc. and they don't seem to have this issue.
Feel free to double check the code:
Would be good to have server logs for why the connection was dropped, or to trace the modules gun and gun_http2 to get that same reason.
The server doesn't write anything to the logs when the connection ends.
How would I go about tracing gun and gun_http2 in an Elixir application?
In Erlang:
dbg:start().
dbg:tracer().
dbg:tpl(gun, []).
dbg:tpl(gun_http2, []).
dbg:p(all ,c).
Thanks. Elixir was very similar since you can call Erlang modules with :module.fun()
. So
:dbg.start()
:dbg.tracer()
:dbg.tpl(:gun, [])
:dbg.tpl(:gun_http2, [])
:dbg.p(:all ,:c)
The raw output of the trace. I in between are the gun_down messages from my client: https://gist.github.com/ospaarmann/c592fa984f775810b2698bf0b4d82228
Server is dropping the connection with {goaway,0,enhance_your_calm,<<"too_many_pings">>}
.
I suppose 5 seconds is too much? Not sure why, but I guess setting a larger value would fix your issue.
Cheers.
Thank you for the fast support! I will try that.
@ospaarmann did you ever fix this without setting keepalive
to infinity
?
For what it's worth the default in 2.0 will be infinity
. See https://github.com/ninenines/gun/commit/556645d8d73b1de8ba453ee2fba2bc659a7962f6
@essen Yeah, I've been trying to find out more info all day and finally getting into some stuff that makes sense. As you probably expected, it seems it's likely just a mismatch in the keepalive
settings between gun's default and grpc server defaults.
Example from grpc core there is lots of good info: https://grpc.github.io/grpc/cpp/md_doc_keepalive.html
I think it probably has to do with:
GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS
- If there are no data/header frames being sent on the transport, this channel argument on the server side controls the minimum time (in milliseconds) that gRPC Core would expect between receiving successive pings. If the time between successive pings is less that than this time, then the ping will be considered a bad ping from the peer. Such a ping counts as a ‘ping strike’. On the client side, this does not have any effect.
And from the FAQ, the last bullet point in the last question:
FAQ
- Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
- A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings. For example:
- if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false and the client sends pings without there being any call in flight.
- if the client's GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.
Channel Argument | Client | Server |
---|---|---|
GRPC_ARG_KEEPALIVE_TIME_MS | INT_MAX (disabled) | 7200000 (2 hours) |
GRPC_ARG_KEEPALIVE_TIMEOUT_MS | 20000 (20 seconds) | 20000 (20 seconds) |
GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS | 0 (false) | 0 (false) |
GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA | 2 | 2 |
GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS | 300000 (5 minutes) | 300000 (5 minutes) |
GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS | N/A | 300000 (5 minutes) |
GRPC_ARG_HTTP2_MAX_PING_STRIKES | N/A | 2 |
And if servers are going to have these kinds of defaults it's not worth doing pings without prior knowledge that they're going to work.
Hey,
I use the Elixir gRPC implementation https://github.com/tony612/grpc-elixir to build a Dgraph client written in Elixir (https://github.com/ospaarmann/exdgraph). grpc-elixir in turn uses gun (
1.0.0-pre.5
).My connection keeps breaking every X seconds with
gun_down
and reason normal.I don't see anything in the server Dgraph server logs. The connection isn't using ssl. Interestingly it happens very regularly every 5 or so seconds so.
I am a bit lost on how to debug this and I have little experience with Erlang, only Elixir so far. My apologies that I cannot provide more information right now but I can deliver whatever helps.
Any hint or help is greatly appreciated! ❤️