Handling connection errors

dbrgn commented 6 years ago

In the README you write:

The connection is meant to kept up when having constant traffic and should stop Apple's DDOS blocks. Sometimes one might experience TimeoutErrors or ConnectionErrors, so keeping track of connections and when having errors, restarting them is a good idea.

Is there a way to get notified about connection errors (e.g. when the HTTP2 connection gets severed) without sending a push?

Is there another way of restarting the connection than re-creating a Client?

Would be great if these things could be explained in the README.

pimeys commented 6 years ago

It would be nice to think some other ways.

How I see it, the connection error handling should happen in hyper's connection pool. And for now, the connections have been super stable, occasionally getting a TimeoutError or ConnectionError. We've been testing and monitoring to see if any problems happen, and how we've solved this was just monitoring the results and letting RAII to drop the problematic connection if errors are frequent enough, then creating a new one.

Again, I need to rethink the README now when we have more ideas how the library works in real life.

pimeys commented 6 years ago

So to answer all of your questions in one place, many of these gotchas were because we were not really sure how this all will work out with the experimental hyper and we did big precautions to not break things to much, and I was taking a big risk to allow testing the new library on production.

It all worked out quite well. I still don't really know what are those ConnectionErrors, and TimeoutErrors are quite random. Apple says if you get zero bytes back as a response, you should form a new connection, retry the notification and all notifications sent after the one having a zero-byte response. These might be deployments or just plain errors on Apple's side.

Here we don't go so low level. How I understand it, when we start getting ConnectionErrors for notifications, we should reopen a new connection. Right now the easiest choice was to just drop the notifier thread and create a new one, having a connection reset might work and might be a cleaner solution.

Just noting, that this kind of errors were more common when we first went to production, but after certain fixes for hyper and h2, I don't see them so much anymore.

Summarizing this that in the end, you should monitor the responses and you should have some kind of retry strategy for certain errors. If you re-use the same apns-id, you don't need to worry about duplicates anyways.

pimeys commented 6 years ago

Oh, one more thing:

These errors are a problem when you have a decent amount of traffic through your connection. We have a keep-alive timeout of 10 minutes in Hyper, so a stale connection with no traffic will get dropped and a new one opened for the next request.

dbrgn commented 6 years ago

Yes, it seems that stale connections will be automatically re-established:

DEBUG 2018-04-05T12:56:35Z: hyper::client: client connection error: protocol error: not a result of an error
DEBUG 2018-04-05T12:56:35Z: hyper::client::dns: resolving host="api.development.push.apple.com", port=443
DEBUG 2018-04-05T12:56:35Z: hyper::client::connect: connecting to 17.188.138.73:443
TRACE 2018-04-05T12:56:35Z: apns2::alpn: AlpnConnector::call got TCP, trying TLS
DEBUG 2018-04-05T12:56:36Z: hyper::client::pool: pooling idle connection for ("https://api.development.push.apple.com", Http2)

However, when I kill the established connection with tcpkill, the first push fails:

DEBUG 2018-04-05T13:01:51Z: hyper::client::pool: reuse idle connection for "https://api.development.push.apple.com"
DEBUG 2018-04-05T13:01:51Z: hyper::client: client connection error: connection reset
TRACE 2018-04-05T13:01:51Z: apns2::client: Request error: an operation was canceled internally before starting
 WARN 2018-04-05T13:01:51Z: push_relay::server: Error: Push message could not be processed: Push was unsuccessful: Error connecting to APNs
DEBUG 2018-04-05T13:01:51Z: hyper::proto::h1::io: flushed 155 bytes

However, the next push works again:

DEBUG 2018-04-05T13:04:01Z: hyper::client::dns: resolving host="api.development.push.apple.com", port=443
DEBUG 2018-04-05T13:04:01Z: hyper::client::connect: connecting to 17.188.138.73:443
TRACE 2018-04-05T13:04:02Z: apns2::alpn: AlpnConnector::call got TCP, trying TLS
DEBUG 2018-04-05T13:04:02Z: hyper::client::pool: pooling idle connection for ("https://api.development.push.apple.com", Http2)

So the only case when the user needs to reestablish a connection is when the connection is still up but the apns2 client instance returns a ConnectionError or TimeoutError, right?

pimeys commented 6 years ago

I'd say so, yes. You get those time to time, and I hold a strategy now to track about 5 timeout errors per minute to kick a thread out from the supervisor and create a new one. For connection errors it goes out from the first error. Those are quite rare now, so I haven't had an urge to change those.

In the end, monitoring connections on this level is something that shouldn't be done, and I'd let hyper to deal with bigger problems. You don't really want to write all that boilerplate to your consumer/service/whatever.

dbrgn commented 6 years ago

Ok, thanks for clarification :)

reown-com / a2

Handling connection errors #14