ostinelli / apnotic

A Ruby APNs HTTP/2 gem able to provide instant feedback.
MIT License

Error handling for broken connections #129

Open Linuus opened 6 months ago

Linuus commented 6 months ago

Hi!

We're trying out your gem to send VoIP notifications from Sidekiq, but we're running into some issues with broken connections.

At first we were raising an error in the connection.on(:error) {} callback, like this:

      Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
        connection.on(:error) do |exception|
          raise(PushNotification::Error, "Production APNs connection error: #{exception}")
        end
      end

That was a really bad idea, since it crashed the whole Sidekiq process and forced it to restart. We fixed this, and now we just report to our error service instead.

      Apnotic::ConnectionPool.new(connection_config, size: 5) do |connection|
        connection.on(:error) do |exception|
          Sentry.capture_exception(exception)
        end
      end

Now, occasionally we get this error reported:

    Errno::ECONNRESET: Connection reset by peer
      from openssl (3.2.0) lib/openssl/buffering.rb:211:in `sysread_nonblock'
      from openssl (3.2.0) lib/openssl/buffering.rb:211:in `read_nonblock'
      from net-http2 (0.18.5) lib/net-http2/client.rb:145:in `block in socket_loop'
      from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `loop'
      from net-http2 (0.18.5) lib/net-http2/client.rb:142:in `socket_loop'
      from net-http2 (0.18.5) lib/net-http2/client.rb:114:in `block (2 levels) in ensure_open'

It's reported in the callback, and then 60 seconds later we get a timeout here:

    connection_pool(ios_voip_push_token).with do |connection|
      response = connection.push(apnotic_notification(notification, ios_voip_push_token))
      raise(TimeoutError) if response.nil?
      [...]
    end

I guess we can pass a shorter timeout to the push method, since 60 seconds seems fairly high.
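A minimal sketch of what a shorter timeout could look like. It assumes that push accepts a timeout: option and returns nil when that timeout elapses, as the gem's README describes. PushTimeout and push_with_timeout are hypothetical names made up for this example, and the 5-second default is only a placeholder.

```ruby
# Hypothetical error class for this sketch; not part of apnotic.
class PushTimeout < StandardError; end

# Pushes one notification and raises if no response arrives in time.
# `connection` is assumed to behave like Apnotic::Connection, whose #push
# is documented to return nil when the timeout elapses.
def push_with_timeout(connection, notification, timeout: 5)
  response = connection.push(notification, timeout: timeout)
  raise PushTimeout, "no APNs response within #{timeout}s" if response.nil?

  response
end
```

Inside the pool block this would be called as push_with_timeout(connection, apnotic_notification(notification, ios_voip_push_token), timeout: 5), so a dead connection is noticed after a few seconds instead of a minute.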

Anyway, once this happened, it started happening a lot: almost all of our pushes hit this connection reset error. Our push jobs are not retried, but I don't think retrying would help either, since the connections don't seem to be "healed".

Could there be an issue where connections are stuck in a broken state? Or are we supposed to handle these errors differently?
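
One possible way to handle this on the caller side, as a rough sketch: treat a nil response as a sign that the connection may be broken, close it, and try again. This assumes that close discards the underlying socket and that the next push opens a fresh one, which is not confirmed here. push_with_reconnect and its attempts: parameter are hypothetical names for this example.

```ruby
# Sketch only: retries a push after closing a connection that timed out.
# `connection` is assumed to behave like Apnotic::Connection: #push returns
# nil on timeout, and #close drops the socket so the next push reconnects.
def push_with_reconnect(connection, notification, timeout: 5, attempts: 2)
  attempts.times do
    response = connection.push(notification, timeout: timeout)
    return response unless response.nil?

    # Assumption: closing discards the (possibly broken) socket.
    connection.close
  end

  nil # every attempt timed out; the caller decides what to do next
end
```

With attempts: 2, a job would wait at most twice the timeout before giving up, and the connection is closed again after the last failed attempt so the next job starts clean.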