How to catch Errno::ETIMEDOUT?

mat commented 8 years ago

I'm a bit at a loss: we've been running into what I consider a thread-safety/isolation issue. We have an AMQP queue-based architecture for sending out pushes (to APNs and other push systems), and we are seeing a lot of these errors:

Errno::ETIMEDOUT(Connection timed out):
  /opt/ruby-2.3.1/lib/ruby/2.3.0/openssl/buffering.rb:178:in `sysread_nonblock'
  /opt/ruby-2.3.1/lib/ruby/2.3.0/openssl/buffering.rb:178:in `read_nonblock'
  net-http2 (0.14.0) lib/net-http2/client.rb:116:in `block in socket_loop'
  net-http2 (0.14.0) lib/net-http2/client.rb:113:in `loop'
  net-http2 (0.14.0) lib/net-http2/client.rb:113:in `socket_loop'
  net-http2 (0.14.0) lib/net-http2/client.rb:93:in `block (2 levels) in ensure_open'

(this is the full trace)

The problem is that this error surfaces in unrelated processors, such as the GCMProcessor, and makes them fail hard. To give you a very rough idea, we are effectively using one connection per processor:

class Http2PushConnection
  @connections = {}

  def self.send_push(notification)
    response = nil
    Retry.retry_on_exception(max_retries: 2, wait_s: 0.2) do
      get_connection(notification.notification_env) do |connection|
        response = connection.push(notification, timeout: 2 * CONNECT_TIMEOUT)
      end
    end
    response
  end

  def self.get_connection(env)
    @connections[env] = establish_connection(env) unless @connections.key?(env)

    conn = @connections[env]
    yield(conn)
    if close_connection?(env)
      conn.close
      @connections.delete(env)
    end
  end
end

class ApnsHttp2Processor < AmqpProcessor
  def process
    notification = build_notification(...)
    Http2PushConnection.send_push(notification)
  end
end

class GCMProcessor < AmqpProcessor
  def process
    # Send push to Google's GCM service, for example
  end
end

I don't know how to handle this one in the application layer.

Catching it: not possible as far as I can see
Retrying: not possible because some of our processors are not (and cannot be fully) idempotent

Any ideas? Any input is welcome.
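
To illustrate why a `rescue` around the push call does not help, here is a minimal sketch using only Ruby's standard library. It assumes that the socket loop runs in its own thread and that the error is re-raised into the thread that opened the connection. The trace above hints at this, since it contains only net-http2 frames, but it is an assumption about the internals. `FakeClient` and `run_demo` are hypothetical names, not part of net-http2 or Apnotic.

```ruby
# Hypothetical stand-in for a client whose socket loop runs in a background
# thread. This is NOT net-http2's code; it only mimics the assumed behaviour.
class FakeClient
  def initialize
    @owner = Thread.current # the thread that created the connection
  end

  def push
    # Start the background "socket loop" once and return immediately.
    @reader ||= Thread.new do
      sleep 0.05                     # the timeout happens some time later
      @owner.raise(Errno::ETIMEDOUT) # surfaces wherever the owner is right now
    end
    :ok
  end
end

def run_demo
  events = []
  worker = Thread.new do
    client = FakeClient.new
    begin
      client.push
      events << :push_ok
    rescue Errno::ETIMEDOUT
      events << :rescued_at_call_site # not reached: push already returned
    end

    begin
      sleep 1 # stands in for unrelated work, e.g. another processor
      events << :unrelated_work_done
    rescue Errno::ETIMEDOUT
      events << :unrelated_work_interrupted
    end
  end
  worker.join
  events
end

puts run_demo.inspect
```

In this sketch `run_demo` should return `[:push_ok, :unrelated_work_interrupted]`: the push call itself succeeds, and the timeout later interrupts whatever the owning thread is doing at that moment, which matches the symptom of unrelated processors failing.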

ostinelli commented 7 years ago

Hello @mat, as per the README:

In case that errors are encountered, Apnotic will raise the error and repair the underlying connection, but it will not retry the requests that have failed. This is by design, so that the job manager (Sidekiq, Resque,...) can retry the job that failed. For this reason, it is recommended to use a queue engine that will retry unsuccessful pushes.

Does this clear up your question?

ostinelli commented 7 years ago

Hello @mat, a similar request (https://github.com/ostinelli/apnotic/issues/45) has been opened.

I've provided a possible fix (discussed in https://github.com/ostinelli/net-http2/issues/4). Can you try out the branch https://github.com/ostinelli/net-http2/tree/error_callback and see if that covers your needs?

ostinelli commented 7 years ago

Released Apnotic 1.1.0 to cover this. Thank you for your feedback.
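
For readers wondering what an error callback changes, here is a minimal sketch of the general pattern, using only Ruby's standard library. It is not Apnotic's or net-http2's implementation: `CallbackClient`, `start`, `wait` and `demo_error_callback` are hypothetical names for illustration, and the real callback registration should be taken from the Apnotic README. The idea is that the background thread hands the exception to a registered block instead of raising it into another thread.

```ruby
require "thread" # Queue; already loaded by default on recent Rubies

# Hypothetical client: names and structure are illustrative only.
class CallbackClient
  def initialize
    @callbacks = {}
  end

  # Register a block for an event, e.g. :error.
  def on(event, &block)
    @callbacks[event] = block
    self
  end

  # Simulates a socket loop that times out in a background thread.
  def start
    @reader = Thread.new do
      begin
        raise Errno::ETIMEDOUT
      rescue SystemCallError => e
        handler = @callbacks[:error]
        raise unless handler # no callback: the thread dies with the error
        handler.call(e)      # callback: the application decides what to do
      end
    end
    self
  end

  # Block until the background thread has finished.
  def wait
    @reader.join
    self
  end
end

def demo_error_callback
  errors = Queue.new # thread-safe hand-off from the reader thread
  client = CallbackClient.new
  client.on(:error) { |e| errors << e }
  client.start.wait

  collected = []
  collected << errors.pop until errors.empty?
  collected
end

puts demo_error_callback.map(&:class).inspect
```

With a callback in place, the application code can log the error, drop the connection, or schedule a retry for the affected push only, and unrelated processors are left alone.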

mat commented 7 years ago

Thanks @ostinelli and everyone involved in finding a solution for this. It should make it possible to fix our production setup 😃

ostinelli / apnotic

How to catch Errno::ETIMEDOUT? #38