ostinelli / apnotic

A Ruby APNs HTTP/2 gem able to provide instant feedback.
MIT License
479 stars 94 forks source link

push_async executing forever #75

Closed brlo closed 3 years ago

brlo commented 6 years ago

Hello. I use Apnotic push_async in Sidekiq. The fix from this issue #68 resolve problem when Sidekiq is crushed because of exception in main thread. But from time to time one of my 10 Sidekiq workers is stuck forever with job where we need to send a push async. I've explored this problem and found what if SocketError happened, for example, here in socket_loop https://github.com/ostinelli/net-http2/blob/master/lib/net-http2/client.rb#L142 (to reproduce just raise it here) next try of push_async will stuck here https://github.com/ostinelli/apnotic/blob/master/lib/apnotic/connection.rb#L83 streams_available? will always be false because SocketError resetting @client.remote_settings to default value from http-2 gem https://github.com/igrigorik/http-2/blob/master/lib/http/2/connection.rb#L11-L19 and remote_max_concurrent_streams always return zero because of that condition https://github.com/ostinelli/apnotic/blob/master/lib/apnotic/connection.rb#L94-L98.

To find this bug I've used monkey patch:

module Apnotic
  class Connection
    private
    def delayed_push_async(push)
      i = 1
      until streams_available? do
        if i < 5000
          sleep 0.001
          i+=1
        else
          raise StandardError.new("Timeout 5s on Apnotic::Connection#delayed_push_async: #{ @client.remote_settings[:settings_max_concurrent_streams] }/#{ @client.stream_count }")
        end
      end
      @client.call_async(push.http2_request)
    end
  end
end

And, as workaround - this monkey patch:

module Apnotic
  class Connection
    private
    def remote_max_concurrent_streams
      # 0x7fffffff is the default value from http-2 gem (2^31)
      if @client.remote_settings[:settings_max_concurrent_streams] == 0x7fffffff
        1
      else
        @client.remote_settings[:settings_max_concurrent_streams]
      end
    end
  end
end

So, I just changed 0 to 1. But actually I didn't understand why we use 0 if we have default value from http-2 gem. If it means some connection troubles and because of that we say that we can't use any concurrent connection, so maybe better raise some kind of exception.

Also, important to know, what another Sidekiq workers continue to successfully use push_async until next SocketError happened in another worker and it also will stuck. If I apply monkey patch, which break loop after 5 seconds, the next push_async in same worker will also trapped in this loop. So, only that worker, which raise SocketError goes into degraded state.

useful comment: https://github.com/ostinelli/apnotic/issues/64#issuecomment-406113863

ostinelli commented 5 years ago

Please see if latest 1.6.0 solves your issue.

wogg commented 5 years ago

We appear to be having this exact issue as well, with this combination (yes, 1.6.0):

   apnotic (1.6.0)
      connection_pool (~> 2)
      net-http2 (>= 0.18, < 2)

Correction: on further investigation, it seems our batch sometimes gets stuck waiting forever on .join of the items that have been pushed by push_async:

.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `sleep'
.../vendor/ruby/2.3.0/gems/net-http2-0.18.2/lib/net-http2/client.rb:59:in `join'
.../vendor/ruby/2.3.0/gems/apnotic-1.6.0/lib/apnotic/connection.rb:76:in `join'

are where the processes are when I send TERM I'll move this to a separate topic... sorry for noise,

brlo commented 5 years ago

Please see if latest 1.6.0 solves your issue.

Thank you. My issue was resolved in 1.6.0. Tested in production about 2 weeks.

benubois commented 3 years ago

Please open a new issue if you're still seeing this on 1.6.1.