ostinelli / net-http2

NetHttp2 is an HTTP/2 client for Ruby.
MIT License
140 stars 31 forks source link

socket loop thread abort_on_exception (was: Thread Safety) #4

Closed ostinelli closed 7 years ago

ostinelli commented 8 years ago

It appears that NetHttp2 might not be thread-safe.

To try it out, run this concurrent test multiple times, you will probably encounter the error HTTP2::Error::HandshakeError raised inside of the dummy server.

Full failure looks like:

Failures:

  1) Sending sync requests sends multiple GET requests concurrently
     Got 0 failures and 2 other errors:

     1.1) Failure/Error: conn << data

          HTTP2::Error::HandshakeError:
            HTTP2::Error::HandshakeError
          # /Users/roberto/.rvm/gems/ruby-2.3.0@net-http2/gems/http-2-0.8.1/lib/http/2/connection.rb:175:in `receive'
          # ./spec/support/dummy_server.rb:107:in `handle'
          # ./spec/support/dummy_server.rb:48:in `block (3 levels) in listen'

     1.2) Failure/Error: conn << data

          HTTP2::Error::HandshakeError:
            HTTP2::Error::HandshakeError
          # /Users/roberto/.rvm/gems/ruby-2.3.0@net-http2/gems/http-2-0.8.1/lib/http/2/connection.rb:175:in `receive'
          # ./spec/support/dummy_server.rb:107:in `handle'
          # ./spec/support/dummy_server.rb:48:in `block (3 levels) in listen'

Finished in 0.00471 seconds (files took 0.09018 seconds to load)
1 example, 1 failure

Investigation is needed, to see if this is an issue on the dummy server (who cares then) or on the client.

ostinelli commented 8 years ago

I found why it wasn't thread safe.

The first call to a server will start the underlying socket thread, responsible to listen on a socket. However, due to concurrency, two socket threads were actually started.

Solved by adding a mutex on thread creation.

phoet commented 7 years ago

i believe there are more thread safety issues. unfortunately it's not so simple to reproduce them...

if you look into the current implementation of ensure_open in the client:

    def ensure_open
      @mutex.synchronize do

        return if @socket_thread

        main_thread = Thread.current
        @socket     = new_socket

        @socket_thread = Thread.new do
          begin
            socket_loop

          rescue EOFError
            # socket closed
            init_vars
            main_thread.raise SocketError.new 'Socket was remotely closed'

          rescue Exception => e
            # error on socket
            init_vars
            main_thread.raise e
          end
        end
      end
    end

this method creates a new watch-dog-thread to check if the connection is alive. when something fails an error is raised on the main thread.

as far as i can see this can result in the following scenario:

main thread
   ||
   || processor A
   || create client -> watch-dog-thread
   || client.call              ||
   ||                          ||
   || processor B              ||
   || do something different   ||
   || do something different   ||
   || do something different   ||
   || do something different   ||
   || <---------raise -------- || CONNECTION-TIMEOUT
   XX

in this scenario the main thread, or say the application will exit with an unhandled timeout error.

i would argue that main_thread.raise should not be used here.

ostinelli commented 7 years ago

Yes, this is expected behavior. It is up to the main application to isolate errors if needed, but NetHttp2 will raise them when encountered.

phoet commented 7 years ago

@ostinelli well, isolation is the actual problem here. how would you go about isolation? processor B has no knowledge of any watch-dog-thread and just dies when the background thread raises an error.

ostinelli commented 7 years ago

This is in the nature of bi-directional async sockets though. How would you go about it?

BTW main usage of this gem is https://github.com/ostinelli/apnotic, example usage can be seen there.

phoet commented 7 years ago

yeah, and my comment is referring to this issue.

it might be more reasonable to use set abort_on_exception via a configuration option per thread or even just use report_on_exception.

ostinelli commented 7 years ago

yeah, and my comment is referring to this issue.

i've replied in there with how Apnotic has been designed and the solution works. If your concern is related to NetHttp2 then we can talk about it.

it might be more reasonable to use set abort_on_exception via a configuration option per thread or even just use report_on_exception.

It is not enough to just make it configurable. If an error happens in the socket thread and it is not raised in the main one, then the library needs to automatically act upon the error. This raises another series of questions discussed elsewhere (i.e. here), for instance what do we do with in-the-midst operations (should they be retried? but what if the server actually got them and they are non-idempotent operations? etc).

phoet commented 7 years ago

i understand that the issue is not trivial. all i want is a way to make sure that the background thread is not allowed to crash the application randomly. if you can think of another way...

ostinelli commented 7 years ago

I understand your concern, that's why I've reopened this issue. If you have suggestions please let me know.

phoet commented 7 years ago

just came across this blog post https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/

ostinelli commented 7 years ago

Thank you @phoet, I'm pretty familiar with this. The idea though is how to enable async operations.

I've come to the conclusion that the best option might be to refactor to have an :on_error callback (or similar) which will be called with the encountered error, then it will be up to the programmer to decide what to do (raise error, retry, etc). I have still to find time to do the refactoring though.

adamcooke commented 7 years ago

I think something like the error callback would be good. If no callback is defined, simply fall back to raising the exception on the main thread (backwards compatibility and all that).

client = NetHttp2::Client.new("http://nghttp2.org")
client.on(:error) do |exception|
  # Do something
end

Is there a huge amount of refactoring involved here? I don't see any other callback architecture on the Client class so something similar to the on/emit on Client. I'll do a quick POC PR.

adamcooke commented 7 years ago

Something like this, perhaps?

https://github.com/adamcooke/net-http2/commit/bf3738a204fbe42031084d20c593d385cffdf3ad

phoet commented 7 years ago

i'd do a minor release and not do a backwards compatible change. the code could issue a warning when no callback is given. imo thread raise should be gone

ostinelli commented 7 years ago

@phoet this is a 0.x version. I won't care about backwards compatibility, that's what 0.x stands for. If I can easily make it, I will.

@adamcooke will look at your suggestion asap. Thank you.

ostinelli commented 7 years ago

@adamcooke I saw your suggestion, couple of issues there:

Any suggestions welcome.

adamcooke commented 7 years ago

You don't know the request with the current situation either, though, do you? It's entirely possible that no request caused the error too. If the socket disconnects (my actual case) when no request is in progress.

ostinelli commented 7 years ago

Indeed. Will have something hopefully by tomorrow or next week.

ostinelli commented 7 years ago

@adamcooke and @phoet can you try out the branch https://github.com/ostinelli/net-http2/tree/error_callback and see if that covers your needs?

adamcooke commented 7 years ago

This looks good to me and solves my issue (although I will need to a way to access the client object on an Apnotic Connection instance but I can monkey patch in).

ostinelli commented 7 years ago

No worries I will add this callback on Apnotic too. But if you can try it out (even with a monkey patch for now) to see if this is ok for you before I merge it in, it would be great.

adamcooke commented 7 years ago

All looked good in development. I'm now running this live and will report back in a few hours 👍

adamcooke commented 7 years ago

All running fine for me now. I'd say it was good to merge.

ostinelli commented 7 years ago

Merged and published 0.15.0. Tonight I will add the Apnotic part of it. Thank you for your help.

augustosamame commented 7 years ago

Hello guys. I think I am being bit by this issue too. My current setup is apnotic 1.0.1 and net_http2 0.14.1 My problem is that after a few hours of sidekiq and apnotic working fine I get this error in the sidekiq logs:

Socket was remotely closed
/home/deploy/rails_app/shared/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:104:in `rescue in block (2 levels) in ensure_open'
/home/deploy/rails_app/shared/bundle/ruby/2.3.0/gems/net-http2-0.14.1/lib/net-http2/client.rb:98:in `block (2 levels) in ensure_open'

and my sidekiq process comes crashing down, all threads at a time. It needs to be restarted manually to work again.

Your discussion seems to be something of a similar issue.

I would appreciate it if you can confirm that this is fixed by upgrading to net-http2 0.15.0 version or if I have to add some extra code (error callback?). You mention adding the "Apnotic part of it" too.

ostinelli commented 7 years ago

Hello @augustosamame, you are commenting on a resolved issue asking if it would resolve your issue. I'd recommend that you follow the readme instructions and report if the latest version does not solve what you are experiencing.

phoet commented 7 years ago

@augustosamame i think you can simply upgrade, the additional errorhandling in the callback is optional. on top of that there is not much that you can do to recover from such errors except for maybe logging them.