nats-io / nats-pure.rb

Ruby client for NATS, the cloud native messaging system.
https://nats.io
Apache License 2.0
131 stars 30 forks source link

Multi-threaded use problems #53

Open jeromegn opened 3 years ago

jeromegn commented 3 years ago

We use nats-pure.rb in a Rails environment, spawning multiple threads (because: puma, for example), and it seems to be wrecking havoc in certain scenarios.

(We've already contacted a maintainer of this project, this issue merely serves as a more public channel of communication in the spirit that others facing the same issues can find solutions)

Even with 0.7.2, we've been having messages going to the wrong callbacks. Essentially, connecting to a NATS cluster may trigger its callback with the wrong response (for example: expecting a PONG, but receiving a messages originally destined as a response to a request, sent from another thread). This causes errors like:

NATS::IO::ConnectError
expected PONG, got MSG _INBOX.lN1A3W2inyFq5CiG7u1sbR.lN1A3W2inyFq5CiG7u1sob 1 30

or

NATS::IO::ServerError
Unknown protocol: <message body destined as a response to a request>

Our temporary solution is to use thread-local variables for memoizing NATS connections like:

Thread.curent[:nats] ||= NATS::IO::Client.new

Unrelated to connection issues, it seems like nats-pure.rb can sometimes stall the application. I've encountered this during an ActiveRecord transaction. I can't be sure what's happening, but none of the DB queries coming from a nats.subscribe block would get through. Listing the various current threads showed something like:

---- thread 0: #<Thread:0x000000015405fac8 sleep>
/Users/jerome/.rubies/ruby-2.7.2/lib/ruby/2.7.0/monitor.rb:108:in `sleep'
/Users/jerome/.rubies/ruby-2.7.2/lib/ruby/2.7.0/monitor.rb:108:in `wait'
/Users/jerome/.rubies/ruby-2.7.2/lib/ruby/2.7.0/monitor.rb:108:in `wait_for_cond'
/Users/jerome/.rubies/ruby-2.7.2/lib/ruby/2.7.0/monitor.rb:108:in `wait'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:484:in `block (2 levels) in request'
---- thread 1: #<Thread:0x00000001350d9ec8 /Users/jerome/.gem/ruby/2.7.2/gems/activerecord-6.1.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:323 sleep>
/Users/jerome/.gem/ruby/2.7.2/gems/activerecord-6.1.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:329:in `sleep'
/Users/jerome/.gem/ruby/2.7.2/gems/activerecord-6.1.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:329:in `block in spawn_thread'
---- thread 2: #<Thread:0x0000000124f47cf0 /Users/jerome/.gem/ruby/2.7.2/gems/prometheus_exporter-0.7.0/lib/prometheus_exporter/server/web_server.rb:107 sleep>
/Users/jerome/.gem/ruby/2.7.2/gems/webrick-1.7.0/lib/webrick/server.rb:173:in `select'
/Users/jerome/.gem/ruby/2.7.2/gems/webrick-1.7.0/lib/webrick/server.rb:173:in `block in start'
/Users/jerome/.gem/ruby/2.7.2/gems/webrick-1.7.0/lib/webrick/server.rb:32:in `start'
/Users/jerome/.gem/ruby/2.7.2/gems/webrick-1.7.0/lib/webrick/server.rb:160:in `start'
/Users/jerome/.gem/ruby/2.7.2/gems/prometheus_exporter-0.7.0/lib/prometheus_exporter/server/web_server.rb:109:in `block in start'
---- thread 3: #<Thread:0x000000013702c550 /Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:28 sleep_forever>
/Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:30:in `pop'
/Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:30:in `block (2 levels) in start'
---- thread 4: #<Thread:0x000000013702c438 /Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:28 sleep_forever>
/Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:30:in `pop'
/Users/jerome/.gem/ruby/2.7.2/gems/minitest-5.14.4/lib/minitest/parallel.rb:30:in `block (2 levels) in start'
---- thread 5: #<Thread:0x00000001371a0378 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1368 sleep>
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1591:in `select'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1591:in `rescue in read'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1588:in `read'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1059:in `block in read_loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1048:in `loop'
---- thread 6: #<Thread:0x00000001371a0288 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1372 sleep_forever>
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1077:in `pop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1077:in `block in flusher_loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1075:in `loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1075:in `flusher_loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1372:in `block in start_threads!'
---- thread 7: #<Thread:0x00000001371a01c0 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1376 sleep>
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1112:in `sleep'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1112:in `block in ping_interval_loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1111:in `loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1111:in `ping_interval_loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1376:in `block in start_threads!'
---- thread 8: #<Thread:0x000000013532fc18 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:419 run>
/Users/jerome/src/github.com/superfly/web/test/graphql/mutations/launch_machine_test.rb:39:in `backtrace'
/Users/jerome/src/github.com/superfly/web/test/graphql/mutations/launch_machine_test.rb:39:in `block (3 levels) in <class:LaunchMachineTest>'
/Users/jerome/src/github.com/superfly/web/test/graphql/mutations/launch_machine_test.rb:37:in `each'
/Users/jerome/src/github.com/superfly/web/test/graphql/mutations/launch_machine_test.rb:37:in `each_with_index'
/Users/jerome/src/github.com/superfly/web/test/graphql/mutations/launch_machine_test.rb:37:in `block (2 levels) in <class:LaunchMachineTest>'
---- thread 9: #<Thread:0x000000013532c428 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:419 sleep_forever>
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:421:in `pop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:421:in `block (2 levels) in subscribe'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:420:in `loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:420:in `block in subscribe'
---- thread 10: #<Thread:0x00000001261c4b80 /Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1394 sleep_forever>
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1396:in `pop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1396:in `block (2 levels) in start_resp_mux_sub!'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1395:in `loop'
/Users/jerome/.gem/ruby/2.7.2/bundler/gems/nats-pure.rb-f04628e4d536/lib/nats/io/client.rb:1395:in `block in start_resp_mux_sub!'

I'm not savvy enough about nats-pure.rb to do much about this, unfortunately.

jsierles commented 2 years ago

Hey, just checking if anyone has looked into this yet. Perhaps it was addressed in the 2.0 release?

wallyqs commented 2 years ago

@jsierles There were some fixes in the v2.0.0 release but I haven't been able to validate whether this is still happening or not, a test setup that reproduces the issue would help...

jsierles commented 2 years ago

We'll give it a spin!

Envek commented 1 year ago

it seems like nats-pure.rb can sometimes stall the application. I've encountered this during an ActiveRecord transaction. I can't be sure what's happening, but none of the DB queries coming from a nats.subscribe block would get through.

Well, that's complex thing. ActiveRecord implicitly checkout a database connection from its connection pool for every thread that tries to execute a query (so every thread is given its own connection). This means that many subscriptions will consume many database connections: right now nats-pure creates 1 thread for every subscription (yes, subscription callback is executed in different thread from application one).

But in my understanding it shouldn't stall the app: there should be a lot of ActiveRecord::ConnectionTimeoutError if connection pool capacity is significantly smaller than number of requesting threads (both puma/sidekiq and nats subscriptions ones). Performance will be degraded too.

Probably a database connection taken by a subscription callback isn't checked back into connection pool and thus it is leaked. So, maybe, we will need to automagically integrate with Rails Executor to return back all resources, that were implicitly taken during subscription callback execution.

But yeah, unrelated to connection issues