nesquena / backburner

Simple and reliable beanstalkd job queue for ruby
http://nesquena.github.com/backburner
MIT License

ThreadsOnFork multithreaded worker Timeout problem #86

Closed silentshade closed 8 years ago

silentshade commented 9 years ago

We've run into a big problem using ThreadsOnFork with the job execution timeout. Timeout::timeout is unsafe in a multithreaded environment; a good explanation is here: http://blog.headius.com/2008/02/ruby-threadraise-threadkill-timeoutrb.html. Sidekiq had a similar issue and removed timeout support completely: https://github.com/mperham/sidekiq/issues/862#event-245331424. I tried to locate the issue myself.

Environment: I've made a job class with a timeout of 30 seconds and code that runs for 1 to 3 minutes. The code includes some db access, logging, throwing and rescuing exceptions, setting and getting thread variables, etc. The tube workers (threads) are limited to 30.
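For reference, the setup looks roughly like this (a sketch only; LongJob and its body are stand-ins, and the 30-thread tube limit is configured separately on the ThreadsOnFork worker):

```ruby
require 'backburner'

# Sketch of the test job: 30-second ttr, but work that takes 1-3 minutes.
class LongJob
  include Backburner::Queue
  queue "long-jobs"
  queue_respond_timeout 30   # beanstalkd ttr, which also drives the Timeout wrapper

  def self.perform(record_id)
    # stands in for the real work: db access, logging, rescued exceptions,
    # thread-local variables, etc., running for 60-180 seconds
    sleep(60 + rand(120))
  end
end
```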

On Ruby 1.9.3: untested yet, but I expect it will behave the same as 2.0.0.

On Ruby 2.0.0: likely because the timeout hits some rescue block somewhere inside the code, the threads (workers) die very fast. In about one minute the worker count decreased from 30 to 1. Since I get warnings like "log writing failed. execution expired" in the console, I suppose the timeout hits a rescue block inside the standard Logger#write, which is used heavily in the code. There is another bug with this method, explained here: https://bugs.ruby-lang.org/issues/9115. The comments there are somewhat unclear, but it seems that in Ruby 2.1 Timeout::Error is no longer rescued by a bare rescue unless it is given explicitly as the rescue argument.
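To illustrate the pre-2.1 behavior (a standalone sketch, not Backburner code): a blanket rescue inside the timed block catches the Timeout::Error that Timeout raises asynchronously, so the work just keeps running past its deadline:

```ruby
require 'timeout'

# On Ruby 1.9.3 / 2.0.0, Timeout::Error is a StandardError and is raised
# inside the block, so a bare rescue (like the one in Logger#write) eats it.
Timeout.timeout(1) do
  begin
    sleep 5                    # the timeout fires here after ~1s...
  rescue => e                  # ...and is swallowed by this blanket rescue
    puts "swallowed: #{e.class}"
    sleep 5                    # work carries on well past the deadline
  end
end
puts "returned without ever raising Timeout::Error"
# On Ruby 2.1 the exception raised inside the block was changed so that a
# bare rescue no longer swallows it (see the ruby-lang issue linked above).
```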

On Ruby 2.1.3: Logger#write is no longer a problem and the threads live much longer. However, they still die after a while; the timeout is hitting them somewhere in a rescue block. Within half an hour the worker count decreased from 30 to 1.

We need to consider another way to time out code execution and clean up within the threaded worker. Any thoughts? Inviting @nesquena :)

contentfree commented 8 years ago

@nesquena Should we follow Sidekiq's lead here and remove Timeout usage from the library (leaving timing out of work up to the job itself)? Beanstalkd will still farm work out after the ttr expires.

I think this might be related to the NOT_FOUND issue: since we use the same timeout as task.ttr, there's a fraction of a second where the ttr expires on beanstalkd's side and another worker possibly reserves that task. The former worker then tries to delete the task after the :around_perform hook, which results in NOT_FOUND because only the client holding the reservation can delete the now-reserved job.

One hack might be to set our timeout to task.ttr - 1 in job.rb:49.
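Something along these lines (a hypothetical sketch of the change, not the exact code at job.rb:49):

```ruby
require 'timeout'

# Hypothetical sketch: time out one second before beanstalkd's ttr so this
# worker still holds the reservation when it deletes the job after the hooks.
def timeout_job_after(ttr)
  Timeout.timeout(ttr > 1 ? ttr - 1 : ttr) { yield }
end

timeout_job_after(30) { sleep 5 }   # runs the block with a 29-second deadline
```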

Thoughts?

nesquena commented 8 years ago

@contentfree we used to have it as task.ttr - 1; can we try going back to that?

nesquena commented 8 years ago

I see that it could make sense to remove Timeout altogether; if we still see this behavior after reverting to - 1 then we should re-discuss.

contentfree commented 8 years ago

I think the -1 was removed so that a timeout of zero could be used to mean "never timeout". So we'll just need to put a condition on it.

contentfree commented 8 years ago

@nesquena: In addition to a) continuing to handle a timeout of 0 and b) reverting to ttr - 1, should we also change resolve_respond_timeout to add one to the ttr when enqueuing, so that we don't have to worry about things like a ttr of 1 leading to no timeout (since subtracting one would result in jobs never timing out)?
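Roughly this kind of arithmetic (a hypothetical sketch; the real resolve_respond_timeout lives in the helpers and takes different arguments):

```ruby
# Enqueue side: pad the ttr by one second so the worker-side "ttr - 1"
# can never collapse a legitimate 1-second timeout down to zero.
def resolve_respond_timeout(ttr)
  ttr = ttr.to_i
  ttr > 0 ? ttr + 1 : ttr      # 0 still means "never time out"
end

# Worker side: only subtract the second back off when a timeout is in play.
def job_timeout_for(ttr)
  ttr > 0 ? ttr - 1 : nil      # nil => don't wrap the job in Timeout at all
end
```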

It gets messy… I'm liking the removal of the Timeout usage altogether and leaving it up to the job to handle timeouts. However, that brings its own set of difficulties (such as retrying with hooks and delay upon job failure / timeout; async proxies; automagic async)...

contentfree commented 8 years ago

I've been working on removing the static connection (moving towards an instance connection) for a couple of days and have hit a brick wall. I've done a bunch of the work and am trying to get through the practical tests for the ThreadsOnForkWorker. However, things hang when trying to connect the TCPSocket… the weird thing is that if I add a sleep 0.1 at the beginning of the test then everything works. @nesquena, got some time to bounce some ideas around? I can push a branch up to my fork. Non-trivial changes, of course.

contentfree commented 8 years ago

@nesquena – Looks like I'm running into issues similar to https://github.com/redis/redis-rb/issues/436. After forking, some mutex somewhere deep in the bowels of Ruby or DNS (or …?) hasn't been released. So, without an artificial and arbitrary pause – the sleep 0.1 I found to "fix" the issue – TCPSocket.new deadlocks. Another "fix" I found was to use 127.0.0.1 instead of localhost as it bypasses DNS.

I'm now checking to see whether the deadlock remains in Ruby 2.2 (it exists in Ruby 2.1.4). If so, it's off to find the simplest recreation and file a bug against Ruby. We might also consider using resolv-replace from std-lib.
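For reference, the std-lib workaround is just a require: resolv-replace monkey-patches TCPSocket (among others) to use the pure-Ruby Resolv resolver instead of the native getaddrinfo, which is where the post-fork lock appears to be held. A minimal sketch, assuming beanstalkd on its default port:

```ruby
require 'resolv-replace'   # swap native getaddrinfo for the pure-Ruby resolver
require 'socket'

# With resolv-replace loaded, this no longer goes through the C-level DNS
# lookup that seems to deadlock after fork on 2.1.4 / 2.2.
socket = TCPSocket.new('localhost', 11300)
socket.close
```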

(I realize these notes aren't exactly related to this issue…)

contentfree commented 8 years ago

Same deadlock exists in Ruby 2.2, unfortunately.

nesquena commented 8 years ago

Thanks for the updates so far. It's frustrating when the best solution is adding sleep 0.1. Odd that using 127.0.0.1 as the hostname makes it work as well. Do you have an idea of how you are hoping to move forward? Unfortunately I don't have any additional visibility into that behavior.

contentfree commented 8 years ago

@nesquena The 127.0.0.1 "fix" isn't really a fix, of course; it just sidesteps the problem by not requiring the DNS lookup. Obviously using sleep 0.1 in the tests isn't really a fix either. I'm not sure what's actually going on yet (besides it deadlocking…). Here's some info I've collected: https://gist.github.com/contentfree/a20d5dd63cacb8588555. I still haven't committed my changes and pushed them up to my fork… I'll try to do that soon.

nesquena commented 8 years ago

Nice, thanks for patching this.

contentfree commented 8 years ago

@silentshade Interested if this is fixed by my commit? (Disregard the other deadlock stuff)

contentfree commented 8 years ago

@nesquena I've pushed my branch for the instance-var connection and I'd like your eyes on it. All tests pass (including several new ones). It also has support for on_reconnect hooks for workers.

I'm slightly reconsidering whether I absolutely had to get rid of the static connection, since the instance connection should be more resilient to disconnects/reconnects overall. (Though it seems like it's better in a theoretical sense.)

silentshade commented 8 years ago

@contentfree I've recently moved to another project, which is not using backburner (yet), so I can't really comment on this now. Regarding Timeout, the problem is that it can hit the main thread in the middle of some rescue or ensure block, leaving resources in an inconsistent state.
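A minimal standalone illustration of that hazard (not Backburner code): the asynchronous timeout can fire while an ensure block is still running, so the cleanup it was doing is abandoned part-way through:

```ruby
require 'timeout'

cleaned_up = false
begin
  Timeout.timeout(1) do
    begin
      sleep 0.5
    ensure
      sleep 2              # the asynchronous timeout lands here, mid-cleanup...
      cleaned_up = true    # ...so this line never runs
    end
  end
rescue Timeout::Error
  puts "timed out; cleaned_up=#{cleaned_up}"   # => "timed out; cleaned_up=false"
end
```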