Closed silentshade closed 8 years ago
@nesquena Should we follow Sidekiq's lead here and remove Timeout usage from the library (leaving timing out of work up to the job itself)? Beanstalkd will still farm work out after the ttr expires.
I think this might be related to the NOT_FOUND issue: since we use the same timeout as task.ttr, there's a fraction of a second where the ttr expires on beanstalkd's side and another worker possibly reserves that task; the former worker then tries to delete the task after the :around_perform hook, which finally results in NOT_FOUND because only the client holding the reservation can delete the now-reserved job.
One hack might be to set our timeout to task.ttr - 1 in job.rb:49.
Thoughts?
@contentfree we used to have it as task.ttr - 1; can we try going back to that?
I see that it could make sense to remove Timeout altogether; if we still see this behavior after reverting back to - 1, then we should re-discuss.
I think the -1 was removed so that a timeout of zero could be used to mean "never timeout". So we'll just need to put a condition on it.
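A minimal sketch of that condition (the helper name here is hypothetical, not Backburner's actual code): a ttr of 0 keeps meaning "never time out", while any positive ttr becomes ttr - 1 so our Timeout fires before beanstalkd's ttr expires and another worker can reserve the job.

```ruby
# Hypothetical helper: derive the Timeout value from a task's ttr.
def timeout_for(ttr)
  return nil if ttr.to_i <= 0   # 0 => no timeout (Timeout treats nil as "never")
  [ttr.to_i - 1, 1].max         # clamp so a ttr of 1 doesn't collapse to "never"
end

timeout_for(0)   # => nil (never time out)
timeout_for(30)  # => 29
timeout_for(1)   # => 1
```

The clamp on the last line is why a ttr of 1 still ends up racing beanstalkd, which is what the next comment tries to address from the enqueue side.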
@nesquena: In addition to a) continuing to handle a timeout of 0 and b) reverting to ttr - 1, should we also change resolve_respond_timeout to add one to the ttr when enqueuing, so that we don't have to worry about things like a ttr of 1 leading to no timeout (since subtracting one would result in the job never timing out)?
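The enqueue-side change being floated could look something like this. resolve_respond_timeout is the method name from the thread, but this body is an assumption, not the library's actual implementation:

```ruby
# Hypothetical sketch: pad the ttr handed to beanstalkd by one second at
# enqueue time, so the worker's "ttr - 1" timeout equals the job's
# configured timeout and a configured timeout of 1 can't collapse into
# "never time out" after the subtraction.
def resolve_respond_timeout(job_timeout)
  return 0 if job_timeout.to_i <= 0  # keep 0 as the "no timeout" sentinel
  job_timeout.to_i + 1               # beanstalkd ttr = configured timeout + 1
end

resolve_respond_timeout(30)  # => 31
resolve_respond_timeout(0)   # => 0
```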
It gets messy… I'm liking the removal of the Timeout usage altogether and leaving it up to the job to do timeouts. However, that brings up its own set of difficulties (such as retrying with hooks and delay upon job failure / timeout; async proxies; automagic async)...
Been working on removing the static connection (towards an instance connection) for a couple of days and have hit a brick wall. I've done a bunch of the work and am trying to work through the practical tests for the ThreadsOnForkWorker. However, things hang when trying to connect the TCPSocket… the weird thing is that if I add a sleep 0.1 at the beginning of the test, then everything works… @nesquena, got some time to bounce some ideas around? I can push a branch up to my fork. Non-trivial changes, of course.
@nesquena – Looks like I'm running into issues similar to https://github.com/redis/redis-rb/issues/436. After forking, some mutex somewhere deep in the bowels of Ruby or DNS (or …?) hasn't been released. So, without an artificial and arbitrary pause – the sleep 0.1 I found to "fix" the issue – TCPSocket.new deadlocks. Another "fix" I found was to use 127.0.0.1 instead of localhost, as it bypasses DNS.
I'm now checking whether the deadlock remains in Ruby 2.2 (it exists in Ruby 2.1.4). If so, it's off to find the simplest reproduction and file a bug against Ruby. We might also consider using resolv-replace from the standard library.
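For reference, pulling in resolv-replace is a one-line change: it monkey-patches TCPSocket (and friends) so hostname lookups go through the pure-Ruby Resolv library instead of the C resolver. That's one way to sidestep a lock inherited across fork; whether it cures this particular deadlock is an open question, so treat this as a sketch of the idea:

```ruby
require 'socket'
require 'resolv-replace'  # patches TCPSocket et al. to resolve via pure-Ruby Resolv

# Hostname lookups now use Resolv (which consults /etc/hosts before DNS),
# so connecting by name no longer touches the C resolver:
#   TCPSocket.new('localhost', 11300)  # 11300 = beanstalkd's default port
```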
(I realize these notes aren't exactly related to this issue…)
Same deadlock exists in Ruby 2.2, unfortunately.
Thanks for the updates so far. It's frustrating when the best solution is adding sleep 0.1. Odd that the hostname being 127.0.0.1 makes it work as well. Do you have an idea of how you're hoping to move forward? Unfortunately, I don't have any additional visibility into that behavior.
@nesquena The 127.0.0.1 "fix" isn't really a fix, of course; it just sidesteps the problem by not requiring the DNS lookup. Obviously, using sleep 0.1 in the tests isn't really a fix either. I'm not sure what's actually going on yet (besides it deadlocking…). Here's some info I've collected: https://gist.github.com/contentfree/a20d5dd63cacb8588555. I still haven't committed my changes and pushed them up to my fork… I'll try to do that soon.
Nice, thanks for patching this.
@silentshade Interested if this is fixed by my commit? (Disregard the other deadlock stuff)
@nesquena I've pushed my branch for the instance-var connection and I'd like your eyes on it. All tests pass (including several new ones). It also has support for on_reconnect hooks for workers.
I'm slightly reconsidering whether I absolutely had to get rid of the static connection, though the instance connection should be more resilient to disconnects/reconnects overall. (It seems like it's better in a theoretical sense, at least.)
@contentfree I've recently moved to another project, which is not using backburner (yet), so I can't really comment on this now. Regarding Timeout, the problem is that it can hit the main thread in the middle of some rescue or ensure block, leaving resources in an inconsistent state.
We've run into a big problem using ThreadsOnFork with the job execution timeout. Timeout::timeout is unsafe in a multithreaded environment. Good explanation here: http://blog.headius.com/2008/02/ruby-threadraise-threadkill-timeoutrb.html. Sidekiq had a similar issue, and they removed timeout support completely: https://github.com/mperham/sidekiq/issues/862#event-245331424

I tried to locate the issue myself.

Environment: I made a job class with a timeout of 30 seconds and code that runs from 1 to 3 minutes. The code includes some db access, logging, throwing and rescuing exceptions, setting and getting thread variables, etc. The tube workers (threads) are limited to 30.

On ruby 1.9.3: Untested yet, but I'm thinking it will be the same as 2.0.0.

On ruby 2.0.0: Likely because the timeout hits some rescue block somewhere inside the code, the threads (workers) die very fast. In about one minute the worker count decreased from 30 to 1. As I get the following warnings in the console: log writing failed. execution expired, I suppose that the timeout hits some rescue block within the system Logger#write, which is used heavily in the code. There's another bug with this method, explained here: https://bugs.ruby-lang.org/issues/9115. Comments there are kind of unclear, but it seems that in ruby 2.1 Timeout::Error is not rescued by default unless explicitly given as a rescue argument.

On ruby 2.1.3: Logger#write is not a problem anymore, and threads live much longer. However, they still die after some time; the timeout is hitting them somewhere in a rescue block. Within half an hour the worker count decreased from 30 to 1.

We need to consider another flow to time out code execution and clean up within the threaded worker. Any thoughts? Inviting @nesquena :)
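One cooperative alternative, sketched under the assumption that letting beanstalkd's ttr handle the re-queue is acceptable: run the job in its own thread and wait with Thread#join instead of raising into it (the asynchronous Thread#raise mid-rescue/ensure is exactly what makes Timeout unsafe). The names below are illustrative, not Backburner's API:

```ruby
# Wait up to `seconds` for the block to finish; on expiry, stop waiting
# and report a timeout instead of killing the thread. The job thread is
# left to finish (or fail) on its own, and beanstalkd re-queues the task
# for another worker once the ttr lapses.
def run_with_deadline(seconds)
  worker = Thread.new { yield }
  if worker.join(seconds)   # returns the thread if it finished in time, nil otherwise
    :completed
  else
    :timed_out              # thread keeps running; we never Thread#raise / #kill it
  end
end

run_with_deadline(1)   { sleep 0.05 }  # => :completed
run_with_deadline(0.1) { sleep 5 }     # => :timed_out
```

The trade-off is that a runaway job keeps holding its thread (and any resources) until it finishes naturally, so the worker pool still needs an upper bound on concurrently abandoned jobs.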