Multiple workers on same queue

ghost commented 8 years ago

When I start multiple workers for the same queue, none of them is displayed in the resque console (admin view). I dont have an error message or something, so would be great if you can advise me how I can support you best at this issue.

Thanks!

RochesterinNYC commented 8 years ago

When you pull up the Rails console (are you using Rails or no?) and run Resque.workers, do you see any workers ? Wondering if your workers are actually running and it's a resque-web related problem or if your workers silently fail or something.

Any other details about how you're starting the workers, queue setup, etc. would help as well.

ghost commented 8 years ago

ok, got new insights. I have two workers on different servers. Both are processing something. I can see this from the htop console via pstree -cpa In the terminal. Both are visible now.

However, only one worker is set to working in Resque:

Resque.workers.map { |w| w if w.working? }.compact

The worker, which is not visible as working was started via

RAILS_ENV=production BACKGROUND=yes QUEUE="${w}" rake resque:work

with w as the worker name. The visible worker was started on another machine with

QUEUE="${w}" rake resque:work

I will keep investigating it, but if you have an idea where to look, I would really appreciate it. :)

thank you!

steveklabnik commented 8 years ago

It would be nice to get a reproduction of this issue.

ghost commented 8 years ago

I will probably find some time for this next week. Sorry for the delay

pjammer commented 8 years ago

I can replicate this right now. I was on the last tagged Resque, 1.25.2(which had 6 workers until they either kill -9 themselves or something happened) and using capistrano-resque. I just switched to this 1-x-stable branch and now the resque web console shows the # of workers that Resque.workers has in the rails console but that does not match up with the available pids. i set 3 worker queues, with 2 workers each. There are 6 pids created, but in rails console, Resque.workers shows 4, and only 1 of them matches a pid number in ps aux, yet ps aux has 6 workers running and waiting for my queues.

capistrano resque runs startup command during a deploy (this is .6, but there are 5 more as well)

/usr/bin/env nohup bundle exec rake RAILS_ENV=production QUEUE="low" PIDFILE=/home/deploy/fakeapp/shared/tmp/pids/resque_work_6.pid BACKGROUND=yes VERBOSE=1 INTERVAL=5 environment resque:work >> /home/deploy/fakeapp/shared/log/resque.log 2>> /home/deploy/fakeapp/shared/log/resque.log

rails console

irb(main):001:0> Resque.workers
=> [#<Worker prodworker01:21604:high>, #<Worker prodworker01:21498:critical>, #<Worker prodworker01:21728:low>, #<Worker prodworker01:21458:critical>]

pids

deploy 21458  0.0  1.9 632872 162068 ?       Sl   09:00   0:03 resque-1.25.2: Waiting for critical                                                                                                                                                                                                                          
deploy 21504  0.0  1.9 633088 162068 ?       Sl   09:00   0:02 resque-1.25.2: Waiting for critical                                                                                                                                                                                                                          
deploy 21560  0.0  2.0 633896 163820 ?       Sl   09:00   0:02 resque-1.25.2: Waiting for high                                                                                                                                                                                                                          
deploy 21622  0.0  2.0 633228 164272 ?       Sl   09:00   0:02 resque-1.25.2: Waiting for high                                                                                                                                                                                                                          
deploy 21684  0.0  2.0 633668 165060 ?       Sl   09:00   0:02 resque-1.25.2: Waiting for low                                                                                                                                                                                                                          
deploy 21752  0.0  2.0 633636 165136 ?       Sl   09:00   0:02 resque-1.25.2: Waiting for low

I originally thought that these workers committed pid suicide but I wonder first if this issue could help that one. Also, I have another issue that Resque has never closed a connection to the DB, or is the working theory. but I will write tickets for that, this was just to remind myself.

ghost commented 8 years ago

Thank you! Looks similar to my case. The only difference is that my workers have been started on different machines.

pjammer commented 8 years ago

I've just looked into why Resque.workers would even give the wrong pids. I checked the Redis Namespace gem and that hasn't been updated in 100 years or 3. Meaning that i don't think how we would call the workers changed, and thus was it the initialization of a worker and however all that worked that maybe leads us astray? Looking into this commit has moved the logic from the rake task into the Worker but should it have no args passed, as it once did to Workers.new(...)#()is missing it passes nothing. I wonder if how our Rake task envs are set if that mucks up with this new way of thinking? I could be wrong, but i am seeing a couple of issues with this code move that didn't, at this initial time, look correct. But now i guess we test and see what's up with it and/or keep looking.

worldofchris commented 8 years ago

I have a problem which I think might be related. There appears to be a mismatch between the number of workers Resque reports as working via Resque.info (and via the Web UI) and the number of jobs being worked on when you grep for them in ps.

I'm

Using resque 1.25.2 from git://github.com/resque/resque.git (at 1-x-stable@9c023f8)

Here's what I see from Resque.info:

Loading production environment (Rails 4.2.5)
irb(main):001:0> Resque.info
=> {:pending=>12, :processed=>873733, :queues=>15, :workers=>27, :working=>3, :failed=>31, :servers=>["redis://eu-pro-li-hon-red-01.hemsjm.0001.euw1.cache.amazonaws.com:6379/0"], :environment=>"production"}

Here's what I see from ps:

ps -e -o pid,command | grep [r]esque
 1755 resque-1.25.2: Waiting for copy_ingest_master_to_vantage
 1871 resque-1.25.2: Forked 18754 at 1453301118
 1881 resque-1.25.2: Forked 18620 at 1453300883
 1921 resque-1.25.2: Forked 18630 at 1453300891
 1955 resque-1.25.2: Forked 18674 at 1453301004
 1965 resque-1.25.2: Forked 18648 at 1453300907
 2007 resque-1.25.2: Forked 18481 at 1453300535
 2039 resque-1.25.2: Forked 18743 at 1453301105
 2061 resque-1.25.2: Forked 18558 at 1453300714
 2172 resque-1.25.2: Waiting for copy_destination_master_from_vantage
 2242 resque-1.25.2: Waiting for copy_destination_master_from_vantage
 2248 resque-1.25.2: Waiting for copy_destination_master_from_vantage
 2298 resque-1.25.2: Waiting for copy_destination_master_from_vantage
18481 resque-1.25.2: Processing transcode_destination_masters since 1453300535 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18558 resque-1.25.2: Processing transcode_destination_masters since 1453300714 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18620 resque-1.25.2: Processing transcode_destination_masters since 1453300883 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18630 resque-1.25.2: Processing transcode_destination_masters since 1453300891 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18648 resque-1.25.2: Processing transcode_destination_masters since 1453300907 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18674 resque-1.25.2: Processing transcode_destination_masters since 1453301004 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18743 resque-1.25.2: Processing transcode_destination_masters since 1453301105 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]
18754 resque-1.25.2: Processing transcode_destination_masters since 1453301118 [ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper]

So unless I am misunderstanding things, Resque.info is reporting three workers working but from ps it looks like there are eight.

Eight is what I would expect as we are starting eight workers with the resque:workers rake task.

Any help on this would be greatly appreciated.

Thanks

steveklabnik commented 8 years ago

@worldofchris yes, I would very much expect those numbers to be the same. hmm

antstorm commented 8 years ago

@steveklabnik poked around a bit and it seems that not only there are less tracked workers, but also none of the PIDs match (compared redis keys and workers set members agains ps output).

It also seems that the amount of reported workers for 2 different queues is limited to 3 (although there are actually 4 and 8 running processes). What's funny is that there's another machine (setup exactly the same using Ansible script, but with different queues) and it's reporting fine.

steveklabnik commented 8 years ago

Wow, that two machines thing is baffling. Especially since you're using Ansible; I would assume that it was some mis-configuration by hand otherwise.

Does this reproduce on a newly provisioned machine?

m-koepke commented 8 years ago

I can confirm this error.

5 Workers running:

~/apps/data_management/current$ ps aux | grep resque
306  0.0  0.3 419500 123320 ?       Sl   09:43   0:00 resque-1.25.2: Waiting for ftp_demons,folder_item_notifiers,cleanup                                                                                                                                                                                                                                 
635  0.0  0.3 415644 119556 ?       Sl   09:43   0:00 resque-1.25.2: Waiting for job_processors                                                                                                                                                                                                                               
958  0.0  0.3 414996 118320 ?       Sl   09:43   0:00 resque-1.25.2: Waiting for job_processors                                                                                                                                                                                                                               
1280  0.0  0.3 415440 119320 ?       Sl   09:44   0:00 resque-1.25.2: Waiting for job_processors                                                                                                                                                                                                                               
1613  0.0  0.3 415560 119548 ?       Sl   09:44   0:00 resque-1.25.2: Waiting for job_processors                                                                                                                                                                                                                               
1948  0.1  0.3 674472 118096 ?       Sl   09:44   0:01 resque-scheduler-4.0.0[production]: Schedules Loaded

PIDs match pidfiles:

Resque is just reporting 2 workers with wrong PIDs:

Resque::Worker.all
=> [#<Worker serverxxxx:32762:ftp_demons,folder_item_notifiers,cleanup>, #<Worker serverxxxx:1601:job_processors>]                                                                                                                                                                                

puts Resque.redis.smembers(:workers)
serverxxxx:32762:ftp_demons,folder_item_notifiers,cleanup
serverxxxx:1601:job_processors

Resque.info
=> {:pending=>0, :processed=>1139337, :queues=>4, :workers=>2, :working=>0, :failed=>0, :servers=>["redis://127.0.0.1:6379/0"], :environment=>"production"}

OS:
Ubuntu 14.04 LTS

Ruby:
ruby 2.2.3p173

Gems:
capistrano-resque (0.2.2 7e94a2f)
rails (4.2.5.1)
redis (3.2.2)
redis-namespace (1.5.2)
resque (1.25.2 e717e6e)
resque-scheduler (4.0.0)
resque_solo (0.1.0)

Redis:
Redis server v=2.8.4

Update:

This seems to be related to deployment with capistrano-resque. Starting the workers "by hand" seems to work. Looking into this more deeply.

Update 2:

Seems to be related to PIDFILE option. Starting with PIDFILE option does not add the resque workers to Redis. Starting without does:

With PIDFILE option workers do not appear in Redis

    bundle exec rake RACK_ENV=staging RAILS_ENV=staging  QUEUE="job_processors" PIDFILE=/home/apps/data_management/shared/tmp/pids/resque_work_5.pid VERBOSE=1 INTERVAL=5 resque:work >> log/resque.log 2>> log/resque.log &
    [2] 22455

    Resque.workers
    => [#<Worker serveryyyy:22455:job_processors>, #<Worker serveryyyy:19845:ftp_demons,folder_item_notifiers,cleanup>] 

    pp Redis.new.keys.select{|k| /job_processors/.match(k)}.sort;nil
    ["resque:data_management:stat:processed:serveryyyy:20175:job_processors", "resque:data_management:worker:serveryyyy:22455:job_processors:started"]

Without PIDFILE option workers do appear in Redis

    bundle exec rake RACK_ENV=staging RAILS_ENV=staging  QUEUE="job_processors" VERBOSE=1 INTERVAL=5 resque:work >> log/resque.log 2>> log/resque.log &
    [3] 22935

    Resque.workers
    => [#<Worker serveryyyy:22935:job_processors>, #<Worker serveryyyy:22455:job_processors>, #<Worker serveryyyy:19845:ftp_demons,folder_item_notifiers,cleanup>] 

    pp Redis.new.keys.select{|k| /job_processors/.match(k)}.sort;nil
    ["resque:data_management:stat:processed:serveryyyy:20175:job_processors", 
    "resque:data_management:stat:processed:serveryyyy:22455:job_processors",
    "resque:data_management:worker:serveryyyy:22455:job_processors:started",
    "resque:data_management:worker:serveryyyy:22935:job_processors:started"]

Update 3:

Specifying a revision before PIDFILE was moved from rake task to worker.rb fixes this:

gem 'resque', github: 'resque/resque', ref: '2aa6964f4f0097f2df2d7783ad262bf36b7c7907'

steveklabnik commented 8 years ago

Thanks so much for digging into this further!

lakesare commented 8 years ago

same issue when I'm trying to create 8 workers: when two queues processed in one queue (*), maximum amount of workers being created is 2. ps | grep resque displays them, pids display them, redis doesn't see them. first actual pid matches first worker that redis sees, second pid that redis sees changes as other workers are trying to be created.

(and using this gem 'resque', github: 'resque/resque', ref: '2aa6964f4f0097f2df2d7783ad262bf36b7c7907' worked indeed)

kitop commented 8 years ago

Having the same issue here. Reverting to https://github.com/resque/resque/commit/2aa6964f4f0097f2df2d7783ad262bf36b7c7907 solves the issue.

What was the reason for 7d0b20ee00986b749b03d4407af199039b504cc2?

rbhitchcock commented 8 years ago

Seeing the same behavior as @lakesare. Can only start 2 workers per queue name.

Ubuntu 14.04, Rails 4.2.5.1

Thanks!

sumantacaptain commented 8 years ago

Hi, I am using Ruby 1.8.7/ Rails 2.3.5, Resque 1.20.0. I got a scenario from last 3/4 days - my resque-web is not showing proper number of workers - its always showing the below mentioned worker details-

1 Workers The workers listed below are all registered as active on your system.

Where   Queues                       Processing
     idle            server02:12379

Though I have more workers which are not showing in resque-web and resque is not working too! In short - 1. resque is not working 2. workers are not showing in resque-web but showing in terminal. Can anyone please help?

hackhowtofaq commented 8 years ago

Guys just beaten by this! I'm writing this for reference

My problem was the time difference between the servers!!

Your servers must be in sync, otherwise the heartbeat timestamp will be > Resque.prune_interval and your worker will be removed from Redis but still running on server

For example, assume 3 servers server 1 -> date 12:41:58 server 2 -> date 12:41:58 server 3 -> date 13:41:58

If you start your workers first on server 1 & server 2, and then on server 3, you will only see server 3 workers on Redis, since the Resque clean up on server 3 removes the other two (1 & 2)

Even worse, if the time in one server is changing for some reason when all workers are up, prepare for chaos ;)

fw42 commented 8 years ago

Your servers must be in sync, otherwise the heartbeat timestamp will be > Resque.prune_interval

Ah. Hm. Yea, that sucks. I bet we can come up with a way to depend only on the Redis server's clock (not the worker's clocks).

hackhowtofaq commented 8 years ago

Use probably Redis time instead of Time.now ?

fw42 commented 8 years ago

Something like this might work: https://github.com/resque/resque/pull/1480/

fw42 commented 8 years ago

@hackhowtofaq, not sure if your issue is related to everybody else's issue in this thread. If someone else here could try out my branch and check if it solves their problem, that would be appreciated.

hackhowtofaq commented 8 years ago

Ty fw42, that was quick...

CarlosRoque commented 8 years ago

I was having the same problem. start 8 workers on the same queue only one shows up. started happening after updating to 1.26. I had to roll back to 1.25.2 and now everything works correctly

this worked resque 1.25.2 resque-retry 1.5.0 resque-scheduler 4.2.0

this did not work resque 1.26.0 resque-retry 1.5.0 resque-scheduler 4.3.0

uhrohraggy commented 7 years ago

Any update here. we have 4 in our low queue and 6 in our high queue, and only 0 of 3 workers is showing in the resque-console...we are also on 1.26.0

uhrohraggy commented 7 years ago

Just a quick update that reverting to 1.25.2 gem worked. 1.26.0 did not

fw42 commented 7 years ago

Can someone try to confirm if https://github.com/resque/resque/pull/1480 fixed the issue please?

davegudge commented 7 years ago

@fw42 I can confirm that bfd6b5a resolves the issue outlined in this thread.

I have now updated my Gemfile whist awaiting a new resque release:

# Tracking master to include an unreleased fix for https://github.com/resque/resque/issues/1357
# https://github.com/resque/resque/pull/1480/commits/bfd6b5a37c2d9a4d3bab30e793c7d5d25d8445d7
gem "resque", github: "resque/resque", ref: "20d8850"

Let me know if you require anymore details.

oss92 commented 7 years ago

Why not release 1.27.0 as a fix with the commit 20d8850?

fw42 commented 7 years ago

I don't have permission to publish a new version of resque to Rubygems. @steveklabnik does, but not sure how interested he still is in being involved with resque.

oss92 commented 7 years ago

would be very nice if you can publish a version @steveklabnik 👍 Also I am more than willing to help with resque.

evgeny-s commented 7 years ago

Same issue here.

I don't have permission to publish a new version of resque to Rubygems. @steveklabnik does, but not sure how interested he still is in being involved with resque.

Whether it means that new versions of Resque gem will never release?

davegudge commented 7 years ago

Perhaps it's time for someone else to take ownership if @steveklabnik is no longer interested, as v1.26.0 was released almost a year ago (11 Mar 2016)

evgeny-s commented 7 years ago

Good idea. What about @fw42? Seems he has access to repo, so he could maintain this repo in future.

oss92 commented 7 years ago

I forked and released master of this repo as https://rubygems.org/gems/resqueue. I am currently using it in production and works great.

evgeny-s commented 7 years ago

It's strange. I've deployed app like @davegudge sayd with the specific commit. But issue still exists.

ps aux | grep resque: 
0:00 resque-1.26.0: Waiting for 
0:00 resque-1.26.0: Waiting for 
0:00 resque-1.26.0: Waiting for

But Resque.info says:

:workers=>2

Also in Redis I have only 2 workers:

resque:worker:host:7925:*:started"
resque:worker:host:8041:*:started"

davegudge commented 7 years ago

How are you deploying @evgeny-s ? capistrano-resque?

evgeny-s commented 7 years ago

via capistrano

cap host deploy

evgeny-s commented 7 years ago

and deploy file has:

set :workers, {
    :"*" => 3
}

davegudge commented 7 years ago

So for reference, my Gemfile looks like this:

# Resque is a background job queue processor
# Tracking master to include an unreleased fix for https://github.com/resque/resque/issues/1357
# https://github.com/resque/resque/pull/1480/commits/bfd6b5a37c2d9a4d3bab30e793c7d5d25d8445d7
gem "resque", github: "resque/resque", ref: "20d8850"
...
group :development do
  ...
  # Capistrano 3
  # Locking to 3.5.0 as 'config/deploy.rb' currently locked at this version.
  gem "capistrano", "3.5.0", require: false
  ...
  gem "capistrano-resque", require: false, github: "skillstream/capistrano-resque", branch: "fire-and-forget"

evgeny-s commented 7 years ago

I also tried stop and start resque workers via ssh on the server. But issue still exists. We have Redis and workers on the same servers, so I'm not sure the problem in Time determination as described in that thread.

steveklabnik commented 7 years ago

Sorry all; I did not know I was a blocker here. I am happy to give current people pushing access. @fw42 can you email me?

steveklabnik commented 7 years ago

(And sorry i missed the at replies; the holidays were brutal for me and i'm still catching up on emails from november.)

fw42 commented 7 years ago

I'm happy to release a new version if given access. But I currently don't have the time to go through all the changes and determine if any of them are non-backwards-compatible (in order to not accidentally break semantic versioning). Anyone available to help with that?

corincerami commented 7 years ago

I've gone through the commits since 1.26.0 and I don't see anything that should be blocking a release or breaking backwards compatibility, so I'm prepping 1.27.0 now. It should probably be released today.

fw42 commented 7 years ago

❤️

corincerami commented 7 years ago

https://rubygems.org/gems/resque/versions/1.27.0

resque / resque