Starting / Stopping workers when using Capistrano

joshuapaling commented 9 years ago

I'm using Capistrano v3 for deployment, along with Cake-Resque.

Resque is not correctly finding existing workers. For example, see the below terminal output (note the ps ux shows that a resque worker is running, but the stats indicate none exist):

staging@sabre740:~/public_html/current/app/Console$ ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
staging   5879  0.0  0.0  10772  1400 ?        S    12:51   0:00 bash -c cd '/home/staging/public_html/releases/20141014015143/app/Vendor/kamisama/php-resque-ex';     VERBOSE=true  QUEUE='defaul
staging   5880  0.0  0.7 161236 15348 ?        S    12:51   0:00 php ./bin/resque
staging   6513  0.0  0.0  16836  1256 pts/6    R+   12:55   0:00 ps ux
staging  16485  0.0  0.0  73312  1660 ?        S    11:07   0:00 sshd: staging@pts/6
staging  16486  0.0  0.1  20688  3600 pts/6    Ss   11:07   0:00 -bash
staging@sabre740:~/public_html/current/app/Console$
staging@sabre740:~/public_html/current/app/Console$
staging@sabre740:~/public_html/current/app/Console$ /home/staging/public_html/releases/20141014015143/app/Console/cake CakeResque.CakeResque stats

Resque Statistics
---------------------------------------------------------------

Jobs Stats
   Processed Jobs :            0
   Failed Jobs    :            0

Queues Stats
   Queues count : 0

Workers Stats
   Workers count : 0

staging@sabre740:~/public_html/current/app/Console$

This means that each time I deploy a new release, Resque doesn't stop the old workers, but it does start a new one. So I end up with a growing number of workers hanging around, each tied to a different one of Capistrano's release paths. This causes various issues, including the occasional white screen of death in my app: when an old worker runs from an old release, Cake still caches file paths from that release, so Cake ends up trying to run with some files from the current release and some files from previous releases.

wa0x6e commented 9 years ago

Does Capistrano have permission to stop the workers? How are you restarting the workers on deployment? What's the Capistrano output when running the command that restarts the workers?

joshuapaling commented 9 years ago

Hi, thanks for the quick response.

Yes, capistrano does have permission.

I've only got one worker (default) and one queue (default) going at a time. I initially tried restarting with execute "#{release_path}/app/Console/cake CakeResque.CakeResque restart "

But now I'm instead doing it separately:

execute "#{release_path}/app/Console/cake CakeResque.CakeResque stop"
execute "#{release_path}/app/Console/cake CakeResque.CakeResque start"
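
For reference, the stop/start pair above could be built from a single release path with a small helper, so both commands always point at the same release. This is only a sketch: the helper name resque_commands is hypothetical, and only the two cake invocations are taken from the commands above.

```ruby
# Hypothetical helper (not part of Capistrano or Cake-Resque):
# builds the stop and start commands for one release path.
def resque_commands(release_path)
  cake = File.join(release_path.to_s, 'app', 'Console', 'cake')
  %w[stop start].map { |action| "#{cake} CakeResque.CakeResque #{action}" }
end

# Inside a Capistrano task, each string would be passed to `execute`, e.g.
#   resque_commands(release_path).each { |cmd| execute cmd }
# Here the commands are just printed so the sketch runs on its own.
resque_commands('/home/staging/public_html/current').each { |cmd| puts cmd }
```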

I've been trying to see how tied the problem is to Capistrano. I did the following:

1. ssh into server and start a new worker from within the current symlink. It works, and I can see it when I do ./cake CakeResque.CakeResque stats.

Workers Stats
   Workers count : 1
    REGULAR WORKERS
    * sabre740.anchor.net.au:20918:default
       - Started on     : Tue Oct 14 14:18:43 EST 2014
       - Processed Jobs : 0
       - Failed Jobs    : 0

2. On my local computer, run bundle exec cap staging deploy. It works, and adds a new release. I temporarily disabled any starting / stopping of workers during the Capistrano deploy.
3. ssh into server and try to view worker stats from within the current symlink (which now points to the newest release). Resque can no longer see the worker:

Workers Stats
   Workers count : 0

and also cannot stop the worker:

staging@sabre740:~/public_html$ ./current/app/Console/cake  CakeResque.CakeResque stop
Stopping workers
   There is no workers to stop ...

Although it is still running:

staging@sabre740:~/public_html$ ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
staging   8685  0.0  0.0  73312  1664 ?        S    13:09   0:00 sshd: staging@pts/1
staging   8686  0.0  0.1  20696  3612 pts/1    Ss   13:09   0:00 -bash
staging  16106  0.0  0.7 161236 15348 ?        S    13:49   0:00 php ./bin/resque
staging  16485  0.0  0.0  73312  1660 ?        S    11:07   0:00 sshd: staging@pts/6
staging  16486  0.0  0.1  20704  3628 pts/6    Ss   11:07   0:00 -bash
staging  19400  0.0  0.7 161236 15348 ?        S    14:01   0:00 php ./bin/resque
staging  20695  0.0  0.0  19224  1112 pts/1    S+   14:16   0:00 redis-cli
staging  20917  0.0  0.0  10780  1416 pts/6    S    14:18   0:00 bash -c cd '/home/staging/public_html/releases/20141014030140/app/Vendor/kamisama/php-resque-ex';     VERBOSE=true  QUEUE='default'  PIDFILE='/home/stag
staging  20918  0.0  0.7 161236 15168 pts/6    S    14:18   0:00 php ./bin/resque
staging  22832  0.0  0.0  16836  1268 pts/6    R+   14:32   0:00 ps ux

I've been battling with this for a few days now. I'm tempted to just kill all processes with 'resque' in the name on each deploy. Though I know that's a terrible solution.

Here's a bunch of info about one of my workers:

cd '/home/staging/public_html/releases/20141014024927/app/Vendor/kamisama/php-resque-ex';     VERBOSE=true  QUEUE='default'  

PIDFILE='/home/staging/public_html/releases/20141014024927/app/Plugin/CakeResque/tmp/14132549791315'  

APP_INCLUDE='/home/staging/public_html/releases/20141014024927/app/Plugin/CakeResque/Lib/CakeResqueBootstrap.php'  

RESQUE_PHP='/home/staging/public_html/releases/20141014024927/app/Vendor/kamisama/php-resque-ex/lib/Resque.php'  INTERVAL=5  REDIS_BACKEND='localhost:6379'  REDIS_DATABASE=1  REDIS_NAMESPACE='resque'  REDIS_PASSWORD=''  

CAKE='/home/staging/public_html/releases/20141014024927/lib/Cake/'  

APP='/home/staging/public_html/releases/20141014024927/app/'  COUNT=1  LOGHANDLER='RotatingFile'  

LOGHANDLERTARGET='/home/staging/public_html/releases/20141014024927/app/tmp/logs/resque.log'  php './bin/resque'  

>> '/home/staging/public_html/releases/20141014024927/app/tmp/logs/resque-worker-error.log'  2>&1

Is it normal to have so many references to the specific release dir (in this case /releases/20141014024927)? I suspect that's part of the issue.

joshuapaling commented 9 years ago

PS, I ssh'd in, and tried finding the path of a resque worker that's definitely running with ps ux, and then executing the stats command on the full path of that worker - but it still gave no results:

staging@sabre740:~/public_html$ /home/staging/public_html/releases/20141014030140/app/Console/cake CakeResque.CakeResque stats

Resque Statistics
---------------------------------------------------------------

Jobs Stats
   Processed Jobs :            0
   Failed Jobs    :            0

Queues Stats
   Queues count : 0

Workers Stats
   Workers count : 0

joshuapaling commented 9 years ago

So it turns out when I try to clear Cake's default cache, it's also clearing a bunch of resque-related keys. I'm using Redis for Cake's caching - I'm going to switch back to the file cache, and I think that should resolve the issue. Thanks for your help.

UPDATE: There was no need to switch away from Redis for caching. I was missing a 'prefix' option in my default cache config, which means that when you clear that cache, Redis clears ALL keys (it tries to clear every key matching the prefix, and when the prefix is blank, every key matches).
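
To illustrate why a blank prefix is so destructive, here is a simplified in-memory model of a prefix-based clear. It is not CakePHP's actual Redis engine code, and the key names and the clear_cache helper are made up for the example. It only shows that every string starts with the empty string, so an empty prefix matches the Resque keys too.

```ruby
# Simplified model of a prefix-based cache clear.
# NOT CakePHP's implementation; names and keys are illustrative only.
def clear_cache(store, prefix)
  doomed = store.keys.select { |key| key.start_with?(prefix) }
  doomed.each { |key| store.delete(key) }
  doomed
end

# Cake cache keys and Resque keys sharing one (simulated) Redis database.
def sample_store
  {
    'cake_core_file_map' => '...',
    'resque:workers'     => '...',
    'resque:queues'      => '...'
  }
end

with_prefix = sample_store
clear_cache(with_prefix, 'cake_')
puts with_prefix.keys.inspect   # Resque keys survive

no_prefix = sample_store
clear_cache(no_prefix, '')
puts no_prefix.keys.inspect     # everything is gone, including the worker list
```

With the Resque keys wiped, the stats command has nothing to report even though the worker processes are still running, which matches the behaviour described earlier in this thread.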

Really, if you're using Redis for Cake cache and Resque, you should probably use separate redis databases for each.

wa0x6e commented 9 years ago

You can change Cake-Resque's redis database and prefix in the plugin config file.

And the database clearing when there is no prefix seems dangerous. Maybe you could open a ticket on the CakePHP repo and ask for a check to prevent this.

joshuapaling commented 9 years ago

Done already - https://github.com/cakephp/cakephp/issues/4876

Thanks again for the response, and for this plugin.

wa0x6e / Cake-Resque

Starting / Stopping workers when using Capistrano #71