jglamine / phamdb

Quickly create and modify Phamerator databases.
GNU General Public License v3.0

Jobs stuck in queue #5

Closed KalimeroJr closed 8 years ago

KalimeroJr commented 8 years ago

Awesome work on this!

Unfortunately, after going through the create-database step (built from uploaded, validated phage GenBank files), jobs sit in the queue forever. Is there a way to see what a job is waiting for, or a way to run the jobs manually?

jglamine commented 8 years ago

This will typically happen if the worker process isn't running. Basically, the web interface puts jobs in a RabbitMQ queue, and the worker process consumes jobs from that queue.

If you're using the phamdb docker image, the worker process should launch automatically when the container starts. How did you install PhamDB?
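
If it is the Docker image, something like this run from the host should confirm both pieces (assuming Docker is on your PATH; replace <container> with whatever name Kitematic assigned to the PhamDB container):

docker exec <container> ps aux | grep celery
docker exec <container> rabbitmqctl list_queues

The first should show the celery worker process, and the second should show how many messages are sitting in the celery queue.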

KalimeroJr commented 8 years ago

That's what I thought.

I installed it through Docker Kitematic. It's the only container running. I can check for a specific process if you want.

KalimeroJr commented 8 years ago

Output that I can see through Kitematic:

2016-05-02 07:34:06,444 CRIT Supervisor running as root (no user in config file)
2016-05-02 07:34:06,445 WARN Included extra file "/etc/supervisor/conf.d/supervisor-app.conf" during parsing
Unlinking stale socket /var/run/supervisor.sock
2016-05-02 07:34:06,782 INFO RPC interface 'supervisor' initialized
2016-05-02 07:34:06,782 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-05-02 07:34:06,783 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-05-02 07:34:06,784 INFO supervisord started with pid 1
2016-05-02 07:34:07,786 INFO spawned: 'celery-app' with pid 5
2016-05-02 07:34:07,790 INFO spawned: 'nginx-app' with pid 6
2016-05-02 07:34:07,799 INFO spawned: 'rabbitmq' with pid 7
2016-05-02 07:34:07,801 INFO spawned: 'set-secret-key' with pid 8
2016-05-02 07:34:07,811 INFO spawned: 'app-uwsgi' with pid 9
2016-05-02 07:34:07,817 INFO spawned: 'start-mysql' with pid 10
2016-05-02 07:34:07,897 INFO exited: set-secret-key (exit status 0; not expected)
2016-05-02 07:34:08,166 CRIT reaped unknown pid 74)
2016-05-02 07:34:09,167 INFO success: celery-app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-05-02 07:34:09,168 INFO success: nginx-app entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-05-02 07:34:09,168 INFO success: rabbitmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-05-02 07:34:09,170 INFO spawned: 'set-secret-key' with pid 325
2016-05-02 07:34:09,171 INFO success: app-uwsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-05-02 07:34:09,171 INFO success: start-mysql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-05-02 07:34:09,201 INFO exited: set-secret-key (exit status 0; not expected)
2016-05-02 07:34:09,308 CRIT reaped unknown pid 367)
2016-05-02 07:34:09,851 CRIT reaped unknown pid 452)
2016-05-02 07:34:09,869 CRIT reaped unknown pid 411)
2016-05-02 07:34:09,876 CRIT reaped unknown pid 460)
2016-05-02 07:34:09,969 CRIT reaped unknown pid 482)
2016-05-02 07:34:09,979 CRIT reaped unknown pid 483)
2016-05-02 07:34:10,179 CRIT reaped unknown pid 500)
2016-05-02 07:34:10,180 CRIT reaped unknown pid 501)
2016-05-02 07:34:10,529 CRIT reaped unknown pid 515)
2016-05-02 07:34:11,534 INFO spawned: 'set-secret-key' with pid 546
2016-05-02 07:34:11,575 INFO exited: set-secret-key (exit status 0; not expected)
2016-05-02 07:34:13,136 CRIT reaped unknown pid 648)
2016-05-02 07:34:13,161 CRIT reaped unknown pid 549)
2016-05-02 07:34:14,445 INFO exited: start-mysql (exit status 0; expected)
2016-05-02 07:34:14,630 INFO spawned: 'set-secret-key' with pid 657
2016-05-02 07:34:14,630 CRIT reaped unknown pid 656)
2016-05-02 07:34:14,654 INFO exited: set-secret-key (exit status 0; not expected)
2016-05-02 07:34:14,691 INFO gave up: set-secret-key entered FATAL state, too many start retries too quickly
2016-05-02 07:34:14,692 CRIT reaped unknown pid 655)
2016-05-02 07:34:14,765 CRIT reaped unknown pid 668)
2016-05-02 07:34:14,769 CRIT reaped unknown pid 669)
2016-05-02 07:34:15,238 CRIT reaped unknown pid 702)
2016-05-02 07:34:16,149 CRIT reaped unknown pid 520)
2016-05-02 07:34:19,114 CRIT reaped unknown pid 519)
2016-05-02 07:34:19,133 INFO exited: rabbitmq (exit status 0; expected)
KalimeroJr commented 8 years ago

I also checked the DB, and the webphamerator database contains what looks like a valid job in the job table:

mysql> select * from job;
+----+-------------+---------------+-------------+-----------------+---------------------+------------+---------+------+---------+-----------+
| id | database_id | database_name | status_code | status_message  | modified            | start_time | runtime | seen | task_id | type_code |
+----+-------------+---------------+-------------+-----------------+---------------------+------------+---------+------+---------+-----------+
|  1 |           1 | PA-phages     | queued      | Waiting to run. | 2016-04-27 11:33:52 | NULL       | NULL    |    1 | NULL    | NULL      |
+----+-------------+---------------+-------------+-----------------+---------------------+------------+---------+------+---------+-----------+
1 row in set (0.00 sec)
KalimeroJr commented 8 years ago

Worker seems to be up:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 12:01 ?        00:00:00 /usr/bin/python /usr/bin/supervisord -n
root         5     1  0 12:01 ?        00:00:00 /usr/bin/python /usr/local/bin/celery -A webphamerator.app.celery worker
root         6     1  0 12:01 ?        00:00:00 nginx: master process /usr/sbin/nginx
root        10     1  0 12:01 ?        00:00:00 /usr/local/bin/uwsgi --ini /home/docker/code/docker/uwsgi.ini
www-data    22     6  0 12:01 ?        00:00:00 nginx: worker process
www-data    23     6  0 12:01 ?        00:00:00 nginx: worker process
www-data    24     6  0 12:01 ?        00:00:00 nginx: worker process
www-data    25     6  0 12:01 ?        00:00:00 nginx: worker process
rabbitmq    65     1  0 12:01 ?        00:00:00 /usr/lib/erlang/erts-5.9.1/bin/epmd -daemon
root        79     1  0 12:01 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe
root       423    10  0 12:01 ?        00:00:00 /usr/local/bin/uwsgi --ini /home/docker/code/docker/uwsgi.ini
root       424    10  0 12:01 ?        00:00:00 /usr/local/bin/uwsgi --ini /home/docker/code/docker/uwsgi.ini
root       431     5  0 12:01 ?        00:00:00 /usr/bin/python /usr/local/bin/celery -A webphamerator.app.celery worker
mysql      440    79  0 12:01 ?        00:00:00 /usr/sbin/mysqld --basedir=/usr --datadir=/dockerdata/mysql --plugin-dir=/usr/lib/mysql/plugin --user=
root       441    79  0 12:01 ?        00:00:00 logger -t mysqld -p daemon.error
rabbitmq   448     1  0 12:01 ?        00:00:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq   487   448  0 12:01 ?        00:00:00 /usr/lib/erlang/erts-5.9.1/bin/beam -W w -K true -A30 -P 1048576 -- -root /usr/lib/erlang -progname er
rabbitmq   706   487  0 12:01 ?        00:00:00 inet_gethost 4
rabbitmq   707   706  0 12:01 ?        00:00:00 inet_gethost 4
root       717     0  0 12:02 ?        00:00:00 sh
root       723   717  0 12:02 ?        00:00:00 mysql
root       757     0  0 12:04 ?        00:00:00 sh
root       763   757  0 12:05 ?        00:00:00 ps -ef
jglamine commented 8 years ago

Thanks, this is helpful. I'll look into it tonight and see if I can reproduce the issue.

If I remember correctly, the job table is just used to store metadata. It's possible for it to get out of sync with the actual RabbitMQ queue.

My guess is that your RabbitMQ queue is empty. I'd suggest manually deleting that row from the job table. I'm not sure what the database password is; you might want to check the Dockerfile to see how it was set up.
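
From your output above, the stale row has id 1, so inside the mysql client something like this should remove it (a suggestion only; adjust the id if yours differs):

mysql> DELETE FROM job WHERE id = 1;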

KalimeroJr commented 8 years ago

Regarding the RabbitMQ queue, you are correct:

# rabbitmqctl list_queues
Listing queues ...
celery  0
...done.

I'll delete the entry from the job table and try again. It looks as if PhamDB succeeds in populating the job table, but then either the message is never sent through the queue, or the celery worker accepts the message (and so empties the queue) but fails while creating the database?
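
To see what the worker actually does with the message, you can tail its output through supervisord from a shell inside the container (the program is named celery-app in the supervisor log above; this assumes the worker logs to stderr, which is celery's default):

supervisorctl status
supervisorctl tail -f celery-app stderr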

KalimeroJr commented 8 years ago

OK, found the culprit. It's all because of a memory alarm raised by RabbitMQ.

I noticed this in the RabbitMQ log:

=INFO REPORT==== 2-May-2016::16:01:39 ===
Memory limit set to 398MB of 996MB total.

=INFO REPORT==== 2-May-2016::16:01:39 ===
vm_memory_high_watermark set. Memory used:595445168 allowed:417785446

=INFO REPORT==== 2-May-2016::16:01:39 ===
    alarm_handler: {set,{{resource_limit,memory,rabbit@e6ca7f886fe7},[]}}

=INFO REPORT==== 2-May-2016::16:01:39 ===
Disk free limit set to 953MB

The memory alarm caused even the RabbitMQ "Hello World" example to fail.

By temporarily raising the threshold to 0.7 of total memory (rabbitmqctl set_vm_memory_high_watermark 0.7), the queue was freed and PhamDB now creates databases successfully.
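
Note that rabbitmqctl set_vm_memory_high_watermark only changes the running broker, so the setting is lost when the container restarts. If it needs to persist, one option (assuming this image uses the stock config location) is to write it into /etc/rabbitmq/rabbitmq.config inside the container and restart RabbitMQ:

echo '[{rabbit, [{vm_memory_high_watermark, 0.7}]}].' > /etc/rabbitmq/rabbitmq.config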

I'm not sure why RabbitMQ in this image reports such high "system" memory usage, though:

# rabbitmqctl status
Status of node rabbit@e6ca7f886fe7 ...
[{pid,4068},
 {running_applications,[{rabbit,"RabbitMQ","2.8.4"},
                        {mnesia,"MNESIA  CXC 138 12","4.7"},
                        {os_mon,"CPO  CXC 138 46","2.2.9"},
                        {sasl,"SASL  CXC 138 11","2.2.1"},
                        {stdlib,"ERTS  CXC 138 10","1.18.1"},
                        {kernel,"ERTS  CXC 138 10","2.15.1"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,596007536},
          {processes,10418222},
          {processes_used,10418208},
          {system,585589314},
          {atom,504409},
          {atom_used,473967},
          {binary,103168},
          {code,11874771},
          {ets,754232}]},
 {vm_memory_high_watermark,0.6999999998085141},
 {vm_memory_limit,731124531},
 {disk_free_limit,1000000000},
 {disk_free,12889649152},
 {file_descriptors,[{total_limit,1048476},
                    {total_used,7},
                    {sockets_limit,943626},
                    {sockets_used,4}]},
 {processes,[{limit,1048576},{used,154}]},
 {run_queue,0},
 {uptime,512}]
...done.

Anyway, maybe adding a health check of the RabbitMQ server would be beneficial, or changing its memory usage thresholds...
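
For the health-check idea, a rough sketch from a shell inside the container (the log path is the usual Debian default for rabbitmq-server and may differ in this image) would be to look for a raised memory alarm and confirm the queue is reachable:

grep -l "vm_memory_high_watermark set" /var/log/rabbitmq/*.log
rabbitmqctl list_queues

If the grep matches or list_queues fails, the web app could refuse to enqueue new jobs, or mark them as failed.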

jglamine commented 8 years ago

Thanks for digging into this more deeply, I'm glad you figured it out!

I agree that we should try to prevent this by changing the rabbitmq config and also marking jobs as "failed" if the queue isn't working.

I don't have time to actively develop PhamDB anymore, but if you want to volunteer to submit a pull request, I'd be happy to walk you through the process. Either way, thanks for submitting such a detailed bug report. If anyone wants to tackle it, there should be enough information here to reproduce it.

KalimeroJr commented 8 years ago

It seems time is a resource we are both low on :)

I'll just leave this here and close the "issue" for now.

susanruimingao commented 6 years ago

Hello, I understand that this issue has been closed. Unfortunately, I have encountered the same problem two years later. As I am quite new to PhamDB and RabbitMQ, could you kindly tell me how I can view the RabbitMQ log? I would like to know whether I have a similar "low memory alarm set by rabbitmq".

Sorry for two more simple questions: 1. Where can I access this "mysql> select * from job"? 2. How can I temporarily raise the threshold to 0.7 of total memory (rabbitmqctl set_vm_memory_high_watermark 0.7)?

Are these commands run in the VirtualBox terminal, Docker Toolbox, or Kitematic?

Many thanks for your time and kind response.