thieman / dagobah

Simple DAG-based job scheduler in Python
Do What The F*ck You Want To Public License
755 stars 160 forks

Flask UI broken, probably after load peaks #100

Open nnfuzzy opened 10 years ago

nnfuzzy commented 10 years ago

Hi,

Sometimes (actually quite often) I can't reach the UI anymore. My suspicion is that a load peak on the server broke the Flask UI. In the log I found only 200 responses; these are the last entries:

INFO:werkzeug:... - - [13/Jun/2014 08:37:17] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:19] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:20] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:22] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:23] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -

I use the MongoDB backend, and the dagobah collections are in a separate db.

Many thanks for any hint,

Christian

rclough commented 10 years ago

When you say the UI, do you mean that when you visit the dagobah page in a web browser, it doesn't load? Or that the page loads, but doesn't do anything?

nnfuzzy commented 10 years ago

Yes, the first one. But I didn't get a 404 or anything else. The next time it occurs I'll take a screenshot of the page and the process.

rclough commented 10 years ago

It may be useful if you can open the developer tools in whatever browser you use (I know Chrome/Firefox/Safari have similar options) and look at the network tab. That way, when the page fails to load, you can see which network call is failing.

nnfuzzy commented 10 years ago

Yes, I'll do that and try to force the event, because sometimes it's fine for weeks. One idea: could it have something to do with the job status reload (on opening the browser) during high load on the server?

nnfuzzy commented 10 years ago

Yesterday I had this issue again. I used the network tab in Chrome, and the problem is that Flask is not able to respond, so there is no request information. But it's not as if the "webserver" is offline.

thieman commented 10 years ago

The proper solution here is probably to serve the app through a legit webserver (probably gunicorn or something) rather than Flask's built-in dev server. The Flask request thread must be dying for some reason and never getting restarted.

nnfuzzy commented 10 years ago

Good point. Perhaps with supervisord it would also be possible to get more log information...
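A minimal supervisord program block for this could look something like the following sketch (the command path, log location, and program name are assumptions, not from this thread):

```ini
[program:dagobah]
; Run the dagobah daemon, restart it automatically if it dies,
; and capture stdout/stderr in one log file for debugging.
command=/usr/local/bin/dagobahd
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/var/log/dagobah/supervisor.log
```

With `autorestart=true`, supervisord would bring the process back up after a crash, and the captured log might show what killed it.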

hussainsultan commented 10 years ago

I am having the same issue, and I am going to try running it with gunicorn and see. Thanks!

thieman commented 10 years ago

Just make sure you only run 1 process if you run it behind something like gunicorn (which supports multiple app processes). Otherwise you'll also spin up multiple scheduler threads, and you don't want that.

zhenlongbai commented 9 years ago

I had the same issue, so I ran it behind gunicorn, but that didn't fix it.

It was OK for days, but today, when I added a job, dagobah_jobs didn't get an update for next_run. It doesn't happen every time I add a job.

thieman commented 9 years ago

@zhenlongbai Are you able to retrieve the logs from that point? We've added a bunch of logging since this issue was originally reported. Additionally, since you're running into so many issues, it would probably be helpful to set your logging level to debug in your config file.

zhenlongbai commented 9 years ago

OK, I have used Dagobah at work and it ran very well for days. The log had 89350 lines, and I will change the logging level to debug to write a new one.

I had to change some code to make it work well for my jobs, for example UTC time and email.

Thanks for your help!

zhenlongbai commented 9 years ago

Today I had this issue again when I added a job.

When I click "start job from begin", it works once but next_run doesn't get updated automatically.

My start script: nohup gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app &

My log:

[2015-04-22 12:46:37 +0000] [16527] [INFO] Worker exiting (pid: 16527)
[2015-04-22 12:46:37 +0000] [16522] [INFO] Handling signal: term
[2015-04-22 12:46:37 +0000] [16522] [INFO] Shutting down: Master
[2015-04-22 12:46:39 +0000] [20901] [INFO] Starting gunicorn 19.3.0
[2015-04-22 12:46:39 +0000] [20901] [INFO] Listening at: http://0.0.0.0:9876 (20901)
[2015-04-22 12:46:39 +0000] [20901] [INFO] Using worker: sync
[2015-04-22 12:46:39 +0000] [20906] [INFO] Booting worker with pid: 20906
/usr/local/lib/python2.7/site-packages/Crypto/Util/number.py:57: PowmInsecureWarning: Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.
  _warn("Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.", PowmInsecureWarning)
Logging output to /home/brdwork/logs/dagobah.log
Logger initialized at level DEBUG
Package pymongo has version 3.0 which is later than specified version 2.5. If you experience issues, try downgrading to version 2.5.
Starting app on 0.0.0.0:9876
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/components.py", line 114, in run
    job.start()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 387, in start
    self.initialize_snapshot()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 672, in initialize_snapshot
    raise DagobahError(reason)
DagobahError: no independent nodes detected
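The traceback above means the job graph has no entry point: a run can only begin at tasks with no upstream dependencies. A minimal sketch (not dagobah's actual implementation; the graph shape and function name are assumptions) of why a dependency cycle triggers this error:

```python
def independent_nodes(graph):
    """Return nodes that nothing else points to (no incoming edges).

    graph maps each node to the list of nodes that depend on it.
    """
    dependents = set()
    for downstream in graph.values():
        dependents.update(downstream)
    return [node for node in graph if node not in dependents]

# A healthy job: 'extract' has no upstream task, so the run can start there.
ok = {'extract': ['transform'], 'transform': ['load'], 'load': []}
print(independent_nodes(ok))  # ['extract']

# A broken job: every task depends on another (a cycle), so there is no
# entry point, and a start() on such a graph has nothing to schedule first.
broken = {'a': ['b'], 'b': ['a']}
print(independent_nodes(broken))  # []
```

If a saved job snapshot ends up in the second state (for example through a corrupted edge list in the backend), starting it would fail exactly like the traceback shows.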

zhenlongbai commented 9 years ago

I can also find the processes:

[brdwork@recbox04 shell_dagobah]$ ps aux | grep gunicorn
brdwork 20901 0.0 0.0 162228 12480 pts/3 S  12:46 0:00 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork 20906 0.5 0.0 379216 29808 pts/3 Sl 12:46 0:06 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork 22295 0.0 0.0  61228   784 pts/4 R+ 13:05 0:00 grep gunicorn

zhenlongbai commented 9 years ago

This is my DEBUG log. I think 'DEBUG:paramiko.transport:EOF in transport thread' is the key info. When the transport thread doesn't reach EOF, dagobah_jobs doesn't get an update.

DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b10L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
DEBUG:paramiko.transport:kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Trying key a6f65c1f81dafe5b3fb0d897ccf342b2 from /home/brdwork/.ssh/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b90L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
DEBUG:paramiko.transport:kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Trying key a6f65c1f81dafe5b3fb0d897ccf342b2 from /home/brdwork/.ssh/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:Sending global request "keepalive@lag.net"
[last line repeated 34 more times]
BruceDone commented 7 years ago

I will try to use supervisord and see if it breaks again.

Update 2016-12-30

My solution is to use Docker and to use cron to restart it every hour. Currently it works well, but we should still find the underlying reason why the UI breaks.
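For reference, the hourly restart workaround described here can be a single crontab entry (a sketch only; the container name dagobah is an assumption):

```
# crontab entry: restart the dagobah container at the top of every hour
0 * * * * /usr/bin/docker restart dagobah
```

This only papers over the hang, of course; the debug logs above are still the path to the real cause.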