gsksivesh / dagobah

Simple DAG-based job scheduler in Python
Do What The F*ck You Want To Public License

Flask UI broken, probably after load peaks #30

Open gsksivesh opened 5 years ago

gsksivesh commented 5 years ago

Issue by nnfuzzy Friday Jun 13, 2014 at 08:50 GMT Originally opened as https://github.com/thieman/dagobah/issues/100


Hi,

sometimes (actually fairly often) I can't reach the UI anymore. My suspicion is that a load peak on the server breaks the Flask UI. In the log I only find the last 200 responses.

INFO:werkzeug:... - - [13/Jun/2014 08:37:17] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:19] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:20] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:22] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:... - - [13/Jun/2014 08:37:23] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -

I use the MongoDB backend, and the dagobah collections are in a separate db.

Many thanks for a hint. Christian

gsksivesh commented 5 years ago

Comment by rclough Monday Jun 16, 2014 at 19:41 GMT


When you say the UI, do you mean when you visit the dagobah page in a web browser, it doesn't load? Or that the page loads, but the page doesn't do anything?

gsksivesh commented 5 years ago

Comment by nnfuzzy Friday Jun 27, 2014 at 09:36 GMT


Yes, the first one. But I don't get a 404 or anything else. Next time it occurs I'll take a screenshot of the page and the process.

gsksivesh commented 5 years ago

Comment by rclough Friday Jun 27, 2014 at 14:43 GMT


It may be useful to open the developer tools in whatever browser you use (I know Chrome/Firefox/Safari have similar options) and look at the Network tab. That way, when the page fails to load, you can see which network call is failing.

gsksivesh commented 5 years ago

Comment by nnfuzzy Friday Jun 27, 2014 at 15:02 GMT


Yes, I'll do that and try to force the event, because sometimes it's fine for weeks. One idea: could it have something to do with the job status reload (on opening the browser) during high load on the server?

gsksivesh commented 5 years ago

Comment by nnfuzzy Friday Jul 25, 2014 at 06:45 GMT


Yesterday I had this issue again. I used the Network tab in Chrome, and the problem is that Flask isn't able to respond, so there is no request information. But it's not as if the "webserver" is offline.

gsksivesh commented 5 years ago

Comment by thieman Friday Jul 25, 2014 at 12:19 GMT


The proper solution here is probably to serve the app through a legit webserver (probably gunicorn or something) rather than Flask's built-in dev server. The Flask request thread must be dying for some reason and never getting restarted.

gsksivesh commented 5 years ago

Comment by nnfuzzy Monday Jul 28, 2014 at 13:40 GMT


Good point. Perhaps with supervisord included it would be possible to get more log information...
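
For reference, a minimal supervisord program entry along these lines would keep the process running and capture its output; the program name, dagobahd command path, and log paths here are assumptions, not taken from this thread:

[program:dagobah]
command=/usr/local/bin/dagobahd            ; assumed entry point -- adjust to your install
autostart=true
autorestart=true                           ; restart the process automatically if it dies
stdout_logfile=/var/log/dagobah/stdout.log
stderr_logfile=/var/log/dagobah/stderr.log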

gsksivesh commented 5 years ago

Comment by hussainsultan Saturday Aug 02, 2014 at 04:26 GMT


I am having the same issue and I am going to try running it with gunicorn and see. Thanks!

gsksivesh commented 5 years ago

Comment by thieman Saturday Aug 02, 2014 at 12:02 GMT


Just make sure you only run 1 process if you run it behind something like gunicorn (which supports multiple app processes). Otherwise you'll also spin up multiple scheduler threads, and you don't want that.
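
For reference, a single-worker invocation along these lines is what that constraint implies; the bind address and the dagobah_app:app module path match the start script posted later in this thread, so adjust them to your own setup:

gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app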

gsksivesh commented 5 years ago

Comment by zhenlongbai Tuesday Apr 21, 2015 at 06:47 GMT


I had the same issue and I ran it behind gunicorn, but it didn't help.

It's OK for days, but today, when I added a job, dagobah_jobs didn't get an update for next_run. It doesn't happen every time I add a job.

gsksivesh commented 5 years ago

Comment by thieman Tuesday Apr 21, 2015 at 12:22 GMT


@zhenlongbai Are you able to retrieve the logs from that point? We've added a bunch of logging since this issue was originally reported. Additionally, since you're running into so many issues, it would probably be helpful to set your logging level to debug in your config file.
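
For reference, the debug level is set in dagobahd.yml. A rough sketch follows, but the section and key names are assumptions rather than something confirmed in this thread, so verify them against the sample config shipped with your version:

# dagobahd.yml (sketch -- key names are assumptions, check your sample config)
Logging:
  Core:
    enabled: True
    logfile: /home/brdwork/logs/dagobah.log   # the path seen in the reporter's log later in the thread
    loglevel: debug                            # produces "Logger initialized at level DEBUG"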

gsksivesh commented 5 years ago

Comment by zhenlongbai Wednesday Apr 22, 2015 at 02:21 GMT


OK, I have used Dagobah at work, and it ran very well for days. The logs had 89,350 lines, and I will change the logging level to debug to write a new log.

I had to change some code to make it work well for my jobs, for example UTC time and email.

Thanks for your help!

gsksivesh commented 5 years ago

Comment by zhenlongbai Wednesday Apr 22, 2015 at 05:04 GMT


Today I had this issue again when I added a job.

When I click "start job from begin", it works once and next_run doesn't get updated automatically.

My start script: nohup gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app &

My log:

[2015-04-22 12:46:37 +0000] [16527] [INFO] Worker exiting (pid: 16527)
[2015-04-22 12:46:37 +0000] [16522] [INFO] Handling signal: term
[2015-04-22 12:46:37 +0000] [16522] [INFO] Shutting down: Master
[2015-04-22 12:46:39 +0000] [20901] [INFO] Starting gunicorn 19.3.0
[2015-04-22 12:46:39 +0000] [20901] [INFO] Listening at: http://0.0.0.0:9876 (20901)
[2015-04-22 12:46:39 +0000] [20901] [INFO] Using worker: sync
[2015-04-22 12:46:39 +0000] [20906] [INFO] Booting worker with pid: 20906
/usr/local/lib/python2.7/site-packages/Crypto/Util/number.py:57: PowmInsecureWarning: Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.
  _warn("Not using mpz_powm_sec.  You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.", PowmInsecureWarning)
Logging output to /home/brdwork/logs/dagobah.log
Logger initialized at level DEBUG
Package pymongo has version 3.0 which is later than specified version 2.5. If you experience issues, try downgrading to version 2.5.
Starting app on 0.0.0.0:9876
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/components.py", line 114, in run
    job.start()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 387, in start
    self.initialize_snapshot()
  File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 672, in initialize_snapshot
    raise DagobahError(reason)
DagobahError: no independent nodes detected
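
For context, "no independent nodes detected" generally means the job's task graph contains no node without upstream dependencies (an empty graph or a dependency cycle), so there is no task to start a run from. An illustrative sketch of that kind of check, not dagobah's actual code:

def independent_nodes(graph):
    # graph maps each task to its downstream tasks; a task is "independent"
    # if no other task points to it
    has_incoming = {child for children in graph.values() for child in children}
    return [node for node in graph if node not in has_incoming]

print(independent_nodes({'a': ['b'], 'b': ['c']}))  # ['a'] -- 'a' can start the run
print(independent_nodes({'a': ['b'], 'b': ['a']}))  # []   -- a cycle, nowhere to start
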
gsksivesh commented 5 years ago

Comment by zhenlongbai Wednesday Apr 22, 2015 at 05:07 GMT


I can also find the command:

[brdwork@recbox04 shell_dagobah]$ ps aux | grep gunicorn
brdwork 20901 0.0 0.0 162228 12480 pts/3 S  12:46 0:00 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork 20906 0.5 0.0 379216 29808 pts/3 Sl 12:46 0:06 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork 22295 0.0 0.0  61228   784 pts/4 R+ 13:05 0:00 grep gunicorn
[brdwork@recbox04 shell_dagobah]$

gsksivesh commented 5 years ago

Comment by zhenlongbai Wednesday Apr 22, 2015 at 06:31 GMT


This is my DEBUG log. I think 'DEBUG:paramiko.transport:EOF in transport thread' is the key info. When the thread doesn't reach EOF, dagobah_jobs doesn't get an update.

DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b10L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
DEBUG:paramiko.transport:kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Trying key a6f65c1f81dafe5b3fb0d897ccf342b2 from /home/brdwork/.ssh/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b90L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
DEBUG:paramiko.transport:kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Trying key a6f65c1f81dafe5b3fb0d897ccf342b2 from /home/brdwork/.ssh/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:Sending global request "keepalive@lag.net"
[the keepalive line above repeats 34 more times]
gsksivesh commented 5 years ago

Comment by BruceDone Thursday Dec 29, 2016 at 02:58 GMT


I will try to use supervisord to see if it breaks again.

Update 2016-12-30

My solution is to use Docker, with cron restarting it every hour. It currently works well, but we should still find the root cause of why the UI breaks.
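
A sketch of that workaround, assuming a container named dagobah (the container name and the hourly schedule are illustrative):

# crontab entry: restart the container at the top of every hour
0 * * * * /usr/bin/docker restart dagobah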