nnfuzzy opened this issue 10 years ago
When you say the UI, do you mean that when you visit the dagobah page in a web browser, it doesn't load? Or that the page loads, but doesn't do anything?
Yes, the first one. But I got no 404 or anything else. Next time it occurs I'll take a screenshot of the page and the process.
It may be useful to open the developer tools in whatever browser you use (Chrome/Firefox/Safari all have similar options) and look at the Network tab. That way, when the page fails to load, you can see which network call is failing.
Yes, I'll do that and try to force this event, because sometimes it's fine for weeks. One idea: could it have something to do with the job status reload (opening the browser) during high load on the server?
Yesterday I had this issue again. I used the Network tab in Chrome, and the problem is that Flask isn't able to respond, so there is no request information. But it's not as if the webserver is offline.
The proper solution here is probably to serve the app through a legit webserver (probably gunicorn or something) rather than Flask's built-in dev server. The Flask request thread must be dying for some reason and never getting restarted.
Good point. Perhaps with supervisord included it's possible to get more log information...
I am having the same issue and I am going to try running it with gunicorn and see. Thanks!
Just make sure you only run 1 process if you run it behind something like gunicorn (which supports multiple app processes). Otherwise you'll also spin up multiple scheduler threads, and you don't want that.
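To illustrate the warning above, here is a minimal sketch (not dagobah's actual code) of why multiple workers are a problem: each gunicorn worker process imports the app module, and a dagobah-style app starts its scheduler thread when that module is loaded, so `-w 4` would give you four schedulers all trying to run the same jobs. The loop below simulates four workers performing that import in a single process:

```python
import threading

scheduler_threads = []

def boot_worker():
    # Stand-in for "import dagobah_app": the scheduler thread is started
    # as a side effect of loading the app module.
    t = threading.Thread(target=lambda: None)  # placeholder for the scheduler loop
    t.daemon = True
    t.start()
    scheduler_threads.append(t)

for _ in range(4):  # what "gunicorn -w 4" would do, one import per worker
    boot_worker()

print(len(scheduler_threads))  # four schedulers competing for the same jobs
```

With `-w 1` there is exactly one scheduler, which is what you want.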
I had the same issue even though I run it behind gunicorn; that didn't fix it.
It's fine for days, but today, when I added a job, dagobah_jobs didn't get an update for next_run. It doesn't happen every time I add a job.
@zhenlongbai Are you able to retrieve the logs from that point? We've added a bunch of logging since this issue was originally reported. Additionally, since you're running into so many issues, it would probably be helpful to set your logging level to debug
in your config file.
OK, I have used Dagobah at work, and it ran very well for days. The logs had 89350 lines; I will change the logging level to debug to write a new log.
I have changed some code to make it work well for my jobs, for example UTC time and email.
Thanks for your help!
Today I had this issue again when I added a job.
When I click "start job from begin", it works once but doesn't get an update for next_run automatically.
My start script: nohup gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app &
My log:
[2015-04-22 12:46:37 +0000] [16527] [INFO] Worker exiting (pid: 16527)
[2015-04-22 12:46:37 +0000] [16522] [INFO] Handling signal: term
[2015-04-22 12:46:37 +0000] [16522] [INFO] Shutting down: Master
[2015-04-22 12:46:39 +0000] [20901] [INFO] Starting gunicorn 19.3.0
[2015-04-22 12:46:39 +0000] [20901] [INFO] Listening at: http://0.0.0.0:9876 (20901)
[2015-04-22 12:46:39 +0000] [20901] [INFO] Using worker: sync
[2015-04-22 12:46:39 +0000] [20906] [INFO] Booting worker with pid: 20906
/usr/local/lib/python2.7/site-packages/Crypto/Util/number.py:57: PowmInsecureWarning: Not using mpz_powm_sec. You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.
_warn("Not using mpz_powm_sec. You should rebuild using libgmp >= 5 to avoid timing attack vulnerability.", PowmInsecureWarning)
Logging output to /home/brdwork/logs/dagobah.log
Logger initialized at level DEBUG
Package pymongo has version 3.0 which is later than specified version 2.5. If you experience issues, try downgrading to version 2.5.
Starting app on 0.0.0.0:9876
Connected (version 2.0, client OpenSSH_4.3)
Authentication (publickey) successful!
Secsh channel 1 opened.
(the Connected / Authentication / Secsh channel lines above repeat eight more times)
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/site-packages/dagobah/core/components.py", line 114, in run
job.start()
File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 387, in start
self.initialize_snapshot()
File "/usr/local/lib/python2.7/site-packages/dagobah/core/core.py", line 672, in initialize_snapshot
raise DagobahError(reason)
DagobahError: no independent nodes detected
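The "no independent nodes detected" error means the job's task graph has no task without upstream dependencies, so the scheduler has nowhere to start (for example, when the tasks form a cycle). A generic sketch of that check (not dagobah's actual implementation):

```python
def independent_nodes(edges, nodes):
    """Return the nodes with no incoming edges, i.e. valid starting points
    for running a DAG of tasks."""
    has_parent = {child for _, child in edges}
    return [n for n in nodes if n not in has_parent]

# A two-task cycle: a -> b -> a. Neither task is independent, so a
# scheduler would raise an error like the one in the traceback above.
nodes = ["a", "b"]
edges = [("a", "b"), ("b", "a")]
print(independent_nodes(edges, nodes))  # []
```

A snapshot taken from a half-saved or corrupted job definition could hit the same condition even without an explicit cycle, which may be why it only happens sometimes when adding a job.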
I can also find the processes:
[brdwork@recbox04 shell_dagobah]$ ps aux | grep gunicorn
brdwork  20901  0.0  0.0 162228 12480 pts/3 S  12:46 0:00 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork  20906  0.5  0.0 379216 29808 pts/3 Sl 12:46 0:06 /usr/local/bin/python /usr/local/bin/gunicorn -b 0.0.0.0:9876 -w 1 dagobah_app:app
brdwork  22295  0.0  0.0  61228   784 pts/4 R+ 13:05 0:00 grep gunicorn
This is my DEBUG log. I think 'DEBUG:paramiko.transport:EOF in transport thread' is the key info. When the transport thread doesn't reach EOF, dagobah_jobs doesn't get an update.
DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b10L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
DEBUG:paramiko.transport:kex algos:['diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] server encrypt:['aes128-ctr', 'aes192-ctr', 'aes256-ctr', 'arcfour256', 'arcfour128', 'aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'aes192-cbc', 'aes256-cbc', 'arcfour', 'rijndael-cbc@lysator.liu.se'] client mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', 'hmac-ripemd160', 'hmac-ripemd160@openssh.com', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', 'zlib@openssh.com'] server compress:['none', 'zlib@openssh.com'] client lang:[''] server lang:[''] kex follows?False
DEBUG:paramiko.transport:Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
DEBUG:paramiko.transport:using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
DEBUG:paramiko.transport:Switch to new keys ...
DEBUG:paramiko.transport:Trying key a6f65c1f81dafe5b3fb0d897ccf342b2 from /home/brdwork/.ssh/id_rsa
DEBUG:paramiko.transport:userauth is OK
INFO:paramiko.transport:Authentication (publickey) successful!
DEBUG:paramiko.transport:[chan 1] Max packet in: 34816 bytes
DEBUG:paramiko.transport:[chan 1] Max packet out: 32768 bytes
INFO:paramiko.transport:Secsh channel 1 opened.
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] Sesch channel 1 request ok
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:EOF in transport thread
DEBUG:paramiko.transport:starting thread (client mode): 0x5ea7b90L
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_4.3)
(same key exchange, authentication, and channel setup lines as the first connection above)
DEBUG:paramiko.transport:[chan 1] EOF received (1)
DEBUG:paramiko.transport:[chan 1] EOF sent (1)
DEBUG:paramiko.transport:Sending global request "keepalive@lag.net"
(the keepalive line above repeats dozens of times)
I will try using supervisord to see if it breaks again.
Update 2016-12-30
My solution is to use Docker and a cron job to restart it every hour. Currently it works well, but we should still find the underlying reason why the UI breaks.
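For reference, the hourly restart described above can be a single crontab line; the container name `dagobah` here is an assumption, substitute whatever name the container actually runs under:

```shell
# Restart the (hypothetical) dagobah container at minute 0 of every hour.
0 * * * * docker restart dagobah
```

This is a workaround, not a fix: it papers over whichever thread is dying rather than keeping it alive.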
Hi,
sometimes (actually quite often) I can't reach the UI anymore. My suspicion is that a load peak on the server breaks the Flask UI. In the log I found only 200 responses at the end.
INFO:werkzeug:... - - [13/Jun/2014 08:37:17] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:..._ - - [13/Jun/2014 08:37:19] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:..._ - - [13/Jun/2014 08:37:20] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:..._ - - [13/Jun/2014 08:37:22] "GET /api/job?jobname=DMProcessing HTTP/1.1" 200 -
INFO:werkzeug:..._ - - [13/Jun/2014 08:37:23] "GET /api/job?job_name=DMProcessing HTTP/1.1" 200 -
I use the MongoDB backend, and the dagobah collections are in a separate db.
Many thanks for a hint, Christian