oveits / ProvisioningEngine

Ruby on Rails based ProvisioningEngine Frontend for provisioning of legacy systems via Apache Camel Backend (SOAP/XML+SPML+File import)

Delayed Jobs crashing/stopping #11

Closed oveits closed 8 years ago

oveits commented 8 years ago

For the second time within a few hours, we see that the Delayed Jobs worker has stopped:

provisioningengine@ProvisioningEngine:~/ProvisioningEnginev0.5.15$ ./status
Web Portal (productive):   running (PID=22822) on port 3000
Web Portal (development):   running (PID=19406) on port 3001
Delayed Jobs: not running

This leaves objects stuck in the status "waiting for provisioning" or "synchronization in progress" forever: image

ProvisioningTasks are hanging: image

The workaround is to restart the service as follows:

provisioningengine@ProvisioningEngine:~/ProvisioningEnginev0.5.15$ ./startDelayedJobs.sh
Delayed Jobs started: PID=29548

However, this is not a solution if the Delayed Jobs worker crashes frequently.

How can crashed processes like this be troubleshot? Are there any logs? Note that we cannot easily upgrade the Delayed Jobs gem, since the new versions have a problem (the retry and remove buttons are not shown in Delayed Jobs Web).

oveits commented 8 years ago

It has stopped again. If I restart it, Delayed Jobs stops again after a few seconds. Now trying a reboot.

oveits commented 8 years ago

This is the output if I start Delayed Jobs in the foreground:

./startDelayedJobs_foreground.sh
[Worker(host:ProvisioningEngine pid:23008)] Starting job worker
[Worker(host:ProvisioningEngine pid:23008)] Job Site.synchronizeAllSynchronously (id=1410) RUNNING
"------------- HttpPostRequest POST Data to http://192.168.113.104:80/ProvisioningEngine -----------------"
"{\"action\"=>\"Show Sites\", \"customerName\"=>\"ACME01\", \"OSVIP\"=>\"192.168.113.212\", \"OSVMgmtIP\"=>\"192.168.113.212\", \"OSVSshPort\"=>\"22\", \"XPRIP\"=>\"192.168.113.212\", \"UCIP\"=>\"192.168.113.208\", \"WebCDCIP\"=>\"UNKNOWN\", \"OSVauthUsername\"=>\"srx\", \"OSVauthPassword\"=>\"2GwN!gb4\", \"OSVauthPasswordRoot\"=>\"Pa$$w0rd!\", \"OSVauthPasswordSysad\"=>\"Asd123!.\", \"XPRauthUsername\"=>\"Administrator\", \"XPRauthPassword\"=>\"Pa$$w0rd\", \"UCauthUsername\"=>\"Administrator@system\", \"UCauthPassword\"=>\"Asd123!.\", \"FPCREATEOmit\"=>\"true\", \"FPAFOmit\"=>\"true\"}"
"----------------------------------------------------------"
"------------------resulttext------------------"
"resulttext = connection timout for http://192.168.113.104:80/ProvisioningEngine at 2016-01-14 12:45:14 +0100"
"returnvalue = 8"
"------------------resulttext------------------"
"SSSSSSSSSSSSSSSSSSSSSSSSS    Site.synchronizeAll responseBody    SSSSSSSSSSSSSSSSSSSSSSSSS"
"8"
synchronizeAllSynchronously: ERROR: wrong responseBody type (Fixnum) instead of String)
Delayed Jobs started: PID=
provisioningengine@ProvisioningEngine:~/ProvisioningEnginev0.5.15$

We can see that the Delayed Jobs worker is stopped by the abort found in "app/models/provisioningobject.rb":

abort "synchronizeAllSynchronously: ERROR: wrong responseBody type (#{responseBody.class.name}) instead of String)" unless responseBody.is_a?(String)

The intention of this abort was for the task to fail and be retried, but instead the whole Delayed Jobs worker stops (Kernel#abort raises SystemExit).
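Since Kernel#abort raises SystemExit rather than a StandardError, one possible way to let the job fail without relying on how the worker treats SystemExit would be to raise an ordinary exception instead. The following is only a minimal sketch: the error class WrongResponseBodyType and the helper ensure_string_response! are hypothetical names for illustration and are not part of the ProvisioningEngine code.

```ruby
# Hypothetical error class for illustration; not part of ProvisioningEngine.
class WrongResponseBodyType < StandardError; end

# Hypothetical helper mirroring the type check from provisioningobject.rb,
# but raising a StandardError subclass instead of calling Kernel#abort.
def ensure_string_response!(response_body)
  return response_body if response_body.is_a?(String)

  raise WrongResponseBodyType,
        "synchronizeAllSynchronously: ERROR: wrong responseBody type " \
        "(#{response_body.class.name}) instead of String"
end

# SystemExit (raised by abort) is not a StandardError, so a plain `rescue`
# does not catch it; the custom error above is a StandardError.
puts SystemExit.ancestors.include?(StandardError)
puts WrongResponseBodyType.ancestors.include?(StandardError)
```

With a check like this, a non-String responseBody (such as the return value 8 seen in the log above) surfaces as a regular job failure rather than a process exit.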

The task is a synchronizeAllSynchronously job. Here is the data from https://192.168.113.105/delayed_job/pending#1410 (the job will now be removed from the queue):

ID: 1410
Priority: 0
Attempts: 0
Handler:
--- !ruby/object:Delayed::PerformableMethod
object: !ruby/class 'Site'
method_name: :synchronizeAllSynchronously
args:
- - !ruby/object:Customer
    attributes:
      id: 113
      name: ACME01
      created_at: 2015-02-02 08:26:06.399517000 Z
      updated_at: 2015-09-02 20:11:27.115936000 Z
      status: deletion failed (timed out); trying again
      target_id: 44
      language: german
  - !ruby/object:Customer
    attributes:
      id: 114
      name: CCMFed
      created_at: 2015-02-02 08:56:59.226939000 Z
      updated_at: 2015-09-02 19:54:29.010734000 Z
      status: deletion failed (timed out); trying again
      target_id: 44
      language: englishGB
  - !ruby/object:Customer
    attributes:
      id: 190
      name: ExampleCustomerV8
      created_at: 2015-05-07 13:49:11.942109000 Z
      updated_at: 2015-09-02 19:03:34.693582000 Z
      status: deletion failed (timed out); trying again
      target_id: 44
      language: 
- false
- true
Run At: about 14 hours ago

oveits commented 8 years ago

After upgrading delayed_job to v4.1.1 (and delayed_job_active_record to v4.1.0) as follows:

gem install delayed_job_active_record
bundle update

Delayed Jobs seems to work again: if there is a timeout, the Delayed Jobs process no longer stops. Instead, the failed job is re-queued for another try (by default up to 25 attempts).
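For reference, the delayed_job README describes the default retry schedule as 5 seconds + N ** 4, where N is the number of attempts so far. The sketch below just prints that schedule in plain Ruby; it assumes the documented defaults are in effect here, and retry_delay_seconds is a hypothetical helper name, not a delayed_job method.

```ruby
# Defaults as documented in the delayed_job README (assumed unchanged here).
MAX_ATTEMPTS = 25

# Hypothetical helper: delay in seconds before the next try,
# after `attempts` failed attempts.
def retry_delay_seconds(attempts)
  5 + attempts**4
end

# Print the waiting time after each failed attempt.
(1...MAX_ATTEMPTS).each do |attempts|
  seconds = retry_delay_seconds(attempts)
  puts format("after failure %2d: retry in %7d s (%.1f h)",
              attempts, seconds, seconds / 3600.0)
end
```

If that many attempts turns out to be too many for synchronization jobs, the limit can probably be lowered via Delayed::Worker.max_attempts in an initializer.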

Full log: 1) Installation

$ gem install delayed_job_active_record
Fetching: delayed_job_active_record-4.1.0.gem (100%)
Successfully installed delayed_job_active_record-4.1.0
Parsing documentation for delayed_job_active_record-4.1.0
Installing ri documentation for delayed_job_active_record-4.1.0
Done installing documentation for delayed_job_active_record after 0 seconds
1 gem installed

$ bundle update
Fetching gem metadata from https://rubygems.org/...........
Fetching additional metadata from https://rubygems.org/..
Resolving dependencies...
Using rake 10.5.0
Using i18n 0.7.0
Using json 1.8.3
Using minitest 5.8.3
Using thread_safe 0.3.5
Using tzinfo 1.2.2
Using activesupport 4.1.4
Using builder 3.2.2
Using erubis 2.7.0
Using actionview 4.1.4
Using rack 1.5.5
Using rack-test 0.6.3
Using actionpack 4.1.4
Using mime-types 1.25.1
Using polyglot 0.3.5
Using treetop 1.4.15
Using mail 2.5.4
Using actionmailer 4.1.4
Using activemodel 4.1.4
Using arel 5.0.1.20140414130214
Using activerecord 4.1.4
Using addressable 2.4.0
Using execjs 2.6.0
Using autoprefixer-rails 6.2.3
Using sass 3.2.19
Using bootstrap-sass 3.3.5
Using bundler 1.7.4
Using mini_portile2 2.0.0
Using nokogiri 1.6.7.1
Using xpath 2.0.0
Using capybara 2.1.0
Using ffi 1.9.10
Using childprocess 0.5.9
Using choice 0.2.0
Using coffee-script-source 1.10.0
Using coffee-script 2.4.1
Using thor 0.19.1
Using railties 4.1.4
Using coffee-rails 4.0.1
Using cookiejar 0.3.0
Using daemons 1.2.3
Installing delayed_job 4.1.1 (was 4.0.6)
Using delayed_job_active_record 4.1.0 (was 4.0.1)
Using rack-protection 1.5.3
Using tilt 1.4.1
Using sinatra 1.4.6
Using delayed_job_web 1.2.5
Using diff-lcs 1.2.5
Using eventmachine 1.0.9.1
Using em-socksify 0.3.1
Using http_parser.rb 0.6.0
Using em-http-request 1.1.3
Using factory_girl 4.2.0
Using factory_girl_rails 4.2.0
Using figaro 1.1.1
Using hike 1.2.3
Using multi_json 1.11.2
Using jbuilder 2.4.0
Using jquery-rails 3.1.4
Using turbolinks 2.5.3
Using jquery-turbolinks 2.1.0
Using kaminari 0.16.3
Using pg 0.15.1
Using sprockets 2.12.4
Using sprockets-rails 2.3.3
Using rails 4.1.4
Using ruby-graphviz 1.2.2
Using rails-erd 1.4.5
Using rails_serve_static_assets 0.0.4
Using rails_stdout_logging 0.0.4
Using rails_12factor 0.0.2
Using rdoc 4.2.1
Using respond-rails 1.0.1
Using rspec-core 2.13.1
Using rspec-expectations 2.13.0
Using rspec-mocks 2.13.1
Using rspec-rails 2.13.1
Using rubyzip 0.9.9
Using sass-rails 4.0.5
Using sdoc 0.4.1
Using seed_dump 3.2.4
Using websocket 1.0.7
Using selenium-webdriver 2.35.1
Using spork 1.0.0rc4
Using spork-rails 4.0.0
Using sqlite3 1.3.11
Using uglifier 2.7.2
Your bundle is updated!

Test:

./startDelayedJobs_foreground.sh
[Worker(host:ProvisioningEngine pid:24994)] Starting job worker
[Worker(host:ProvisioningEngine pid:24994)] Job Site.synchronizeAllSynchronously (id=1445) RUNNING
"------------- HttpPostRequest POST Data to http://192.168.113.104:80/ProvisioningEngine -----------------"
"{\"action\"=>\"Show Sites\", \"customerName\"=>\"CSL6796\", \"OSVIP\"=>\"192.168.124.42\", \"OSVMgmtIP\"=>\"192.168.124.42\", \"OSVSshPort\"=>\"22\", \"XPRIP\"=>\"192.168.112.73\", \"UCIP\"=>\"192.168.124.42\", \"WebCDCIP\"=>\"UNKNOWN\", \"OSVauthUsername\"=>\"srx\", \"OSVauthPassword\"=>\"2GwN!gb4\", \"OSVauthPasswordRoot\"=>\"T@R63dis\", \"OSVauthPasswordSysad\"=>\"Asd123!.\", \"XPRauthUsername\"=>\"Administrator\", \"XPRauthPassword\"=>\"Asd123!.\", \"UCauthUsername\"=>\"Administrator@system\", \"UCauthPassword\"=>\"Asd123!.\", \"FPCREATEOmit\"=>\"true\", \"FPAFOmit\"=>\"true\"}"
"----------------------------------------------------------"
provisioning.deliver: connection timout of one or more target systems
[Worker(host:ProvisioningEngine pid:24994)] Job Site.synchronizeAllSynchronously (id=1445) FAILED (0 prior attempts) with SystemExit: provisioning.deliver: connection timout of one or more target systems
[Worker(host:ProvisioningEngine pid:24994)] Job Site.synchronizeAllSynchronously (id=1446) RUNNING
"------------- HttpPostRequest POST Data to http://192.168.113.104:80/ProvisioningEngine -----------------"
"{\"action\"=>\"Show Sites\", \"customerName\"=>\"AirportMCO\",...

We can see that the worker process keeps running after the connection timeout of one or more target systems and continues with the next job.