tableflip / guvnor

A node process manager that isn't spanners all the way down
MIT License

guv daemon dying in-flight #70

Closed randomsock closed 8 years ago

randomsock commented 9 years ago

This is different from #64, and more serious, because it's happened randomly on 2 different installations so far and taken everything down.

Node 0.12.4 guvnor 3.5.12

From /var/log/guvnor/guvnor.error.log at the time it disappeared, according to our monitor (same trace in both incidents):

{ date: 'Thu Jul 09 2015 03:58:11 GMT+0200 (CEST)',
  process:
   { pid: 32696,
     uid: 0,
     gid: 3017,
     cwd: '/var/run/guvnor',
     execPath: '/usr/local/bin/node',
     version: 'v0.12.4',
     argv:
      [ '/usr/local/bin/node',
        '/usr/local/lib/node_modules/guvnor/lib/daemon/index.js' ],
     memoryUsage: { rss: 177487872, heapTotal: 132897792, heapUsed: 20059776 } },
  os:
   { loadavg: [ 0.0615234375, 0.0869140625, 0.02880859375 ],
     uptime: 4191812 },
  trace:
   [ { column: 3,
       file: '_stream_writable.js',
       function: 'afterWrite',
       line: 361,
       method: null,
       native: false },
     { column: 7,
       file: '_stream_writable.js',
       function: 'onwrite',
       line: 352,
       method: null,
       native: false },
     { column: 5,
       file: '_stream_writable.js',
       function: 'TLSSocket.WritableState.onwrite',
       line: 105,
       method: 'WritableState.onwrite',
       native: false },
     { column: 12,
       file: 'net.js',
       function: 'WriteWrap.afterWrite',
       line: 787,
       method: 'afterWrite',
       native: false } ],
  stack:
   [ 'TypeError: object is not a function',
     '    at afterWrite (_stream_writable.js:361:3)',
     '    at onwrite (_stream_writable.js:352:7)',
     '    at TLSSocket.WritableState.onwrite (_stream_writable.js:105:5)',
     '    at WriteWrap.afterWrite (net.js:787:12)' ],
  level: 'error',
  message: 'uncaughtException: object is not a function',
  timestamp: '2015-07-09T01:58:11.523Z' }
achingbrain commented 9 years ago

Hmm, not great. dnode is wrapped in a TLS socket so that could be why it's coming from the TLSSocket stream.

Do you know what was happening on the server when it crashed?

randomsock commented 9 years ago

Nothing in particular. It was 3am local time, so pretty much dead zone even for our international users.

I've checked through various other logs and they all appear normal, if quiet. Nginx for example reverse-proxies to local services and was seeing nothing but our automated monitoring at the time, then the upstream to guvnor services suddenly stopped accepting (Connection Refused, as you'd expect).

randomsock commented 8 years ago

Any update on this? It's still happening, only occasionally but enough to put pressure on us to write a job to monitor and restart guv, which kinda defeats the object to some degree.

Again, nothing unusual was happening or reported at the time, just that same trace in guvnor.error.log.
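For anyone who ends up writing the monitor-and-restart job mentioned above, a minimal sketch might look like the following. It assumes you already know the daemon's pid by some means (this thread does not say where guvnor records it) and that you supply your own restart function; `isRunning`, `checkOnce` and the `restart` callback are hypothetical names for illustration, not part of guvnor.

```javascript
'use strict'

// True if a process with this pid exists. Signal 0 sends nothing;
// it only asks the OS whether the process can be signalled.
function isRunning (pid) {
  try {
    process.kill(pid, 0)
    return true
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // Anything else (typically ESRCH) is treated as "not running".
    return err.code === 'EPERM'
  }
}

// Run a single check. `restart` is a caller-supplied function (an assumption
// here) that does whatever is needed to bring the daemon back up.
function checkOnce (pid, restart) {
  if (isRunning(pid)) return true
  restart()
  return false
}
```

Calling `checkOnce` from `setInterval` or a cron job gives a basic watchdog. It only detects that the process is gone, so it would not catch a daemon that is still running but hung.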

Removed-5an commented 8 years ago

Same issue:

{
   "date":"Sat Oct 10 2015 09:28:41 GMT+0200 (CEST)",
   "process":{
      "pid":8683,
      "uid":0,
      "gid":1001,
      "cwd":"/run/guvnor",
      "execPath":"/home/sanity/.nvm/versions/node/v4.1.1/bin/node",
      "version":"v4.1.1",
      "argv":[
         "/home/sanity/.nvm/versions/node/v4.1.1/bin/node",
         "/home/sanity/.nvm/versions/node/v4.1.1/lib/node_modules/guvnor/lib/daemon/index.js"
      ],
      "memoryUsage":{
         "rss":44298240,
         "heapTotal":25778944,
         "heapUsed":15580496
      }
   },
   "os":{
      "loadavg":[
         1.3232421875,
         1.60498046875,
         1.6416015625
      ],
      "uptime":726480
   },
   "trace":[
      {
         "column":3,
         "file":"_stream_writable.js",
         "function":"afterWrite",
         "line":346,
         "method":null,
         "native":false
      },
      {
         "column":7,
         "file":"_stream_writable.js",
         "function":"onwrite",
         "line":337,
         "method":null,
         "native":false
      },
      {
         "column":5,
         "file":"_stream_writable.js",
         "function":"TLSSocket.WritableState.onwrite",
         "line":89,
         "method":"WritableState.onwrite",
         "native":false
      },
      {
         "column":12,
         "file":"net.js",
         "function":"WriteWrap.afterWrite",
         "line":772,
         "method":"afterWrite",
         "native":false
      }
   ],
   "stack":[
      "TypeError: object is not a function",
      "    at afterWrite (_stream_writable.js:346:3)",
      "    at onwrite (_stream_writable.js:337:7)",
      "    at TLSSocket.WritableState.onwrite (_stream_writable.js:89:5)",
      "    at WriteWrap.afterWrite (net.js:772:12)"
   ],
   "level":"error",
   "message":"uncaughtException: object is not a function",
   "timestamp":"2015-10-10T07:28:41.116Z"
}
vuza commented 8 years ago

The same thing happened to me twice (on two servers) in the last 5 days. Nothing remarkable was happening on the servers when the errors occurred.

Any help would be appreciated.

Regards, Marlon

{  
   "date":"Mon Nov 30 2015 09:19:47 GMT+0100 (CET)",
   "process":{  
      "pid":19109,
      "uid":0,
      "gid":1006,
      "cwd":"/run/guvnor",
      "execPath":"/usr/local/bin/node",
      "version":"v0.12.7",
      "argv":[  
         "/usr/local/bin/node",
         "/usr/local/lib/node_modules/guvnor/lib/daemon/index.js"
      ],
      "memoryUsage":{  
         "rss":88379392,
         "heapTotal":72061696,
         "heapUsed":11517336
      }
   },
   "os":{  
      "loadavg":[  
         0.0029296875,
         0.0146484375,
         0.04541015625
      ],
      "uptime":2836570
   },
   "trace":[  
      {  
         "column":3,
         "file":"_stream_writable.js",
         "function":"afterWrite",
         "line":361,
         "method":null,
         "native":false
      },
      {  
         "column":7,
         "file":"_stream_writable.js",
         "function":"onwrite",
         "line":352,
         "method":null,
         "native":false
      },
      {  
         "column":5,
         "file":"_stream_writable.js",
         "function":"TLSSocket.WritableState.onwrite",
         "line":105,
         "method":"WritableState.onwrite",
         "native":false
      },
      {  
         "column":12,
         "file":"net.js",
         "function":"WriteWrap.afterWrite",
         "line":787,
         "method":"afterWrite",
         "native":false
      }
   ],
   "stack":[  
      "TypeError: object is not a function",
      "    at afterWrite (_stream_writable.js:361:3)",
      "    at onwrite (_stream_writable.js:352:7)",
      "    at TLSSocket.WritableState.onwrite (_stream_writable.js:105:5)",
      "    at WriteWrap.afterWrite (net.js:787:12)"
   ],
   "level":"error",
   "message":"uncaughtException: object is not a function",
   "timestamp":"2015-11-30T08:19:47.081Z"
}
randomsock commented 8 years ago

@5an1ty @vuza - use PM2 instead, works well and is properly supported

alanshaw commented 8 years ago

Wow, when open source goes bad. I'm pretty sure that wasn't necessary @randomsock.

Could this possibly be the issue with Node.js that was fixed very recently: https://nodejs.org/en/blog/vulnerability/december-2015-security-releases/#cve-2015-8027-denial-of-service-vulnerability

randomsock commented 8 years ago

No offence intended @alanshaw, but with no sign of any real development in months despite issues outstanding, including almost all of my own dating back to April, I kinda figured this project was stale. In the meantime, PM2 has at last matured and is now production grade - remembering that we originally ditched PM2 out of frustration in favour of guvnor, which showed so much promise. My comment is entirely substantiated and was intended only as a suggestion to the others who have a job to do and are still concerned about guvnor's stability. If you can demonstrate that the Node issue you linked is the cause of this particular problem then that's great, but there's still the bigger picture.

achingbrain commented 8 years ago

@randomsock I appreciate your frustration with this issue, however I've been unable to replicate it so it's been rather hard to fix.

The issue seems to be that the socket connection has lost its context somewhere deep inside node core. It craps out invoking cb, which is stored in the state variable, so it sounds similar to the bug that @alanshaw linked to. If you could try upgrading node and see if you still have the problem, that would be helpful.
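To illustrate the failure mode described above, here is a simplified sketch of what happens when a stored write callback turns out not to be a function. It is not Node core's source; `makeState`, `afterWriteUnguarded` and `afterWriteGuarded` are hypothetical names, and the `writecb` property is only a stand-in for wherever the real callback is kept.

```javascript
'use strict'

// Hypothetical stand-in for a stream's internal write state.
function makeState (cb) {
  return { writecb: cb }
}

// Unguarded: calls whatever is stored. If the stored value is a plain
// object rather than a function, this throws a TypeError. Older V8
// versions worded it "object is not a function", as in the logs here.
function afterWriteUnguarded (state) {
  state.writecb()
}

// Guarded: only invokes the callback when it really is a function,
// and reports whether it did so.
function afterWriteGuarded (state) {
  if (typeof state.writecb !== 'function') return false
  state.writecb()
  return true
}
```

This only shows why the exception surfaces inside the write-completion path with no guvnor frames in the stack. It says nothing about how the callback came to be corrupted in the first place, which is the part that still needs a reproduction.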

Failing that, if you could attempt to provide some way to replicate the problem I will gladly look at it - if you provide a PR to fix the problem I will definitely merge it. This is open source - it's what we make it and everyone's time is voluntary. We all have jobs - if you are concerned about guvnor's stability and it's affecting your ability to do your job then please donate some time and help out by looking into the issue because, you know, given enough eyeballs all bugs are shallow and all that.