kuzzleio / kuzzle

Open-source Back-end, self-hostable & ready to use - Real-time, storage, advanced search - Web, Apps, Mobile, IoT
https://kuzzle.io
Apache License 2.0

Kuzzle crashes when ES throws a request refusal exception #900

Closed: scottinet closed this issue 6 years ago

scottinet commented 7 years ago

How to reproduce: send 200+ user creation requests in rapid succession (using the Kuzzle JS SDK):

const 
  Kuzzle = require('kuzzle-sdk'),
  Bluebird = require('bluebird');

const kuzzle = new Kuzzle('localhost', (err, res) => {
  if (err) {
    console.error(err);
    process.exit(1);
  }

  const promises = [];

  for (let i = 0 ; i < 250 ; i++) {
    const userId = `foo${i}`;
    promises.push(kuzzle.security.createUserPromise(userId, {
      content: { profileIds: ['default']},
      credentials: {
        local: {
          username: userId,
          password: 'foobar'
        }
      }
    }));
  }

  Bluebird.all(promises)
    .then(() => process.exit(0));
});

Error thrown by Kuzzle:

InternalError: An error occured during the creation of user "foo44": [es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@21870f08 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@663515b9[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 500]];[es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@1e5ee045 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@663515b9[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 500]]
     at InternalError (/var/app/node_modules/kuzzle-common-objects/lib/errors/internalError.js:5:5)
     at persistUserAndCredentials (/var/app/lib/api/controllers/securityController.js:1065:11)
     at next (native)
 From previous event:
     at SecurityController.createUser (/var/app/lib/api/controllers/securityController.js:452:57)
     at triggerEvent.then.newRequest (/var/app/lib/api/controllers/funnelController.js:373:75)
 From previous event:
     at FunnelController.processRequest (/var/app/lib/api/controllers/funnelController.js:370:8)
     at checkRights.then.modifiedRequest (/var/app/lib/api/controllers/funnelController.js:205:37)
     at runCallback (timers.js:637:20)
     at tryOnImmediate (timers.js:610:5)
     at processImmediate [as _immediateCallback] (timers.js:582:5)
 From previous event:
     at FunnelController.execute (/var/app/lib/api/controllers/funnelController.js:205:8)
     at KuzzleProxy.onRequest (/var/app/lib/api/core/entryPoints/kuzzleProxy.js:79:22)
     at handlers.(anonymous function).forEach.cb (/var/app/lib/services/broker/wsBrokerClient.js:190:47)
     at Array.forEach (native)
     at WebSocket.client.socket.on.raw (/var/app/lib/services/broker/wsBrokerClient.js:190:33)
     at emitOne (events.js:96:13)
     at WebSocket.emit (events.js:188:7)
     at Receiver._receiver.onmessage (/var/app/node_modules/ws/lib/WebSocket.js:146:47)
     at Receiver.dataMessage (/var/app/node_modules/ws/lib/Receiver.js:389:14)
     at Receiver.getData (/var/app/node_modules/ws/lib/Receiver.js:330:12)
     at Receiver.startLoop (/var/app/node_modules/ws/lib/Receiver.js:165:16)
     at Receiver.add (/var/app/node_modules/ws/lib/Receiver.js:139:10)
     at Socket._ultron.on (/var/app/node_modules/ws/lib/WebSocket.js:142:22)
     at emitOne (events.js:96:13)
     at Socket.emit (events.js:188:7)
     at readableAddChunk (_stream_readable.js:176:18)
     at Socket.Readable.push (_stream_readable.js:134:10)
     at TCP.onread (net.js:548:20)
xbill82 commented 7 years ago

Tried with 1000 users. Unable to reproduce.

scottinet commented 7 years ago

Still able to reproduce it.

Here are some of the logs I get, using the exact snippet from the initial post:

2017-07-24T07:57:02+00:00 [LOG:ERROR] (500) [es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@d4266ce on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6d5792d7[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 46]]
InternalError: [es_rejected_execution_exception] rejected execution of org.elasticsearch.transport.TransportService$7@d4266ce on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6d5792d7[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 46]]
    at respond (/var/app/node_modules/elasticsearch/src/lib/transport.js:295:15)
    at checkRespForFailure (/var/app/node_modules/elasticsearch/src/lib/transport.js:254:7)
    at HttpConnector.<anonymous> (/var/app/node_modules/elasticsearch/src/lib/connectors/http.js:159:7)
    at IncomingMessage.bound (/var/app/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
    at emitNone (events.js:91:20)
    at IncomingMessage.emit (events.js:185:7)
    at endReadableNT (_stream_readable.js:974:12)
    at _combinedTickCallback (internal/process/next_tick.js:80:11)
    at process._tickDomainCallback (internal/process/next_tick.js:128:9)

And we also get a lot of these messages:

(node:78) Warning: a promise was created in a handler at /var/app/lib/api/controllers/funnelController.js:213:30 but was not returned from it, see http://goo.gl/rRqMUw
     at Function.Promise.cast (/var/app/node_modules/bluebird/js/release/promise.js:196:13)
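
For context, Bluebird emits this warning when a promise is created inside a .then() handler without being returned, so it escapes the chain and its rejection cannot be caught upstream. A minimal standalone illustration of the pattern, not Kuzzle code (doSomethingAsync is a hypothetical helper):

const Bluebird = require('bluebird');

// Enable Bluebird's warnings explicitly so the message is printed.
Bluebird.config({ warnings: true });

// Hypothetical async helper, only for illustration.
function doSomethingAsync() {
  return Bluebird.delay(10);
}

Bluebird.resolve()
  .then(() => {
    // Created but not returned: triggers the "promise was created in a
    // handler but was not returned from it" warning.
    doSomethingAsync();
  });

// Returning the promise keeps it in the chain and silences the warning:
Bluebird.resolve()
  .then(() => doSomethingAsync());
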
ballinette commented 7 years ago

I do not get the same trace, but I get errors like these:

ERROR: unhandledRejection: not opened Error: not opened
    at WebSocket.send (/var/app/node_modules/ws/lib/WebSocket.js:358:18)
    at emit (/var/app/lib/services/broker/wsBrokerClient.js:335:31)
    at WSBrokerClient.send (/var/app/lib/services/broker/wsBrokerClient.js:310:12)
    at kuzzle.funnel.execute (/var/app/lib/api/core/entrypoints/proxy/index.js:113:18)
    at FunnelController._executeError (/var/app/lib/api/controllers/funnelController.js:491:7)
    at checkRights.then.then.catch.err (/var/app/lib/api/controllers/funnelController.js:214:26)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
From previous event:
    at FunnelController.execute (/var/app/lib/api/controllers/funnelController.js:214:13)
    at ProxyEntryPoint.onRequest (/var/app/lib/api/core/entrypoints/proxy/index.js:95:24)
    at proxy.listen.data (/var/app/lib/api/core/entrypoints/proxy/index.js:56:51)
    at handlers.(anonymous function).forEach.cb (/var/app/lib/services/broker/wsBrokerClient.js:190:47)
    at Array.forEach (native)
    at WebSocket.client.socket.on.raw (/var/app/lib/services/broker/wsBrokerClient.js:190:33)
    at emitOne (events.js:96:13)
    at WebSocket.emit (events.js:188:7)
    at Receiver._receiver.onmessage (/var/app/node_modules/ws/lib/WebSocket.js:146:47)
    at Receiver.dataMessage (/var/app/node_modules/ws/lib/Receiver.js:389:14)
    at Receiver.getData (/var/app/node_modules/ws/lib/Receiver.js:330:12)
    at Receiver.startLoop (/var/app/node_modules/ws/lib/Receiver.js:165:16)
    at Receiver.add (/var/app/node_modules/ws/lib/Receiver.js:139:10)
    at Socket._ultron.on (/var/app/node_modules/ws/lib/WebSocket.js:142:22)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)
ballinette commented 7 years ago

OK, my traces were due to a bad configuration.

Now, with the latest Docker images and the 1.x branch updated, I cannot reproduce the bug, even with 1000 documents in a row.

In one test I did get some failed-connection errors between Kuzzle and Elasticsearch (not reproduced later either), but Kuzzle did not crash.

ballinette commented 7 years ago

Waiting to retest the scenario on Seb's instance when we have the time to do it.

ballinette commented 7 years ago

OK, reproduced when I decrease the queue_size of the search thread pool in the Elasticsearch settings:

thread_pool:
    search:
        queue_size: 20
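
As a side note, when tuning these queue sizes it can help to check whether ES is actually rejecting executions. A minimal sketch using the same legacy elasticsearch client that appears in the stack traces above (the host is an assumption, adjust it to your setup):

const elasticsearch = require('elasticsearch');

// Assumed local ES node; adjust the host to your deployment.
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Prints one line per thread pool (bulk, search, ...) with its current
// queue depth and the number of rejected executions.
client.cat.threadPool({ v: true, h: 'name,active,queue,rejected' })
  .then(table => console.log(table))
  .catch(err => console.error(err));
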
scottinet commented 7 years ago

That's normal, since Kuzzle's default configuration allows it to handle up to 50 concurrent requests. And it would be nice to fix the diagtools trigger when that happens.

But what I don't get is why I can reproduce this with 100% reliability while neither you nor @xbill82 can, with a 200-request queue configured in ES.

I've enabled debug logs in Elasticsearch, and though the error message states that the bulk thread pool is saturated, the rejected requests are indeed user document creations.

This would suggest that Kuzzle doesn't chain requests correctly and can perform n requests to ES for each user request, but I'm still looking for the code responsible for this.

benoitvidis commented 7 years ago

Reproducing the issue too.

My first assumption is that, when using the wait_for option, create calls actually go through the bulk thread pool. I will check in the ES source code.
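
For clarity, the wait_for option discussed here maps to Elasticsearch's refresh=wait_for parameter. A rough sketch of that kind of call with the legacy elasticsearch client (index, type and body are illustrative, not Kuzzle internals):

const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({ host: 'localhost:9200' });

// refresh: 'wait_for' makes the call resolve only once the new document is
// visible to search, instead of forcing or skipping the refresh.
client.create({
  index: 'myindex',          // illustrative names, not Kuzzle internals
  type: 'mycollection',
  id: 'foo42',
  refresh: 'wait_for',
  body: { foo: 'bar' }
})
  .then(res => console.log(res.result))
  .catch(err => console.error(err));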

I also traced requests sent to ES. Each user creation generates 4 requests:

  1. _search
  2. GET
  3. _search
  4. _create

One of the 2 searches can probably be skipped (they use the same parameters), but this does not explain this particular issue, as _search and _write use different thread pools.
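
The comment does not say how the requests were traced; one simple way to observe this kind of traffic in general is the legacy elasticsearch client's built-in trace logging, which prints every request it performs as a reproducible curl command. A sketch, assuming a local node (not necessarily how it was done here):

const elasticsearch = require('elasticsearch');

// log: 'trace' makes the client print every request and response it
// performs, formatted as curl commands.
const client = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'trace'
});

// Any call made through this client is now logged in full, e.g.:
client.search({
  index: 'myindex',          // illustrative index
  body: { query: { match_all: {} } }
})
  .catch(err => console.error(err.message));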

benoitvidis commented 7 years ago

I did not drill long enough into the ES code to confirm or refute the above assumption.

I've come to the conclusion that the exception is normal when using the wait_for option.

I just tested a regular document:create loop with and without the wait_for option, and even though I had to increase the loop count, the results are pretty clear: we do get exceptions when using wait_for, while we don't without it.
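
For reference, a sketch of the kind of loop described, built on the same SDK flavor as the snippet in the initial post (collection/index names are illustrative, and the createDocumentPromise signature and refresh option are assumed from that SDK version):

const Kuzzle = require('kuzzle-sdk');
const Bluebird = require('bluebird');

const kuzzle = new Kuzzle('localhost', err => {
  if (err) {
    console.error(err);
    process.exit(1);
  }

  const collection = kuzzle.collection('mycollection', 'myindex'); // illustrative names
  const promises = [];

  for (let i = 0; i < 1000; i++) {
    // Toggle the refresh option to compare: rejections show up with
    // 'wait_for' and disappear without it (per the observation above).
    promises.push(collection.createDocumentPromise(
      `doc${i}`,
      { foo: 'bar' },
      { refresh: 'wait_for' }
    ));
  }

  Bluebird.all(promises)
    .then(() => process.exit(0))
    .catch(e => {
      console.error(e);
      process.exit(1);
    });
});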

The only real problem I see here is that the error message is sent to the end user and does not appear in Kuzzle's error logs. This is OK for most ES errors, but this particular case may deserve to be more visible to admins/developers, as I suppose there could be cases where modifying the thread pool queue sizes would not be a problem.