moleculerjs / moleculer

:rocket: Progressive microservices framework for Node.js
https://moleculer.services/
MIT License

metrics make requests 10x slower #313

Closed roytan883 closed 6 years ago

roytan883 commented 6 years ago

The issue scenario is: apiGateway.rest -> auth.authorize -> demo.welcome -> demo.hello. Every step is very simple, so the cost should be almost entirely network.

If I enable metrics: true in each service process, a request takes about 120ms in total. But if I disable metrics, it only takes about 10ms.

Does metrics internally use event broadcasting? Even so, it should not impact normal performance so badly.
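For context, the only broker option being toggled in this comparison is `metrics`. A sketch of the options object (config fragment; the nodeID is taken from the traces below, everything else is assumed):

```javascript
// moleculer ServiceBroker options (config fragment, not a full program)
module.exports = {
  nodeID: "rts-apiGateway-dev-88-0",
  transporter: "NATS",
  metrics: true // true: ~120ms per request; false: ~10ms
};
```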

roytan883 commented 6 years ago

Each step's duration looks good, only a few ms. But why does the final apiGateway.rest report such a long duration:

INFO -> metrics traceStart:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  remoteCall: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  remoteCall: true,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6ba3c7bc-a194-43c2-8f34-34380de6d60c',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761271,
  endTime: 1530602761273.025,
  duration: 2.024912,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'auth.authorize' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-auth-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  remoteCall: true,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceStart:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  remoteCall: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '47682a4f-6c45-4c98-9130-9372dfecb748',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 3,
  startTime: 1530602761314,
  endTime: 1530602761315.0862,
  duration: 1.086225,
  remoteCall: false,
  fromCache: false,
  meta: {},
  action: { name: 'demo.hello' },
  parent: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  nodeID: 'rts-demo-dev-88-0' }
INFO -> metrics traceFinish:  { id: '6c079ee3-0f1a-45fc-8c9d-fc2f5bd23ce4',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 2,
  startTime: 1530602761314,
  endTime: 1530602761315.9263,
  duration: 1.92624,
  remoteCall: true,
  fromCache: false,
  meta: {},
  action: { name: 'demo.welcome' },
  parent: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  nodeID: 'rts-demo-dev-88-0',
  callerNodeID: 'rts-apiGateway-dev-88-0' }
INFO -> metrics traceFinish:  { id: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  requestID: '1e9f1a84-85eb-466c-9aae-223abf74acac',
  level: 1,
  startTime: 1530602761235,
  endTime: 1530602761354.5188,
  duration: 119.518744,
  remoteCall: false,
  fromCache: false,
  action: { name: 'apiGateway.rest' },
  nodeID: 'rts-apiGateway-dev-88-0' }
icebob commented 6 years ago

Which versions?

Btw, please use issue templates when you open a new issue.

icebob commented 6 years ago

I've just measured this demo with JMeter:
With metrics: false: ~10k req/sec
With metrics: true: ~8k req/sec

roytan883 commented 6 years ago

I think I found the problem. If I start the process which listens for those "metrics" events (metrics.trace.span.start/finish), the whole system slows to ~120ms per request (the event handler is very simple; I even tried an empty function). If I stop the listener process, the whole system goes back to the normal ~10ms per request.

roytan883 commented 6 years ago

Version: 0.12.8

icebob commented 6 years ago

And is the listener on the same node, or on a remote node?

roytan883 commented 6 years ago

A remote node; every service in this issue runs in a separate node process.

icebob commented 6 years ago

Ok, I will investigate it.

roytan883 commented 6 years ago

By the way, the transporter is NATS.

icebob commented 6 years ago

What I measured:
API gateway with metrics, without listener: ~8k req/sec
API gateway with metrics, with a remote listener: ~3k req/sec

Btw, it was just a simple "test.hello" action.

roytan883 commented 6 years ago

it was just a simple "test.hello" action. a remote listener: 3k req/sec

Yes, that's the problem. My scenario has apiGateway.rest -> auth.authorize -> demo.welcome -> demo.hello; with more remote actions, it's much slower.

Maybe you should also test its latency?

icebob commented 6 years ago

Could you make a repro repo?

roytan883 commented 6 years ago

You just make a request like this: client -> apiGateway(test.hello) -> test.hello -> ctx.call test1.hello1 -> ctx.call test2.hello2. Place test/test1/test2 on separate nodes, and have another node listen for the metrics events.

Yes, I can create a repo. But do you need that? You already measured that a remote node listening for metrics makes performance much worse.

icebob commented 6 years ago

It's reasonable that the metrics feature costs some performance, which increases the response time. Unfortunately, in your case it's a huge difference, because every action call generates 2 extra events that need to be transferred to another node, and that causes the overhead. The bad news is that I can't fix it currently, because it is not a bug; it's a design issue.

In version 0.14, I will rewrite the whole metrics & tracing logic to be more efficient and pluggable. Meanwhile, as a workaround, you can disable this built-in metrics feature and write a middleware which traces action calls and sends them to a central server (bypassing the Moleculer message bus).
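A minimal sketch of such a workaround middleware, assuming the function-style middleware signature of the 0.12-era API (`reportSpan` is a hypothetical function you would implement yourself to push spans to your own collector, bypassing the Moleculer bus):

```javascript
// Sketch of a tracing middleware: wraps every local action handler,
// measures its duration, and hands a span object to a user-supplied
// reporter (which could e.g. POST to a collector, instead of emitting
// broker events).
function tracingMiddleware(reportSpan) {
  return function (handler, action) {
    return function (ctx) {
      const start = Date.now();
      const finish = (error) => reportSpan({
        action: action.name,
        requestID: ctx.requestID,
        duration: Date.now() - start,
        error
      });
      return handler(ctx).then(res => {
        finish(null);
        return res;
      }, err => {
        finish(err.message);
        throw err;
      });
    };
  };
}
```

It would be registered with something like `broker.use(tracingMiddleware(mySender))`; the key point is that span delivery no longer rides on the same transporter as the action calls.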

roytan883 commented 6 years ago

Maybe it's not about the metrics. I dug into the source code; the metrics just call broker.emit. So I made a test with plain event emits, without listening to (metrics.trace.span.start/finish).

In listener node:

      events: {
        // "metrics.trace.span.start": this.traceStart,
        // "metrics.trace.span.finish": this.traceFinish,
        "event.test.t1": this.t1,
        "event.test.t2": this.t2,
      },

  async t1(info) {
    this.logger.info(`${this.name} t1: `, util.inspect(info));
  },
  async t2(info) {
    this.logger.info(`${this.name} t2: `, util.inspect(info));
  }

in auth:

  async authorize(ctx) {
    this.broker.emit("event.test.t1", {
      a: "abc",
      b: 123,
      c: true
    })
    console.log("call authorize ctx.params = ", ctx.params)
    if (ctx.params.action.indexOf('$node') >= 0) { // disable calls to $node.*
      return false
    }
    return true
  }

in demo:

  async welcome(ctx) {
    this.broker.emit("event.test.t2", {
      a: "abc",
      b: 123,
      c: true
    })
    // await ctx.call('demo.hello')
    return this._sayWelcome(ctx.params.name);
  }

The result is:
1. not sending event.test.t1 and event.test.t2: 8.396ms
2. sending event.test.t1: 51.806ms
3. sending event.test.t1 and event.test.t2: 81.931ms

roytan883 commented 6 years ago

A simple event takes 30~40ms; that doesn't make sense. An event should not be much slower than an action.

roytan883 commented 6 years ago

I guess the problem is not about how the metrics are generated. It is about how to send events without impacting action performance.

icebob commented 6 years ago

Hmm, it's weird. Please create a dev example file with 3 brokers & services & NATS like here and I will try to investigate it.

roytan883 commented 6 years ago

OK, I will create a dev example file.

roytan883 commented 6 years ago

OK, finished. The result is:

call demo1.hello1 at:  1530627615760
call demo2.hello2 at:  1530627615761
call demo3.hello3 at:  1530627615762
call demo1.hello1 use time =  6
call demo1.hello1WithEvent at:  1530627615765
listen t1 at:  1530627615765
call demo2.hello2WithEvent at:  1530627615804
listen t2 at:  1530627615805
call demo3.hello3WithEvent at:  1530627615844
listen t3 at:  1530627615844
call demo1.hello1WithEvent use time =  120

The test code is:

const { ServiceBroker } = require("moleculer");

let natsTransporter = {
  type: "NATS",
  options: {
    url: "nats://127.0.0.1:4222",
  }
}

// Create broker Client
const brokerClient = new ServiceBroker({
  namespace: "streaming",
  nodeID: "client",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #1
const broker1 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo1",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #2
const broker2 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo2",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create broker #3
const broker3 = new ServiceBroker({
  namespace: "streaming",
  nodeID: "demo3",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

// Create brokerListen
const brokerListen = new ServiceBroker({
  namespace: "streaming",
  nodeID: "listen",
  transporter: natsTransporter,
  logger: console,
  logLevel: "info"
});

broker1.createService({
  name: "demo1",
  actions: {
    hello1(ctx) {
      console.log("call demo1.hello1 at: ", Date.now())
      return ctx.call('demo2.hello2')
    },
    hello1WithEvent(ctx) {
      broker1.emit('t1', { a: 'abc', b: 123, c: true })
      console.log("call demo1.hello1WithEvent at: ", Date.now())
      return ctx.call('demo2.hello2WithEvent')
    }
  }
});

broker2.createService({
  name: "demo2",
  actions: {
    hello2(ctx) {
      console.log("call demo2.hello2 at: ", Date.now())
      return ctx.call('demo3.hello3')
    },
    hello2WithEvent(ctx) {
      broker2.emit('t2', { a: 'abc', b: 123, c: true })
      console.log("call demo2.hello2WithEvent at: ", Date.now())
      return ctx.call('demo3.hello3WithEvent')
    }
  }
});

broker3.createService({
  name: "demo3",
  actions: {
    hello3(ctx) {
      console.log("call demo3.hello3 at: ", Date.now())
      return "hello from demo3"
    },
    hello3WithEvent(ctx) {
      broker3.emit('t3', { a: 'abc', b: 123, c: true })
      console.log("call demo3.hello3WithEvent at: ", Date.now())
      return "hello from hello3WithEvent"
    }
  }
});

brokerListen.createService({
  name: "listen",
  events: {
    t1: (ctx) => {
      console.log("listen t1 at: ", Date.now())
    },
    t2: (ctx) => {
      console.log("listen t2 at: ", Date.now())
    },
    t3: (ctx) => {
      console.log("listen t3 at: ", Date.now())
    },
  }
});

brokerClient.Promise.all([brokerClient.start(), broker1.start(), broker2.start(), broker3.start(), brokerListen.start()])
  .delay(2000)
  .then(async () => {

    let startTime = Date.now()
    await brokerClient.call('demo1.hello1')
    let endTime = Date.now()
    console.log("call demo1.hello1 use time = ", endTime - startTime)

    startTime = Date.now()
    await brokerClient.call('demo1.hello1WithEvent')
    endTime = Date.now()
    console.log("call demo1.hello1WithEvent use time = ", endTime - startTime)

    // startTime = Date.now()
    // await brokerClient.call('demo1.hello1WithEvent')
    // endTime = Date.now()
    // console.log("call demo1.hello1WithEvent again use time = ", endTime - startTime)

  });
icebob commented 6 years ago

Thanks, I will investigate it tonight.

icebob commented 6 years ago

It's getting weirder. This is the result on my PC:

call demo1.hello1 at:  1530633313582
call demo2.hello2 at:  1530633313585
call demo3.hello3 at:  1530633313587
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530633313596
listen t1 at:  1530633313599
call demo2.hello2WithEvent at:  1530633313601
listen t2 at:  1530633313603
call demo3.hello3WithEvent at:  1530633313604
listen t3 at:  1530633313606
call demo1.hello1WithEvent use time =  15
icebob commented 6 years ago

Node version, OS, architecture, etc.?

icebob commented 6 years ago

I could reproduce it on Ubuntu:

call demo1.hello1 at:  1530644013498
call demo2.hello2 at:  1530644013501
call demo3.hello3 at:  1530644013503
call demo1.hello1 use time =  17
call demo1.hello1WithEvent at:  1530644013513
listen t1 at:  1530644013514
call demo2.hello2WithEvent at:  1530644013560
listen t2 at:  1530644013561
call demo3.hello3WithEvent at:  1530644013604
listen t3 at:  1530644013604
call demo1.hello1WithEvent use time =  139
icebob commented 6 years ago

It's very mysterious!
With NATS: ~8 rps (~125ms)
With Redis: ~500 rps (~2ms)
With TCP: ~1700 rps (~0.5ms)
(attached chart: linux-transporter-emits)

And it's only on Linux. On Windows it's ~1500-2000 rps with all transporters. Do you have any idea? Maybe it's a NATS client problem, because it can send messages in parallel?

Dev test file: https://github.com/moleculerjs/moleculer/blob/master/dev/emit-latency.js

roytan883 commented 6 years ago

Have you tried a different NATS client library version?

icebob commented 6 years ago

I will, but currently I can't.

icebob commented 6 years ago

I tried with 0.8, 0.7...same result.

AndreMaz commented 6 years ago

Same thing here.

call demo1.hello1 at:  1530693198252
call demo2.hello2 at:  1530693198252
call demo3.hello3 at:  1530693198253
call demo1.hello1 use time =  8
call demo1.hello1WithEvent at:  1530693198255
listen t1 at:  1530693198255
call demo2.hello2WithEvent at:  1530693198303
listen t2 at:  1530693198303
call demo3.hello3WithEvent at:  1530693198348
listen t3 at:  1530693198349
call demo1.hello1WithEvent use time =  138

Info about my system:

Kubuntu 18.04 64-bit, kernel 4.15.0-24
CPU: Intel i7 6700, RAM: 16GB
Node.js: 8.11.3
NATS server: 1.1.0

icebob commented 6 years ago

Thanks @AndreMaz

roytan883 commented 6 years ago

After a long time digging and testing, I found the reason. I don't know if it's a NATS or Node.js bug. The Node.js docs say noDelay defaults to true, but it's not:

socket.setNoDelay([noDelay]): Disables the Nagle algorithm. By default TCP connections use the Nagle algorithm; they buffer data before sending it off. Setting true for noDelay will immediately fire off data each time socket.write() is called. noDelay defaults to true.

Just modify this file at: node_modules/nats/lib/nats.js

// Create the stream
this.stream = net.createConnection(this.url.port, this.url.hostname);
this.stream.setNoDelay(true); // need to call it; somehow the default of 'true' does not work

After manually adding setNoDelay(true), it works fine on Ubuntu:

call demo1.hello1 at:  1530870151475
call demo2.hello2 at:  1530870151477
call demo3.hello3 at:  1530870151478
call demo1.hello1 use time =  8
call demo1.hello1WithEvent startTime =  1530870151480
call demo1.hello1WithEvent at:  1530870151481
listen e1 at:  1530870151482
call demo2.hello2WithEvent at:  1530870151483
listen e2 at:  1530870151484
call demo3.hello3WithEvent at:  1530870151484
listen e3 at:  1530870151485
call demo1.hello1WithEvent endTime =  1530870151486
call demo1.hello1WithEvent use time =  6

test environment:

moleculer: 0.12.8
nodejs: 8.11.1
node-nats: 1.0.0
gnatsd: 1.2.0
system: ubuntu 16.04 server
icebob commented 6 years ago

Wow, nice catch! It could explain the difference between OSs. So on Windows noDelay is effectively true, while on Linux it's not.

icebob commented 6 years ago

And it's the reason why there was no problem with the Redis transporter; ioredis already calls setNoDelay: https://github.com/luin/ioredis/blob/622975d9d454eca6a9c495d35f5638d68e2662f2/lib/redis.js#L311

icebob commented 6 years ago

Btw, I can add a workaround: always call setNoDelay in the transporter code when the client connects.

roytan883 commented 6 years ago

@icebob Maybe you should also listen for the connect and reconnect events from NATS, because on reconnect the underlying stream (socket) object will be a new one.

    client.on("connect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);
    });

    client.on("reconnect", () => {
        this.client = client;
        if (client.stream)
            client.stream.setNoDelay(true);
    });
icebob commented 6 years ago

Yes, good idea!

aricart commented 6 years ago

The nats package on npm already has a fix for this; you may want to just update and remove the patching.

roytan883 commented 6 years ago

@aricart Thank you for fixing this

aricart commented 6 years ago

@roytan883 yw!!