hapijs / good

hapi process monitoring

Performance of good seems really bad #583

Closed kswope closed 5 years ago

kswope commented 6 years ago

Sorry I couldn't resist the title

Adding the good plugin drops my toy web app's performance by about 50% compared to my own logging, even if I remove the reporters option.

Simply replacing this

server.events.on( 'response', function( request ) {                                                     
  console.log(Date.now() + ': ' +                                                                       
     request.info.remoteAddress + ': ' +                                                                 
     request.method.toUpperCase() + ' ' +                                                                
     request.url.path + ' --> ' +                                                                        
     request.response.statusCode );                                                                      
} );

with this

await server.register( {                                                                                
  plugin: Good,                                                                                         
} ) 

Drops performance from

Running 1m test @ http://localhost:3000
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.79ms    3.94ms   72.60ms   93.46%
    Req/Sec    94.81      15.67    240.00    81.94%
  Latency Distribution
     50%    9.76ms
     75%   10.80ms
     90%   12.76ms
     99%   25.39ms
  56731 requests in 1.00m, 13.31MB read
Requests/sec:    943.98
Transfer/sec:    226.78KB

to

Running 1m test @ http://localhost:3000
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    18.35ms    4.41ms   88.02ms   91.22%
    Req/Sec    54.87       8.42    101.00    87.83%
  Latency Distribution
     50%   16.90ms
     75%   18.25ms
     90%   21.69ms
     99%   34.23ms
  32937 requests in 1.00m, 7.73MB read
Requests/sec:    548.12
Transfer/sec:    131.68KB

I'm sure good does all kinds of things behind the scenes, but this just doesn't look right.

https://gist.github.com/kswope/83af52730802f1e4ae786b76e5741b34

AdriVanHoudt commented 6 years ago

You are right, Good does do way more, check https://github.com/hapijs/good/blob/master/lib/monitor.js to get an idea of what it is doing. If you have perf recommendations they would be very welcome!

arb commented 6 years ago

You may want to check out https://getpino.io/#/ if speed is all you're looking for. I agree with @AdriVanHoudt if you have any perf recommendations, I'd be very interested in hearing them.
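Not from the thread, but as a rough illustration of that suggestion: kswope's response listener above could write through pino instead of console.log, assuming hapi 17's response event and pino's default stdout destination.

const pino = require('pino');
const logger = pino(); // newline-delimited JSON to stdout, built for low overhead

server.events.on('response', (request) => {
  logger.info({
    remoteAddress: request.info.remoteAddress,
    method: request.method,
    path: request.url.path,
    statusCode: request.response.statusCode,
  }, 'response sent');
});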

frankthelen commented 6 years ago

I tried to reproduce this but I can't. From what I see, there is no noteworthy negative effect from using Good logging. Here is what I did.

Components used: Node v8.9.4, hapi, good (with good-squeeze and good-console), and artillery 1.6.0-14.

My Hapi server has only one route, which does nothing for exactly 100 ms. It's running on http://localhost:3000 on Node v8.9.4 (not using nodemon or such) on my Mac. I am comparing no logging with Good console request logging by simply commenting the line await server.register(good); in or out in the code below.

const Good = require('good');
const Hapi = require('hapi');

const server = new Hapi.Server({
  port: 3000,
});

server.route({
  method: 'GET',
  path: '/',
  handler: async function(request) {
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve('OK');
      }, 100);
    });
  },
});

const good = {
  plugin: Good,
  options: {
    reporters: {
      myConsoleReporter: [{
        module: 'good-squeeze',
        name: 'Squeeze',
        args: [{ response: '*' }]
      }, {
        module: 'good-console'
      }, 'stdout'],
    },
  },
};

const init = async () => {
  try {
    // await server.register(good); // <<<
    await server.start();
  } catch (error) {
    console.error(error);
    process.exit(1);
  }
};

init();

I am using artillery 1.6.0-14 for load testing. artillery quick -c 100 -n 1000 http://localhost:3000 creates 100 "virtual users", each of which sends 1000 HTTP GET requests. I started with fewer, but this seems to be what @kswope was doing.

Here are the results. There is basically no difference. I tried a couple of times.

(1) No logging

artillery quick -c 100 -n 1000 http://localhost:3000
Started phase 0, duration: 2s @ 20:39:03(+0100) 2018-03-10
...
All virtual users finished
Summary report @ 20:41:00(+0100) 2018-03-10
  Scenarios launched:  100
  Scenarios completed: 100
  Requests completed:  100000
  RPS sent: 851.21
  Request latency:
    min: 99.6
    max: 184.5
    median: 106.9
    p95: 118.6
    p99: 133.3
  Scenario counts:
    0: 100 (100%)
  Codes:
    200: 100000

(2) With Good logging (see above)

artillery quick -c 100 -n 1000 http://localhost:3000
Started phase 0, duration: 2s @ 20:42:03(+0100) 2018-03-10
...
All virtual users finished
Summary report @ 20:43:55(+0100) 2018-03-10
  Scenarios launched:  100
  Scenarios completed: 100
  Requests completed:  100000
  RPS sent: 893.73
  Request latency:
    min: 99.5
    max: 196.3
    median: 104.4
    p95: 114.7
    p99: 128.2
  Scenario counts:
    0: 100 (100%)
  Codes:
    200: 100000

Still wondering what caused @kswope's results. Cannot see anything suspicious in the Good code. Can someone please double-check?

kswope commented 6 years ago

@frankthelen

You replaced my simple database query with a 100ms timeout. Maybe that would be accurate if you were simulating a slow database connection.

frankthelen commented 6 years ago

Yes, I wanted to avoid any other potential influences. The 100ms timeout was supposed to simulate an asynchronous task; it could be a database request or anything else. This way we can be sure that the load doesn't affect this parameter.

kswope commented 6 years ago

That's a huge bottleneck; even a very broken logger would look good with that in the mix. My point was that Good (without even logging) was so much slower than everything else I tried. I thought it was fair to add a simple database query to the benchmark, since it's not that uncommon in a web app.

frankthelen commented 6 years ago

Ok, I see what you mean. Adding a database query to the mix limits the available resources on my machine so that the three processes (db server, hapi/good, load generator) more obviously run into bottleneck problems. Ok, let's try that again. I replaced the 100ms dummy task with a simple but real database query (pg on localhost) and looked at the Hapi/Good Node process, comparing two scenarios: (1) Hapi without Good and (2) Hapi with Good but without any actual logging (no reporters). I ran the same load test artillery quick -c 10 -n 50000 http://localhost:3000 against both scenarios and collected profiling data with node --prof for both. Here's the result:

(1) Hapi w/o Good

 104521 CPU ticks in total, thereof
  27976 (26.8%) JavaScript
  67279 (64.4%) C++
   1868 (1.8%) GC

(2) Hapi with Good (but no reporters)

 162826 CPU ticks in total, thereof
  49353 (30.3%) JavaScript
 100241 (61.6%) C++
   4332 (2.7%) GC

Yes, (2) needs about 55% more CPU ticks than (1). Hmmm. Unfortunately, I am not really experienced in reading such profile outputs. I don't see anything obvious except that the numbers in various places are significantly higher. Will have a closer look.
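The database-backed handler itself was not posted; a minimal sketch of the kind of route described above, assuming the pg module and a local Postgres instance reachable through the standard PG* environment variables, could look like this:

const { Pool } = require('pg');
const pool = new Pool(); // reads PGHOST/PGUSER/PGDATABASE etc. from the environment

server.route({
  method: 'GET',
  path: '/',
  handler: async () => {
    // a trivial but real round trip to Postgres instead of the 100ms timeout
    const result = await pool.query('SELECT 1 AS ok');
    return result.rows[0];
  },
});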

kswope commented 6 years ago

@frankthelen

Ok, I see what you mean. Adding a database query to the mix limits the available resources on my machine so that the three processes (db server, hapi/good, load generator) more obviously run into bottleneck problems.

That has nothing to do with it. You introduced a large "rate-limiting step" that overwhelmed any difference between the scenarios. With a 100ms timeout you could probably play World of Warcraft during the benchmark and get the same results.

The same goes for the comparison between Good with reporters and Good without reporters. Simply adding Good (even without any reporters) cut the performance in half (even with a database query step to make the scenario more realistic).

frankthelen commented 6 years ago

Anyway. :-) The key question remains: what is the cause for this? I have no idea.

frankthelen commented 6 years ago

Ops seems to be the culprit, i.e., oppsy.

Switching off ops results in 104854 CPU ticks in the load test scenario above. That is about the same value as without Good at all. That's good news.

Unfortunately, Good has ops switched on by default. Every 15s, it collects ops information from the Hapi server. You have to actively switch it off, like so:

const Good = require('good');
const good = {
  plugin: Good,
  options: {
    ops: false,
    ...
  },
};
...
await server.register(good);
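
If the ops data is still wanted, good's documented ops.interval option can reportedly be raised instead of disabling ops entirely; a sketch (option name assumed from good's README, untested here):

const good = {
  plugin: Good,
  options: {
    ops: { interval: 60000 }, // collect ops metrics once a minute instead of every 15s
    // reporters: { ... }
  },
};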

@arb I'd recommend the following:

lock[bot] commented 4 years ago

This thread has been automatically locked due to inactivity. Please open a new issue for related bugs or questions following the new issue template instructions.