APM summit

watson commented 8 years ago

_APM stands for Application Performance Management._

We had a really good tracing/APM session at NodeConf Adventure two days ago with many of the APM vendors represented (NodeSource, Dynatrace, AppNeta and Opbeat).

There seemed to be a general agreement that we would all benefit from working closer together. A first step in this process would be to arrange an APM summit and meet up in person. Kind of like the error summit held this January.

It would be most beneficial if we could narrow the scope of the summit as much as possible. I'd like it if the first item on the agenda could be to lay out a roadmap of what we would like to achieve, but please pitch in below.

I suggest that we have the summit at NodeSummit in San Francisco on July 25th (the day before the conference starts). I've heard they have an extra meeting room that we might be able to borrow (I'll follow up with more details).

Here are some of the notes from the NodeConf Adventure session:

APM is hard to get right in Node.js (lots of monkey patching, lots of edge cases, lots of unsolved issues)
Callback queues in user-land modules are especially hard (think generic-pool)
Maybe we should formalise a generic tracing protocol for user-land modules to use if they want to be easily traceable
Everyone keeps reinventing the wheel
Part of the problem space we deal with every day, might be better solved in Node core
A good first step would be to create a roadmap of what we as a group want to achieve by working together
Having regular in-person summits would be of great value and help speed things up (this is how TC39 gets all their stuff done)
The foundation might be able to help pay to get key people to attend who can't get their employer to sponsor

I most likely forgot some of what we discussed, so please add your comments below. In fact, all comments are highly appreciated 😃

Action needed:

Please fill in this Doodle if you want to attend the APM Summit and mark the dates / locations you are able to attend: http://doodle.com/poll/utqxycqki8chyddd

/cc @othiym23 @brycebaril @danielkhan @groundwater @qard @dshaw

AndreasMadsen commented 8 years ago

What does the acronym APM mean? "Asynchronous Programming Model" doesn't make sense to me in this context.

ofrobots commented 8 years ago

@watson Thanks for posting this. An APM summit would be great! Please count me and @matthewloring in as well. It would be great if we could formalize a generic APM tracing protocol.

Insofar as context loss due to user-land queuing is concerned, I started writing this simple module that could be a good starting point: https://github.com/ofrobots/context-is-everything. The basic idea is that this could be a central protocol that context observers (APM modules, continuation-local-storage, etc.) and user-land queuing modules (mongodb, mysql, redis, grpc, etc.) could both sign up to in order to propagate async context. Here's an example patch showing how continuation-local-storage could work with this module: https://github.com/ofrobots/node-continuation-local-storage/commit/8fca4138d7d50b2c9989859fcccf546cf10cf98b.
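To make the context-loss problem concrete, here is a minimal sketch of the idea. It is not the API of context-is-everything: `currentContext`, `runInContext`, `bindToCurrentContext`, `TinyQueue`, and `observe` are hypothetical names invented for this illustration.

```javascript
'use strict';

// Hypothetical context holder: a context observer (e.g. an APM agent)
// would set this while a request is being handled.
let currentContext = null;

// Run fn with ctx as the active context, restoring the previous one after.
function runInContext(ctx, fn) {
  const previous = currentContext;
  currentContext = ctx;
  try {
    return fn();
  } finally {
    currentContext = previous;
  }
}

// The hook a queuing module would call: capture the context that is
// active at enqueue time and restore it when the callback finally runs.
function bindToCurrentContext(callback) {
  const captured = currentContext;
  return function bound(...args) {
    return runInContext(captured, () => callback.apply(this, args));
  };
}

// A toy user-land queue (think of a connection pool). Callbacks run
// later, from whatever context happens to drain the queue.
class TinyQueue {
  constructor({ propagate }) {
    this.propagate = propagate;
    this.pending = [];
  }
  push(callback) {
    this.pending.push(this.propagate ? bindToCurrentContext(callback) : callback);
  }
  drain() {
    const callbacks = this.pending;
    this.pending = [];
    for (const cb of callbacks) cb();
  }
}

// Enqueue under context A, drain under context B, and report which
// context the callback observed.
function observe(propagate) {
  const queue = new TinyQueue({ propagate });
  let seen;
  runInContext({ requestId: 'A' }, () => {
    queue.push(() => { seen = currentContext; });
  });
  runInContext({ requestId: 'B' }, () => queue.drain());
  return seen && seen.requestId;
}

console.log(observe(false)); // prints B: the drainer's context leaks in
console.log(observe(true));  // prints A: the enqueuer's context is restored
```

The point of a shared protocol is that the queue only has to call one well-known hook at enqueue time, and any number of observers can rely on it without patching the queue themselves.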

@AndreasMadsen APM stands for Application Performance Management.

hmdhk commented 8 years ago

Zone.js provides similar functionality but at the Node.js API level. It is far from comprehensive, but their new API is interesting.

mhdawson commented 8 years ago

Sounds good to me as well. @tobespc and @mchamberlain FYI

danielkhan commented 8 years ago

Thanks for putting that all together @watson - I'm obviously in as well.

rvagg commented 8 years ago

iirc ES Modules presents some important challenges for APM, mainly due to their static nature and the standard APM monkey-patching pattern, that should probably be on the agenda unless I'm not remembering the details correctly. It's really important that we have APM in the Modules discussion so we can move forward without leaving a massive chunk of our tooling ecosystem behind.
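For readers unfamiliar with the pattern, here is a minimal sketch of the monkey-patching mentioned above: an agent swaps a function on a module's exports object for a timing wrapper. `fakeDriver`, `patchMethod`, and `timings` are hypothetical names for illustration, and the sketch assumes the patched function takes a Node-style callback as its last argument.

```javascript
'use strict';

// Stand-in for a module's exports object (hypothetical; a real agent
// would patch the object returned by require() for the target module).
const fakeDriver = {
  query(sql, callback) {
    setImmediate(() => callback(null, [`result of ${sql}`]));
  },
};

const timings = [];

// Replace target[name] with a wrapper that measures the time until the
// callback fires, reports it, and then calls the original callback.
function patchMethod(target, name, onTiming) {
  const original = target[name];
  target[name] = function patched(...args) {
    const callback = args.pop(); // assumption: callback is the last argument
    const start = process.hrtime();
    return original.call(this, ...args, (err, result) => {
      const [sec, nsec] = process.hrtime(start);
      onTiming({ name, durationMs: sec * 1e3 + nsec / 1e6 });
      callback(err, result);
    });
  };
  return original;
}

patchMethod(fakeDriver, 'query', (timing) => timings.push(timing));

// Callers are unaware of the patch.
fakeDriver.query('SELECT 1', (err, rows) => {
  console.log(rows, timings);
});
```

This only works because a CommonJS exports object is an ordinary mutable object. An ES module namespace object cannot be reassigned from the outside, which is why the pattern does not carry over directly.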

Qard commented 8 years ago

Indeed. @bmeck has already contacted some of us about ES6 module concerns. It'd be good for us to find a solution together face-to-face.

joshgav commented 8 years ago

Thanks @watson, I'd certainly like to meet you all F2F 👍 /cc @avanderhoorn @nikmd23

> A first step in this process would be to arrange an APM summit and meet up in person.

Could also help to have some open discussions in this repo now. Some topics from this thread which could be issues/topics:

monkey-patching and ES6 module semantics;
protocol for traces (e.g. name, level, object with expected props);
F2F meetings

> Maybe we should formalise a generic tracing protocol for user-land modules to use if they want to be easily traceable.

I didn't specify what the payload objects would look like, but was prototyping architecture and API for this in #50 and joshgav/node-trace.

A few more places we might start from:

http://opentracing.io/
http://diamon.org/
Performance Observer (e.g. https://developers.google.com/web/updates/2016/06/performance-observer)

Also see #53 for the work @matthewloring and @ofrobots are doing.

> Part of the problem space we deal with every day, might be better solved in Node core.

My module referenced above integrated into core: joshgav/node/trace-event-integration.

Seems like putting a trace system in core will be necessary to enable data collection without requiring developers to explicitly opt-in (e.g. by importing a module).

danielkhan commented 8 years ago

In addition to the tracing facilities needed, I think we should also define metrics like event loop timings that should be provided via a potential API.

mcollina commented 8 years ago

👍 for providing internal APIs for event loop timing. I'm currently using http://npm.im/loopbench, which is far from ideal.

megastef commented 8 years ago

+1 for the API to get pre-aggregated metrics from node core and solve various problems for each metric type:

GC time, GC runs, average released memory per GC type - please see https://github.com/nodejs/node/issues/4496 - compiling native packages is often a problem for Windows users ...
Event loop latency - most solutions frequently inject events into the event loop and measure the time until each event is handled. But this puts many more events on the event loop than usual ...
HTTP stats for client (e.g. accessing APIs) and server - typically a monkey-patching adventure ...
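As an illustration of the timer-based approach to event loop latency described in the list above, here is a minimal sketch. `createLoopLatencySampler` is a hypothetical helper written for this example, not an existing API, and it uses `Date.now()` for simplicity.

```javascript
'use strict';

// Sample event loop latency by scheduling a repeating timer and
// measuring how much later than requested each tick actually fires.
function createLoopLatencySampler(intervalMs) {
  const stats = { min: Infinity, max: 0, sum: 0, count: 0 };
  let expected = Date.now() + intervalMs;

  const timer = setInterval(() => {
    const now = Date.now();
    const lag = Math.max(0, now - expected);
    stats.min = Math.min(stats.min, lag);
    stats.max = Math.max(stats.max, lag);
    stats.sum += lag;
    stats.count += 1;
    expected = now + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for sampling

  return {
    snapshot() {
      return {
        min: stats.count ? stats.min : 0,
        max: stats.max,
        avg: stats.count ? stats.sum / stats.count : 0,
        count: stats.count,
      };
    },
    stop() {
      clearInterval(timer);
    },
  };
}

// Usage: sample every 10 ms, block the loop for about 50 ms, then report.
const sampler = createLoopLatencySampler(10);
setTimeout(() => {
  const end = Date.now() + 50;
  while (Date.now() < end) {} // simulate synchronous work
}, 20);
setTimeout(() => {
  console.log(sampler.snapshot());
  sampler.stop();
}, 150);
```

Note that the sampler itself schedules a timer callback every interval, which is exactly the extra event loop load the comment objects to.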

Instead of handling all kinds of events (like GC) and running our own aggregations in user land, it might be much more efficient if stats could be collected in Node core and emitted at a defined interval, e.g. every 10 seconds or once a minute. For example, listening to each GC event (as we do today) triggers the metrics-collecting function many times, while an internal function could just update internal counters/arrays and emit the event once in a while, like process.on('stats', statsListener), resulting in an object providing the most relevant key metrics like this:

{
  gc: {
    full_cycles: {
      duration: 200,
      count: 4,
      releasedMemory: 1024
    },
    sc_cycles: {
      duration: 200,
      count: 4,
      releasedMemory: 1024
    }
  },
  eventloop_latency: {
    min: 0.001,
    max: 10,
    avg: 2
  },
  http_server: {
    requests: 10,
    rx: 1200,
    tx: 500,
    response_time: {
      min: ...,
      max: ...,
      avg: ...
    },
    status: {
      '2xx': 196,
      '3xx': 1,
      '4xx': 2,
      '5xx': 1
    }
  },
  http_client: {
    ...
  },
  udp_stats: {},
  tcp_stats: {},
  fs_stats: {}
}

All values should be reset after emitting the 'stats' event - I've often seen APIs that just count up, so agents collecting this data have to keep the last value and calculate the difference to the current value, another waste of CPU cycles ...
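A user-land sketch of the emit-then-reset behaviour described above might look like the following. `process.on('stats', ...)` is a proposal in this comment, not an existing Node.js event, so the sketch uses its own `EventEmitter`; `StatsAggregator` and its method names are hypothetical, and only a small subset of the proposed metrics is modelled.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Hypothetical stand-in for the proposed 'stats' event: counters are
// updated cheaply on every occurrence, and one snapshot is emitted (and
// the counters reset) per interval.
class StatsAggregator extends EventEmitter {
  constructor() {
    super();
    this.reset();
  }

  reset() {
    this.gc = { count: 0, duration: 0 };
    this.httpServer = {
      requests: 0,
      status: { '2xx': 0, '3xx': 0, '4xx': 0, '5xx': 0 },
    };
  }

  recordGc(durationMs) {
    this.gc.count += 1;
    this.gc.duration += durationMs;
  }

  recordResponse(statusCode) {
    this.httpServer.requests += 1;
    const bucket = `${Math.floor(statusCode / 100)}xx`;
    if (bucket in this.httpServer.status) this.httpServer.status[bucket] += 1;
  }

  // Emit this interval's values and start the next interval from zero,
  // so listeners never have to compute deltas themselves.
  flush() {
    const snapshot = { gc: this.gc, http_server: this.httpServer };
    this.reset();
    this.emit('stats', snapshot);
    return snapshot;
  }

  start(intervalMs) {
    const timer = setInterval(() => this.flush(), intervalMs);
    timer.unref();
    return timer;
  }
}

const aggregator = new StatsAggregator();
aggregator.on('stats', (stats) => console.log(JSON.stringify(stats)));

aggregator.recordGc(12);
aggregator.recordResponse(200);
aggregator.recordResponse(404);
const first = aggregator.flush();  // contains the values recorded above
const second = aggregator.flush(); // all counters are back at zero
```

In a real agent, `start(10000)` would replace the manual `flush()` calls; they are used here only to make the reset visible.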

BTW, I wrote an article about Node.js metrics a while ago and hope it is helpful for the discussion: https://sematext.com/blog/2015/12/02/top-nodejs-metrics-to-watch/

danielkhan commented 8 years ago

With one month to go until Node Summit and flight bookings coming up, I think we should announce the APM Summit and set a time and date (25th or 26th of July).

After the date has been set: how do we get the message out to all vendors in this space? Could some neutral entity like maybe @othiym23 or @brycebaril take care of letting the right people know?

Qard commented 8 years ago

AppNeta seems to be unwilling to send me down for this. 😞

yunong commented 8 years ago

Please sign me up for this summit. We've been working on USDT support for perf and eBPF on Linux, and would also like to discuss how we could more tightly integrate this into restify.

jkrems commented 8 years ago

(For people like me who had to google what USDT stands for: http://www.brendangregg.com/blog/2015-07-03/hacking-linux-usdt-ftrace.html)

yunong commented 8 years ago

For additional details on USDT: see this issue @brendangregg has filed https://github.com/nodejs/diagnostics/issues/61

brendangregg commented 8 years ago

Thanks @yunong; that's the Linux perf_events work, which is all mainline.

There's also the Linux bcc/BPF work, where BPF is mainline and bcc is a Python add-on. @goldshtn wrote a post showing initial Node.js USDT support here:

http://blogs.microsoft.co.il/sasha/2016/03/30/usdt-probe-support-in-bpfbcc/

watson commented 8 years ago

Important update: The APM Summit was on the agenda at yesterday's Node.js Diagnostics Working Group meeting and it was decided not to have it at NodeSummit this month.

If you're interested, you can watch the APM segment from the meeting on YouTube or read the minutes.

Action needed:

Please fill in this Doodle if you want to attend the APM Summit and mark the dates / locations you are able to attend: http://doodle.com/poll/utqxycqki8chyddd

joshgav commented 8 years ago

Opened a continuation of this as a proposal for the Austin collaboration summit: https://github.com/nodejs/summit/issues/30

joshgav commented 8 years ago

Closing in deference to Austin Summit thread, @watson - please re-open if you'd like. Thanks!

nodejs / diagnostics

APM summit #58
