lightblue-platform / lightblue-rest

Data access layer as service REST API
GNU General Public License v3.0

Expose JVM and REST endpoints metrics to JMX #294

Closed svkm102 closed 7 years ago

alechenninger commented 7 years ago

I think as is we'll get all entities' inserts/finds/updates/saves lumped together. I will take a deeper look tomorrow.

svkm102 commented 7 years ago

My initial plan was to have all REST calls annotated to gather metrics at individual calls, but Dropwizard doesn't provide an easy way to do that (sans Spring). Dropwizard web application instrumentation follows a more generic approach of gathering metrics based on an endpoint URL pattern (e.g. /search/*) instead of the actual endpoint. While technically we can still create filters for all possible combinations under /search/*, that just makes the web.xml file a mess.

So I only configured the base URL pattern for each individual endpoint: /search/*, /find/*, /insert/*, etc. each gather aggregate metrics for that pattern. /metrics gathers all JVM metrics data (GC, memory, and threads).
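
For reference, here is a minimal sketch of what the JVM instrumentation and JMX exposure could look like with Dropwizard Metrics 3.x (the class name and registry wiring are illustrative, not the actual lightblue code):

import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jvm.GarbageCollectorMetricSet;
import com.codahale.metrics.jvm.MemoryUsageGaugeSet;
import com.codahale.metrics.jvm.ThreadStatesGaugeSet;

public class MetricsBootstrap {
  public static MetricRegistry createRegistry() {
    MetricRegistry registry = new MetricRegistry();

    // JVM metrics: GC, memory, and threads
    registry.register("jvm.gc", new GarbageCollectorMetricSet());
    registry.register("jvm.memory", new MemoryUsageGaugeSet());
    registry.register("jvm.threads", new ThreadStatesGaugeSet());

    // Expose everything in this registry as JMX MBeans
    JmxReporter.forRegistry(registry).build().start();

    return registry;
  }
}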

alechenninger commented 7 years ago

So I looked around at some possible RESTEasy/JAX-RS facilities, like interceptors, to give the metrics capture more context (such as path parameters), and didn't come up with many ideas.

We have a few requirements here:

  1. Capture operations by at least entity, and potentially also version, in an order that aggregates nicely. For example, insert.user.1.0.0 would easily aggregate in graphite queries as all inserts (insert.*) or all inserts for users (insert.user.*), etc. Other orders would be awkward.
  2. Handle streaming responses, which will be introduced with #289. This means that, for example, just timing the resource method is not enough, since it can return a streaming entity immediately, before the response has actually finished being written or even started.

I see two ways to do this:

  1. Add timer and error metric tracking code to all Command objects. This gives us total control over the content of the metric; for example, if something isn't in the path parameters we can still include it.
  2. Write a filter that converts the whole path to a metric name by replacing '/' with '.' (a rough sketch follows this list). The advantage is that it's a one-size-fits-all solution. The disadvantage is that it relies on path segments being ordered how we want, and it can only use information in the URL path.
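
As a rough sketch of option 2 (hypothetical; the class name, metric naming, and registry wiring are illustrative, not an actual implementation), a filter along these lines could time each request under a path-derived metric name:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class PathMetricFilter implements Filter {
  private final MetricRegistry registry;

  public PathMetricFilter(MetricRegistry registry) {
    this.registry = registry;
  }

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    // e.g. /data/find/user/1.0.0 becomes data.find.user.1.0.0
    String path = ((HttpServletRequest) req).getRequestURI();
    String name = path.replaceAll("^/+|/+$", "").replace('/', '.');

    Timer.Context timer = registry.timer(name).time();
    try {
      chain.doFilter(req, res);
    } finally {
      timer.stop();
    }
  }

  @Override
  public void init(FilterConfig config) {
  }

  @Override
  public void destroy() {
  }
}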

I'd probably lean towards 1, since it's not that hard to use the timers and meters in Dropwizard. Some of the common code can be shared. Perhaps write some kind of interface like:

interface RequestTimer {
  Context start(String operation, String entityName, String entityVersion);

  interface Context extends AutoCloseable {
    /** Stops timer and marks as success. */
    @Override
    void close();

    /** Stops timer and marks as failure. */
    void failure(Exception e);
  }
}

...which could use Dropwizard under the hood and track timers and meters for the errors. Just an idea.
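
One possible Dropwizard-backed shape for that interface (a sketch only; the class and metric names are made up, and it assumes a MetricRegistry is available):

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class DropwizardRequestTimer implements RequestTimer {
  private final MetricRegistry registry;

  public DropwizardRequestTimer(MetricRegistry registry) {
    this.registry = registry;
  }

  @Override
  public Context start(String operation, String entityName, String entityVersion) {
    // e.g. insert.user.1.0.0 so graphite can aggregate by prefix (insert.*, insert.user.*, ...)
    String base = MetricRegistry.name(operation, entityName, entityVersion);
    Timer.Context timer = registry.timer(MetricRegistry.name(base, "requests")).time();

    return new Context() {
      private boolean stopped;

      @Override
      public void close() {
        stopOnce();
      }

      @Override
      public void failure(Exception e) {
        stopOnce();
        // Track failures with a separate meter under the same prefix
        registry.meter(MetricRegistry.name(base, "errors")).mark();
      }

      private void stopOnce() {
        if (!stopped) {
          stopped = true;
          timer.stop();
        }
      }
    };
  }
}

A Command could then wrap its work in something like try (RequestTimer.Context ctx = timer.start("insert", "user", "1.0.0")) { ... }, calling ctx.failure(e) before rethrowing on error.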

alechenninger commented 7 years ago

Actually, I may have lied a little bit: graphite can do wildcards at any node, so even if the order of path segments were weird it wouldn't be a big deal, maybe just an odd annoyance. Still, a natural specificity-based order would be preferable.

svkm102 commented 7 years ago

Modified Dropwizard metrics-servlet code to gather metrics under namespaces containing the full path (e.g. api.rest.data.find.user.1.0.0.$metricName). The following metrics are supported per REST endpoint:

  - activeRequests : all active requests
  - errors : all errors or exceptions
  - requests : all requests
  - timeouts : all timeouts
  - responsecode :
    - badrequest : response 400
    - created : response 201
    - noContent : response 204
    - notFound : response 404
    - ok : response 200
    - serverError : response 500
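
For illustration only (this is not the modified metrics-servlet code; the class name and the fallback bucket for other status codes are made up), the per-endpoint metrics above map onto Dropwizard types roughly like this:

import com.codahale.metrics.Counter;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class EndpointMetrics {
  private final MetricRegistry registry;
  private final String namespace; // e.g. "api.rest.data.find.user.1.0.0"

  public EndpointMetrics(MetricRegistry registry, String namespace) {
    this.registry = registry;
    this.namespace = namespace;
  }

  // Incremented when a request starts, decremented when it completes
  public Counter activeRequests() {
    return registry.counter(namespace + ".activeRequests");
  }

  // Marked once per response, keyed by HTTP status code
  public Meter responseCode(int status) {
    switch (status) {
      case 200: return registry.meter(namespace + ".responsecode.ok");
      case 201: return registry.meter(namespace + ".responsecode.created");
      case 204: return registry.meter(namespace + ".responsecode.noContent");
      case 400: return registry.meter(namespace + ".responsecode.badrequest");
      case 404: return registry.meter(namespace + ".responsecode.notFound");
      case 500: return registry.meter(namespace + ".responsecode.serverError");
      default:  return registry.meter(namespace + ".responsecode." + status);
    }
  }
}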

The solution is based on the 2nd approach suggested above. The major issue with this approach is timing streaming responses: the time taken to process a streaming response also depends on the ability of the client to process the stream, and will not be an accurate indicator of lightblue processing time.

Apart from request timing, all other metrics for streaming endpoints should work the same as for normal endpoints.

alechenninger commented 7 years ago

We've talked about this in person, so I'll wait for updates before reviewing. I did want to clarify this point, however:

The time taken to process a streaming response also depends on the ability of the client to process the stream, and will not be an accurate indicator of lightblue processing time.

I wanted to touch on this a bit more to clarify (for my own education and others').

For non-streaming responses, if we only time the method in the JAX-RS resource, we are timing how long our database calls and business logic take. That does not include any of the overhead of writing to the response channel.

For streaming responses, we do include that overhead. There is no way around it of course because our iteration through the cursor is intermingled with writing to the response stream. However, there are a few reasons I don't think this should be much overhead:

So, naively, I would think this overhead is negligible or nonexistent most of the time. In either case, the server is bound by client TCP acknowledgements somewhere, and both streaming and non-streaming responses may or may not use chunked encoding depending on whether the response is sufficiently large or small (contrary to what the lightblue streaming code would suggest). It just depends whether or not we're tracking this as part of our metrics.

Still, we should track the streaming and non-streaming latencies separately; that way we can actually measure what the difference is. This should also be easier to do if we track the timing manually.
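
A hypothetical sketch of timing a streaming response manually (names and wiring are made up; it assumes the resource returns a JAX-RS StreamingOutput). The timer only stops after the entity has actually been written, and the metric name marks the request as streaming so it can be compared with the non-streaming timers:

import java.io.IOException;
import java.io.OutputStream;

import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.StreamingOutput;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class TimedStreamingOutput implements StreamingOutput {
  private final StreamingOutput delegate;
  private final Timer.Context timer;

  public TimedStreamingOutput(StreamingOutput delegate, MetricRegistry registry, String baseName) {
    this.delegate = delegate;
    // Keep streaming latencies under their own name, e.g. find.user.1.0.0.stream.requests,
    // so they can be compared against the non-streaming timers
    this.timer = registry.timer(MetricRegistry.name(baseName, "stream", "requests")).time();
  }

  @Override
  public void write(OutputStream out) throws IOException, WebApplicationException {
    try {
      delegate.write(out);
    } finally {
      // Only stop the timer once the entity has been written to the client
      timer.stop();
    }
  }
}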

svkm102 commented 7 years ago

Refer to #297 for the updated PR.