stagemonitor / stagemonitor-mailinglist

GitHub issues abused as a mailing list

Stagemonitor 0.80.0.RC1 released #46

Open · felixbarny opened this issue 7 years ago

felixbarny commented 7 years ago

It’s finally done, well, almost… A lot of things have changed, which is why we decided to publish a release candidate before the proper release. So please try it out, but be aware that it is not production ready yet, as there might still be some bugs lingering in there.

So what exactly is (almost) done?

Stagemonitor now supports distributed tracing :tada:. This is a technique which is especially useful (not to say necessary) when you are operating a set of distributed services which interact with each other (not to say microservices).

What is distributed tracing and why should I care?

Distributed tracing enables you to "debug" these kinds of architectures. By debugging I mean finding out what happened during a particular request, what caused an error and why the request was slow. Traditionally, all that was required to understand what was happening inside an application was to attach a profiler. For example, you could use previous versions of stagemonitor in your application and the call tree would show you which methods and JDBC queries were particularly slow. But when a request is not served by only one service but potentially dozens or hundreds, things get a bit more complicated. This is where distributed tracing comes in. It helps you reason about which services were involved in a request and how much time each service contributed to the total execution time.

The main concepts in OpenTracing are spans and traces. A span represents a single unit of work in one service. A trace groups the spans belonging to a single transaction, potentially coming from many different services, and propagates through a (potentially distributed) system. More information about OpenTracing can be found on the official website http://opentracing.io/.
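To make the terminology a bit more concrete, here is a minimal sketch using the OpenTracing Java API. The Tracer instance and the operation name are placeholders; in a stagemonitor application these spans are normally created for you by the instrumentation, so you would not write this code yourself:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;

public class CheckoutService {

    private final Tracer tracer; // provided by whatever OpenTracing implementation is in use

    public CheckoutService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void checkout() {
        // A span represents one unit of work in one service.
        Span span = tracer.buildSpan("checkout").start();
        try {
            // Do the actual work here, possibly calling other services.
            // Spans created downstream share the same trace id and therefore
            // belong to the same trace.
        } finally {
            span.finish();
        }
    }
}
```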

How is distributed tracing implemented in stagemonitor?

Stagemonitor uses the OpenTracing API, which is a vendor- and language-neutral standard for distributed tracing. The two most important advantages for stagemonitor are that the actual OpenTracing implementation can be changed at any time and that stagemonitor can take advantage of third-party libraries which add support for certain technologies, like OkHttp. In fact, stagemonitor already supports multiple OpenTracing implementations.

Zipkin support

In practice, that means it is not only possible to send tracing data to Elasticsearch, which is the "traditional" way for stagemonitor. Zipkin is now also supported as a backend. This is enabled by the Brave OpenTracing bridge which, uhm… bridges the OpenTracing API stagemonitor uses to Brave, the library of choice when it comes to reporting to Zipkin. This is also interesting for Brave users, as stagemonitor can automatically instrument their applications without any code changes.

Log Correlation

Stagemonitor can also assist in correlating logs, as it adds useful information to slf4j’s MDC (Mapped Diagnostic Context). This information can be used to identify which application, host and instance a log line is coming from (nothing new so far) as well as which trace the log belongs to (great news, everyone).
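As an illustration, the MDC values can be read programmatically (or referenced in a logging pattern via %X{key}). The key names used below are assumptions for the sake of the example, not the authoritative list stagemonitor uses:

```java
import org.slf4j.MDC;

public class MdcLogCorrelationExample {

    public static void logCorrelationInfo() {
        // stagemonitor populates the MDC while a request is being processed.
        // The key names here ("traceId", "application", "host", "instance")
        // are illustrative assumptions; check the documentation for the real ones.
        String traceId = MDC.get("traceId");
        String application = MDC.get("application");
        String host = MDC.get("host");
        String instance = MDC.get("instance");
        System.out.printf("trace=%s application=%s host=%s instance=%s%n",
                traceId, application, host, instance);
    }
}
```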

Distributed tracing sounds nice, but why should I use stagemonitor?

There are a few things which set stagemonitor apart from other tracing solutions.

One interesting feature stagemonitor has supported from the beginning is the included profiler which generates a call tree for a request. The call tree shows you which methods were executed during a request and which ones were particularly slow. This feature gets a new dimension in the context of distributed tracing. Usually, you would "only" find out which application was slow, but not why. Maybe the reason the application was slow was not that it executed a lot of requests to other applications, but that its own code was inefficient. Also, even if you have found out that one application executes 1000 SQL statements, you do not necessarily know where in the code that happens and why. This is where stagemonitor’s distributed call tree can help out. Stagemonitor answers the question why the application is slow and where the slow calls are located in the code.

Another feature of stagemonitor is that no code changes are required in your application. You do not have to manually implement any tracing code or configure third-party modules. Stagemonitor transparently injects tracing into your code via byte code manipulation.

Unlike most libraries capable of distributed tracing, stagemonitor also extracts metrics from the spans it collects. These metrics include response time percentiles, throughput and error rates. When you embrace distributed tracing, you are likely to store only a fraction of the actual traces in your backend (a.k.a. sampling) in order not to overwhelm your drives. This also means that you might miss some information, especially about outliers. The advantage of metrics is that they do not grow in size as your application serves more requests. Metrics are always calculated for all requests, no matter whether their corresponding spans are sampled or not.

[Image: stagemonitor Grafana OpenTracing dashboard (stagemonitor-grafana-ot)]

What’s next

The journey has just begun and there is still a lot of work ahead of us. For example, support for span context propagation is still quite limited. Currently, Spring’s RestTemplate is instrumented so that it sends information about the span context downstream via HTTP headers. This information is then used to correlate spans which belong to the same trace.
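Conceptually, the propagation works roughly like the sketch below, which uses the plain OpenTracing API: the calling side injects the current span context into the outgoing HTTP headers, and the called side extracts it again and uses it as the parent of its own span. This is a simplified illustration, not stagemonitor’s actual RestTemplate instrumentation:

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapExtractAdapter;
import io.opentracing.propagation.TextMapInjectAdapter;

import java.util.HashMap;
import java.util.Map;

public class PropagationSketch {

    // Client side: write the current span context into the outgoing HTTP headers.
    static Map<String, String> injectHeaders(Tracer tracer, Span clientSpan) {
        Map<String, String> headers = new HashMap<>();
        tracer.inject(clientSpan.context(), Format.Builtin.HTTP_HEADERS, new TextMapInjectAdapter(headers));
        return headers;
    }

    // Server side: read the span context from the incoming HTTP headers
    // and start the server span as a child of it, so both end up in the same trace.
    static Span extractAndStartServerSpan(Tracer tracer, Map<String, String> headers) {
        SpanContext parent = tracer.extract(Format.Builtin.HTTP_HEADERS, new TextMapExtractAdapter(headers));
        return tracer.buildSpan("handle-request").asChildOf(parent).start();
    }
}
```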

Another exciting thing we are currently working on is a major overhaul of the end user monitoring. In the future, stagemonitor will provide more insight into what is going on in the browser in terms of page load time, JavaScript errors and Ajax spans, which are correlated with server spans. It will even be possible to monitor arbitrary HTML websites, so that non-Java applications can be monitored as well.

Migration

The OpenTracing API is now at the very core of stagemonitor. That meant a lot of refactoring and a lot of changes. Some changes also directly affect users of stagemonitor.

The most notable changes are the renaming of the modules stagemonitor-web to stagemonitor-web-servlet and stagemonitor-requestmonitor to stagemonitor-tracing. Also, you now have to include the module which is responsible for reporting the spans to a particular backend. So if you want to report to Elasticsearch, do not forget to add a dependency on stagemonitor-tracing-elasticsearch. If you want to report to Zipkin, add a dependency on the stagemonitor-tracing-zipkin module.

One of the formerly most central classes in stagemonitor - RequestTrace - has been replaced with the Span interface from the OpenTracing API. So, if you previously enhanced the request trace with your own custom information, you will need to migrate to the Span interface. To add a custom value to the current span, just use TracingPlugin.getCurrentSpan().setTag("foo", "bar"). The migration should be straightforward.
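For illustration, the migration of a custom value might look roughly like the sketch below. The call to TracingPlugin.getCurrentSpan().setTag(...) is taken from the description above; the import and the surrounding class are assumptions (in particular, the package of TracingPlugin should be verified against the 0.80.0.RC1 sources):

```java
// Assumption: TracingPlugin lives in the renamed stagemonitor-tracing module.
import org.stagemonitor.tracing.TracingPlugin;

public class CheckoutController {

    public void onCheckout(String customerType) {
        // Before 0.80.0 this value would have been attached to the RequestTrace.
        // Now it is added as a tag on the current OpenTracing span instead:
        TracingPlugin.getCurrentSpan().setTag("customer_type", customerType);
    }
}
```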

The library UADetector, which stagemonitor uses to parse the User-Agent header, has unfortunately been discontinued, as the underlying database is no longer free. This is why the parsing of the user agent is deactivated by default. If you want to enable it, set stagemonitor.requestmonitor.http.parseUserAgent = true and add a dependency on net.sf.uadetector:uadetector-resources:2014.10. In the future, stagemonitor will support an up-to-date user agent parsing library out of the box. In the meantime, you can also look at the Elasticsearch ingest user agent plugin. See https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-user-agent.html for more information.

For a comprehensive list of all features and breaking changes, please refer to the release notes.

Thank you

Thank you to everyone who participated in the process and who gave feedback (@ryanrupp, @marcust, @kishoremk, @trampi). A special thanks to @adriancole, who tested the Brave/Zipkin integration, gave me valuable tips and was an overall nice guy. It’s always cool when technology connects people who previously did not have anything in common.

Feedback

If you have any kind of feedback, please share it, either as a comment on this post or as an issue in the stagemonitor repo.

felixbarny commented 7 years ago

/cc @bloomper @hrzbrg

hrzbrg commented 7 years ago

Thanks for your awesome work, Felix. I have deployed the first services with 0.80.0.RC1 to our testing environment (Java 8, Tomcat 8). I also tried running it without -javaagent again, but it doesn't fully work.

felixbarny commented 7 years ago

Thanks for testing. Any error messages?

hrzbrg commented 7 years ago

See the attached screenshot ("Bildschirmfoto 2017-06-12 um 14 19 26").

felixbarny commented 7 years ago

Are you using a JDK or a JRE?

hrzbrg commented 7 years ago

Oh well, stupid me was sure that there was a JDK in the container, but there is only a JRE. However, the JDK inflates our container by ~100 MB, so we will most likely stay with the agent.

felixbarny commented 7 years ago

Could you share somewhere, for example in the FAQ section of the wiki, how to apply the agent in a Docker container?

felixbarny commented 7 years ago

Known issues in this version:

codefromthecrypt commented 7 years ago

Do you have a code link to where the sampling decision discrepancy occurs? Also, is there a reason why this is set late?

hrzbrg commented 7 years ago

I can try to break our container down to the minimum needed to describe how we load things there (when I find the time).

Also, I have another issue: the saved search for Request Analysis does not appear in Kibana for me, but other dashboards and visualisations are available. And there is a problem with Grafana: if I use the Request Metrics dashboard and select, for example, an application with operation type jdbc, the Grafana frontend crashes when that application does not have any JDBC connections.