opencb / opencga

An Open Computational Genomics Analysis platform for big data genomics analysis. OpenCGA is maintained and developed by its parent company Zetta Genomics. Please contact support@zettagenomics.com for bug reports and feature requests.
Apache License 2.0

Writeup options for distributed tracing and metrics #1014

Open lawrencegripper opened 5 years ago

lawrencegripper commented 5 years ago

This is to investigate what can be offered beyond adding a health-check endpoint (#1015) and using tools to collect the existing logs and basic box metrics (#1016).

The aims of the additional tracing and metrics would be to provide quick resolution of issues through automatic alerting, detailed user session tracing and system-wide metric collection.

jjcollinge commented 5 years ago

I'll try and keep this as objective as possible but no doubt some subjectivity will seep in so please apply your own thinking on top.

I believe the health endpoints added in #1015 are used as a way to 'black box' monitor the solution as an external client. This is a great way to determine "what" is happening from a client's or user's perspective.

The work done in #1016 was to enable pre-built Microsoft monitoring solutions to monitor the containers, virtual machines and HDInsight. These monitoring solutions will ship component and infrastructure logs and health reports to Azure Log Analytics. Although this issue does not yet resolve all infrastructure and component monitoring, I expect it will in the fullness of time, e.g. HDInsight support, MongoDB, metrics, etc.

Currently we have not looked at application-level monitoring (logs and tracing). There are many different technologies in this space, many that are complementary and others that overlap. I'll only discuss a few options here, but you may find a better solution for you via your own due diligence.

Firstly, to differentiate tracing and logging:

Log = a single event within an application-specific context.

Trace = a record of an event as it passes through the entire system, correlated across application boundaries.
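To make the distinction concrete, here is a minimal sketch in plain Java. Everything in it is hypothetical (the `TraceVsLogSketch` class, the `rest-api` and `storage` tier names, the in-memory `collected` list); it is not OpenCGA code or any tracing library's API. It only shows the idea that a trace is a set of records sharing one correlation id that is passed across component boundaries, whereas a log line only knows its local context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class TraceVsLogSketch {
    /** One unit of work; spans that share a traceId together form one trace. */
    static final class Span {
        final String traceId;
        final String service;
        final String operation;

        Span(String traceId, String service, String operation) {
            this.traceId = traceId;
            this.service = service;
            this.operation = operation;
        }

        @Override
        public String toString() {
            return traceId + " " + service + " " + operation;
        }
    }

    // Stand-in for a tracing backend (assumption: a real system would ship spans off box).
    static final List<Span> collected = new ArrayList<>();

    // Hypothetical front tier: creates the correlation id and hands it downstream.
    static String handleRequest() {
        String traceId = UUID.randomUUID().toString();
        collected.add(new Span(traceId, "rest-api", "GET /variants"));
        queryStorage(traceId);
        return traceId;
    }

    // Hypothetical downstream tier: reuses the caller's id instead of inventing its own.
    static void queryStorage(String traceId) {
        collected.add(new Span(traceId, "storage", "query"));
        // A plain log line, by contrast, carries no link back to the originating request.
        System.out.println("storage: query finished");
    }

    public static void main(String[] args) {
        String traceId = handleRequest();
        // Reassemble the trace by filtering on the shared id.
        collected.stream()
                 .filter(span -> span.traceId.equals(traceId))
                 .forEach(System.out::println);
    }
}
```

In a real deployment the id would travel in a request header between processes rather than as a method argument, which is the part the tracing libraries discussed below take care of.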

Azure Application Insights is Microsoft's first-party application monitoring solution. Application Insights would be added to OpenCGA's Java code as a library and can be configured to instrument requests, track dependencies, collect performance counters (via JMX), diagnose performance issues and exceptions, and add custom events and application logs. All of this monitoring data will be sent off box to the Azure Application Insights service. It will then be indexed and made available to you through Azure Monitor under "Applications". You can use the same query language that you use over Log Analytics. You can put graphs and visuals in the same dashboards as your Log Analytics and you create Alerts the same way too. Azure Application Insights is fully supported by Microsoft.

OpenCensus is a unified framework for telemetry collection that provides a suite of client libraries (Go, Java, C#, Node.js, C++, Ruby, Erlang/Elixir, Python, Scala and PHP). The libraries send trace and metric data to backends that have been implemented by vendors (Azure Monitor, Datadog, Instana, Jaeger, SignalFX, Stackdriver, and Zipkin). Some backends support only traces or only metrics and others support both. The USP of OpenCensus is its standardised data schemas, its correlation across both traces and metrics, and its array of supported backends. By using the OpenCensus libraries you'll be using a standard interface that won't lock your code into a particular vendor. It's also worth noting that Application Insights is now adopting OpenCensus support via the LocalForwarder. To see which backends support traces and metrics look here. You'll see that in Java, Azure Monitor only supports traces. However, you can export OpenCensus metrics to Prometheus. OpenCensus seems to have good momentum across cloud vendors but is still fairly immature.
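The "standard interface, pluggable backend" idea can be sketched in plain Java as below. To be clear, this is not the OpenCensus API: `SpanExporter`, `Telemetry` and the two backend labels are hypothetical names I've made up to show the shape of the design, where application code records telemetry against one interface and the vendor choice is made at registration time.

```java
import java.util.ArrayList;
import java.util.List;

final class VendorNeutralSketch {
    /** Hypothetical stand-in for a vendor-specific exporter; not a real library type. */
    interface SpanExporter {
        void export(String spanName, long durationMillis);
    }

    /** The only type application code talks to; it knows nothing about vendors. */
    static final class Telemetry {
        private final List<SpanExporter> exporters = new ArrayList<>();

        void register(SpanExporter exporter) {
            exporters.add(exporter);
        }

        void recordSpan(String name, long durationMillis) {
            // Fan the same record out to every registered backend.
            for (SpanExporter exporter : exporters) {
                exporter.export(name, durationMillis);
            }
        }
    }

    public static void main(String[] args) {
        Telemetry telemetry = new Telemetry();
        // Swapping or adding a backend only touches these registration lines.
        telemetry.register((name, ms) -> System.out.println("[backend-a] " + name + " " + ms + "ms"));
        telemetry.register((name, ms) -> System.out.println("[backend-b] " + name + " took " + ms));
        telemetry.recordSpan("variant-query", 42);
    }
}
```

The call sites (`recordSpan`) stay untouched when a backend is replaced, which is the lock-in protection described above.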

Prometheus is a time-series database for highly efficient storing and indexing of metrics. Typically you'd use a library that exposes your application metrics in a Prometheus-compatible format on an accessible HTTP endpoint. Prometheus would scrape the endpoint every 'x' seconds, aggregate the data and evaluate any rules to see whether it should trigger an alert. Using Prometheus in its native distribution would require you to run Prometheus master instances and potentially some collectors. Prometheus is a battle-ready technology and is often hooked up with Grafana - however, it does involve running additional components.
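As a rough illustration of what "exposing metrics on an HTTP endpoint" means, here is a sketch that uses only the JDK's built-in `com.sun.net.httpserver.HttpServer` to serve one counter in the Prometheus text exposition format. The metric name `opencga_requests_total` and the class name are assumptions for the example, not existing OpenCGA metrics, and in practice you would more likely use a Prometheus client library than hand-render the text.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

final class MetricsEndpointSketch {
    // Hypothetical application counter, incremented wherever a request is handled.
    static final AtomicLong requestsTotal = new AtomicLong();

    /** Renders the counter in the Prometheus text format: HELP, TYPE, then the sample. */
    static String render() {
        return "# HELP opencga_requests_total Requests handled since start.\n"
             + "# TYPE opencga_requests_total counter\n"
             + "opencga_requests_total " + requestsTotal.get() + "\n";
    }

    /** Starts an HTTP server whose /metrics path a Prometheus scraper could poll. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        requestsTotal.incrementAndGet();
        HttpServer server = start(0); // port 0 asks the OS for any free port
        int port = server.getAddress().getPort();
        // Act as the scraper once, print what it would see, then shut down.
        try (InputStream in = URI.create("http://localhost:" + port + "/metrics").toURL().openStream()) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } finally {
            server.stop(0);
        }
    }
}
```

The application never pushes anything; it is the Prometheus side that decides the scrape interval and evaluates alert rules, which is why the extra server components mentioned above are needed.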

OpenTracing is an alternative standard to OpenCensus that provides a vendor-neutral tracing API and clients for Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, C++ and C#. OpenTracing is a little more mature than OpenCensus and is part of the CNCF. However, OpenCensus appears to be a little more vendor friendly, so Microsoft, AWS, etc. seem to have better interoperability with OpenCensus. There is a 3rd party adapter for OpenTracing to Application Insights which we have used in another project (https://github.com/petabridge/Petabridge.Tracing.ApplicationInsights). Supported OpenTracing tracers: CNCF Jaeger, LightStep, Instana, Apache SkyWalking, inspectIT, stagemonitor, Datadog, Wavefront. Each of these implementations is a little different but they typically require you to run a "collector" locally that forwards to an HA "master" hosted elsewhere. Jaeger and possibly some other implementations also support exposing Prometheus-compatible metrics.

Grafana is an analytics dashboard for your metrics. It allows you to query, visualise and alert on your metrics. It's typically used with Prometheus but supports a number of other data sources including Azure Monitor. If you want a more flexible dashboarding tool outside of Azure, consider running a Grafana instance and ingesting your Azure Monitor telemetry.

Azure Monitor is Azure's suite of monitoring related services. It provides a single interface to access Application Insights, Log Analytics, Alerts, Diagnostic logs, Azure Metrics, etc. For more information on the topology of Azure Monitor, look here