vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Specification for metrics collection #2007

Closed: binarylogic closed this issue 4 years ago

binarylogic commented 4 years ago

As a follow up to #1761 we'll need a spec on which metrics we want to collect. I would like to inventory all of these metrics so we can obtain consensus and document them properly. This can be as simple as a list with metric names and labels.

lukesteensen commented 4 years ago

Metrics Collection

Philosophy

For inspiration, we'll look at the RED (rate, errors, duration) and USE (utilization, saturation, errors) methodologies. Rate and errors are virtually always relevant, and depending on the component, utilization and duration can be as well.

For rate, we should record an events counter that records each event passing through a component. This will provide a baseline against which to compare other numbers, let us derive event rates per time period, etc. Where relevant, we should also record a total_bytes counter. This will give us a rough idea of total data volume, let us calculate average event size, etc.
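
For illustration only, here is a minimal Rust sketch of those two rate counters, using plain atomics rather than any particular metrics backend (the struct and method names are hypothetical, not a proposed API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-component counters for the "rate" signal.
#[derive(Default)]
struct ComponentStats {
    events: AtomicU64,      // every event passing through the component
    total_bytes: AtomicU64, // rough data volume, where a byte size is available
}

impl ComponentStats {
    fn record_event(&self, byte_size: usize) {
        self.events.fetch_add(1, Ordering::Relaxed);
        self.total_bytes.fetch_add(byte_size as u64, Ordering::Relaxed);
    }
}

fn main() {
    let stats = ComponentStats::default();
    stats.record_event(128);

    // Average event size can be derived from the two counters.
    let avg = stats.total_bytes.load(Ordering::Relaxed) / stats.events.load(Ordering::Relaxed);
    assert_eq!(avg, 128);
}
```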

Errors are pretty self-explanatory. We don't need metrics for every possible kind, but we should look over error! locations in the codebase and instrument those that have a chance to happen repeatedly over time.

Utilization is important for components that take a significant amount of some non-CPU system resources. For example, we measure the memory utilization of the Lua transform. Memory utilization of in-flight batches is another good example.

Duration applies to components like HTTP-based sinks, where there's some request-response cycle we want to time. It can also be used around things like the Lua transform where runtime can depend heavily on the configuration.

Implementation

Naming

As much as possible, names for the same type of metrics should be consistent across components. They should be namespaced by the category of component (i.e. source, transform, sink, internal) and use common suffixes for the same data (e.g. events and total_bytes).

The example instrumentation so far in #1953 uses a rough {category}.{type}.{name} scheme. We could alternatively break out one or more of the namespacing components into tags. I think this could make sense for type especially. Opinions wanted.
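
To make the trade-off concrete, here is a hypothetical sketch of the two shapes under discussion: everything encoded into the dotted key versus pulling type out into a tag. The helper names are illustrative only:

```rust
// Option A: fully namespaced key, {category}.{type}.{name}
fn namespaced_key(category: &str, component_type: &str, name: &str) -> String {
    format!("{}.{}.{}", category, component_type, name)
}

// Option B: shorter key, with `type` carried as a tag so that shared
// subcomponents can emit the metric without knowing their type up front.
fn key_with_tags(category: &str, name: &str, component_type: &str) -> (String, Vec<(String, String)>) {
    (
        format!("{}.{}", category, name),
        vec![("type".to_string(), component_type.to_string())],
    )
}

fn main() {
    assert_eq!(namespaced_key("sink", "http", "events"), "sink.http.events");

    let (key, tags) = key_with_tags("sink", "events", "http");
    assert_eq!(key, "sink.events");
    assert_eq!(tags[0], ("type".to_string(), "http".to_string()));
}
```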

Shared components

The naming scheme above runs into some complications with shared subcomponents or those that are simple wrappers around another. Since we don't know the whole runtime context at the callsite, we can't include things like type.

The current examples simply omit that portion of the key and rely on the name. Perhaps a better alternative is to make type always a tag (as discussed above) so that we can add it seamlessly later with a tracing-based metrics backend.

Durations

In certain areas of the code, measuring durations is currently very complex due to the pre-async/await style. There are two ongoing pieces of work that should simplify them greatly: refactoring to use async/await and building the tracing-backed metrics backend. Where possible, we should prefer advancing one of those two items over doing the hard work of wiring timestamps through the existing structure.
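
As a rough illustration of why the async/await refactor helps here, timing a request-response cycle collapses to a couple of lines around a single await point. In this sketch, send_request is a stand-in for a sink's real HTTP call, and the futures crate's block_on is used only to keep the example self-contained:

```rust
use std::time::{Duration, Instant};

// Stand-in for a sink's request future; the real code would issue an HTTP call.
async fn send_request() -> Result<(), ()> {
    Ok(())
}

// With async/await, the duration measurement is just a local variable around
// the await point, instead of timestamps threaded through hand-written
// futures combinators.
async fn timed_request() -> (Result<(), ()>, Duration) {
    let start = Instant::now();
    let result = send_request().await;
    (result, start.elapsed())
}

fn main() {
    // futures = "0.3" provides a minimal block_on for running the sketch.
    let (result, elapsed) = futures::executor::block_on(timed_request());
    assert!(result.is_ok());
    println!("request took {:?}", elapsed);
}
```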

Checklist

This is a rough skeleton of an instrumentation plan. Once we settle on a naming scheme we can go through and expand each item into the actual names of the metrics we want to gather. We can also drop the optional Utilization and Duration bits where they're not relevant.

For now, I've checked off the ones added as examples in #1953.

Sources

docker

file

journald

kafka

socket

stdin

syslog

vector

Transforms

add_fields

add_tags

ansi_stripper

aws_ec2_metadata

coercer

concat

field_filter

grok_parser

json_parser

log_to_metric

logfmt_parser

lua

merge

regex_parser

remove_fields

remove_tags

rename_fields

sampler

split

swimlanes

tokenizer

Sinks

aws_cloudwatch_metrics

aws_kinesis_firehose

aws_kinesis_streams

aws_s3

blackhole

clickhouse

console

datadog_metrics

elasticsearch

gcp_cloud_storage

gcp_pubsub

gcp_stackdriver_logging

http

humio_logs

influxdb_metrics

kafka

logdna

loki

new_relic_logs

prometheus

sematext

socket

splunk_hec

statsd

vector

Hoverbear commented 4 years ago

@lukesteensen Can you help me understand the status of this ticket? It hasn't seen activity in 3 weeks.

lukesteensen commented 4 years ago

@Hoverbear This is step 3 in the plan of attack as laid out in the RFC. Once #1953 is merged, we will update this issue with a list of events to be added.

lukesteensen commented 4 years ago

The initial implementation has been merged! :tada:

So far, the "spec" has been pretty minimal:

  1. Each component (source, transform, or sink) should add to the events_processed counter for each event it encounters. Where possible, it should use component_kind ("source", "transform", or "sink") and component_type (e.g. "syslog" or "file") labels for that counter.

  2. Transforms that can drop events in certain circumstances (e.g. missing field, failed regex match) should add to a processing_errors counter with the same labels as above and an additional error_type label.

  3. Other types of errors should increment their own counters, also with the relevant component labels.

These all focus on the metrics that are generated via events, since those are user-facing and need to be consistent. We will expand this moving forward, including more patterns for the events themselves.
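
For concreteness, here is a rough sketch of what points 1 and 2 could look like at a call site, written against the metrics crate's older counter!(name, value, labels...) macro form. The crate's API has changed since, and Vector's actual call sites may differ, so treat this purely as an illustration of the metric names and label shape:

```rust
use metrics::counter;

// Point 1: every component bumps `events_processed`, labeled with its
// component_kind ("source", "transform", or "sink") and component_type.
fn record_event(component_kind: &'static str, component_type: &'static str) {
    counter!(
        "events_processed", 1,
        "component_kind" => component_kind,
        "component_type" => component_type
    );
}

// Point 2: a transform that drops an event bumps `processing_errors`
// with the same labels plus an error_type label.
fn record_processing_error(component_type: &'static str, error_type: &'static str) {
    counter!(
        "processing_errors", 1,
        "component_kind" => "transform",
        "component_type" => component_type,
        "error_type" => error_type
    );
}

fn main() {
    // Without an installed metrics recorder these calls are no-ops, which is
    // enough to show the naming and label conventions.
    record_event("source", "syslog");
    record_processing_error("regex_parser", "failed_match");
}
```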

Alexx-G commented 4 years ago

Currently I'm using 0.8.2 (hopefully I'll have the bandwidth to upgrade to 0.9.0 in a week or so), and while it's enough for a validation phase, it's good to see that more detailed monitoring is coming. Recently I had a pretty unpleasant surprise and investigation involving the fluent* stack, and my requirements for metrics mostly derive from that experience. There was a significant discrepancy between the logs emitted by a source (a huge number of k8s pods) and the logs that actually got indexed by the output (splunk). The "high availability config" recommended by fluentd makes such investigations even harder, since it adds another possible point of failure.

Because of that, it's quite important to be able to validate that the number of collected events matches the number of records successfully emitted by the sink (ideally taking into account events dropped by a transform). Another useful thing is being able to measure input/output traffic and compare it to the event count.

Yet another problem is that log forwarders usually exclude their own logs (for obvious reasons), so it's important to have any internal errors (which would otherwise only appear in those logs) represented as metrics that can be used to define alerts. Normally, once the logging operator is deployed and validated, its logs are checked only when there are observable problems, so it's really important to be able to define alerts for errors that are logged (e.g. sink X returned Y errors during the last 30 min).

IIRC the proposed metrics cover almost everything I would need for a similar investigation with Vector. I'm not sure about topologies other than "distributed", but since all components share the same metrics, I guess it's a question of labels.

binarylogic commented 4 years ago

Thanks @Alexx-G, that's helpful, and what you described will be possible with our approach. At a minimum, Vector will expose two counters:

  1. events_processed{component_type, component_name}
  2. events_errored{component_type, component_name, type, dropped}

These names and labels will change. That's what we're working through. If you have any other requirements please let us know. I'll also ping you on the PR that introduces all of this.

Alexx-G commented 4 years ago

I've checked exactly which metrics helped a lot with the fluentd investigation:

There's a metric that might not be required, but it helps a lot with fine-tuning and avoiding hard limits on the events destination:

Also, the ability to define custom metrics (e.g. an event counter for a specific source/transform/sink) and add them to the built-in metrics is highly valuable. In my case the problem was a lack of auto-scaling for the "log aggregators"; however, I needed a few custom metrics to confirm it and find a temporary solution.

IIRC all of these, except flush retries and queue length, are covered by this spec. I haven't had a chance to check it yet, but the "log_to_metric" transform seems to cover the "custom metric" use case.

binarylogic commented 4 years ago

That's very helpful. We'll take all of that into account when defining all of the metrics.

Alexx-G commented 4 years ago

@binarylogic One quick question: is there a metric (existing or planned) to track exceeded rate limits? I noticed that we're receiving log lines like Apr 28 07:08:36.116 TRACE sink{name=splunk type=splunk_hec}: tower_limit::rate::service: rate limit exceeded, disabling service, so it should definitely be surfaced on a dashboard.

binarylogic commented 4 years ago

Definitely. We're also addressing the higher level problem of picking rate limits with https://github.com/timberio/vector/pull/2329.

Alexx-G commented 4 years ago

Oh, I somehow missed "rate" being mentioned so many times in the comment with the metrics attack plan. Forgive my ignorance :) Awesome, thanks! I'm following that RFC; it's quite promising.

Alexx-G commented 4 years ago

Hey @binarylogic @lukesteensen, I can lend a hand adding rate/error counters to some sinks (splunk, aws_kinesis, prometheus) and maybe to some transforms. Luke has done a great job on the initial implementation, so it should be easy to contribute. Do you mind if I create a PR for a couple of components?

lukesteensen commented 4 years ago

@Alexx-G that would be wonderful, thank you!

binarylogic commented 4 years ago

Closing this since it will be superseded by the work in #3192. We've since switched to an event driven system, and we need specific issues for implementing the remaining events. We are defining the remaining work now.