kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0

PoC OpenTelemetry - general setup #10873

Closed by a-thaler 3 years ago

a-thaler commented 3 years ago

Description

This is the first piece needed to solve https://github.com/kyma-project/kyma/issues/10119

Motivation:

High-level outcome:

Technical goals:

Attachments

suleymanakbas91 commented 3 years ago

OpenTelemetry Collector

The Collector is a single binary that can be configured either as an Agent or as a Gateway. The Agent does the lightweight job of collecting the data and sending it to the Gateway, whereas the Gateway performs the more advanced, heavy-lifting filtering operations (similar to the FluentBit/Fluentd setup).

Deployment Type

There are three different deployment modes: DaemonSet, Deployment (default), sidecar.

We can start with the Agent-as-DaemonSet setup and see if it is sufficient. If not, we can add a Gateway as a Deployment and move the heavy-lifting parts there.
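The Agent-to-Gateway split described above boils down to the Agent forwarding everything it receives over OTLP. A minimal sketch of the Agent side, where the Gateway service name and namespace are assumptions for illustration:

```yaml
# Agent (DaemonSet) side: receive locally, forward everything to the Gateway.
# The Gateway address is an example value, not an existing service.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: otel-gateway.kyma-system.svc.cluster.local:55680
    insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

The Gateway would then run the heavier processors in its own pipelines before exporting to the backend.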

There is an OpenTelemetry Operator and a Helm chart to deploy the Collector. The Operator makes the configuration easier through CRDs, but it is yet another piece of software to maintain. That's why deploying via the Helm chart seems like the less troublesome start for us.
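For reference, the Helm-based deployment would look roughly like this; the release name, namespace, and values file are assumptions for illustration:

```shell
# Add the official OpenTelemetry Helm repository and install the Collector chart.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace kyma-system \
  --values collector-values.yaml
```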

Configuration

There are three concepts in the configuration for data collection and modification: receivers, processors, and exporters. They correspond to inputs, filters, and outputs in the FluentBit configuration.

However, they take effect only when used in a pipeline. A pipeline consists of a set of receivers, processors, and exporters, and acts as the execution recipe. Each pipeline is of type traces, metrics, or logs, and there can be multiple pipelines of the same type.

There are also extensions that provide further functionality. They must be defined in the extensions field to take effect and are primarily meant for tasks that do not involve processing telemetry data, such as health monitoring, service discovery, and data forwarding. Extensions are optional.

In the end, a sample configuration looks like this:

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    endpoint: otelcol:55680

extensions:
  health_check:
  pprof:
  zpages:

service:
  extensions: [health_check,pprof,zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Note: Dynamic loading of the configuration is not possible at the moment and is not planned for GA. There is a proposal to allow remote configuration of Collectors, which could also be useful for our case.

Processors

Processors are used to modify or filter telemetry data. We can use the Filter Processor for metrics and the Span Processor for traces to filter data based on regexes. We can also use the Memory Limiter Processor to prevent out-of-memory issues. Additionally, the Batch Processor is useful for batching and compressing the data before export.
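Putting the processors mentioned above together, a possible setup could look like this; the regex, limits, and batch sizes are example values, not a recommendation:

```yaml
# Example processor chain; the memory_limiter should run first in the pipeline.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - "go_.*"
  batch:
    timeout: 5s
    send_batch_size: 1024

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter, batch]
      exporters: [otlp]
```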

FluentBit Subprocess Extension

There is an extension called FluentBit Subprocess Extension that runs FluentBit as a subprocess. It either sends the collected logs to the OpenTelemetry Collector using a Forward plugin or uses the outputs defined in the FluentBit configuration. We can use this extension to run only one agent Pod on each Node instead of two separate FluentBit and Collector Pods.
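A rough sketch of that setup, pairing the extension with the fluentforward receiver; the paths, port, and field names are taken from the contrib extension at the time of writing and may have changed:

```yaml
# FluentBit runs as a Collector subprocess and forwards logs back to the
# Collector's fluentforward receiver. Paths and ports are example values.
extensions:
  fluentbit:
    executable_path: /usr/local/bin/fluent-bit
    tcp_endpoint: 127.0.0.1:8006
    config: |
      [INPUT]
          Name tail
          Path /var/log/containers/*.log

receivers:
  fluentforward:
    endpoint: 127.0.0.1:8006

service:
  extensions: [fluentbit]
  pipelines:
    logs:
      receivers: [fluentforward]
      exporters: [otlp]
```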

Prometheus Exporter vs Prometheus Remote Write Exporter

There are two different exporters for Prometheus. The Prometheus Exporter exposes an endpoint for Prometheus to scrape the collected metrics, whereas the Prometheus Remote Write Exporter pushes the collected metrics directly to an external Prometheus-compatible backend such as Cortex. Using the Prometheus Remote Write Exporter would free us from having to run a Prometheus instance on the cluster.
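The two flavors side by side; both endpoint values are examples, not real targets:

```yaml
exporters:
  # Pull model: expose a scrape endpoint on the Collector itself.
  prometheus:
    endpoint: 0.0.0.0:8889
  # Push model: write directly to a Prometheus-compatible backend.
  prometheusremotewrite:
    endpoint: https://cortex.example.com/api/v1/push
```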

The OTLP Protocol

The OpenTelemetry Protocol (OTLP) specification describes the encoding, transport, and delivery mechanism for telemetry data between telemetry sources, intermediate nodes such as Collectors, and telemetry backends. With the sidecar or DaemonSet deployment mode, app-specific data is transformed into OTLP close to the source, so the Collector receives input in a consistent format. As a result, we can easily swap out the whole shipment part for something else.

Downsides

There are a lot of moving pieces, and there are warning messages on every page about possible changes and removals.

Example

The easiest way to see the Collector in action is to follow this blog post.

suleymanakbas91 commented 3 years ago

Results will be consolidated with the other PoC results in the community repo.