vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.51k stars 1.53k forks

Add `kubernetes_metadata` transform #5077

Open MOZGIII opened 3 years ago

MOZGIII commented 3 years ago

Motivation

We already have a kubernetes_logs source that collects the Pod logs in the Kubernetes environment, and it covers all of the common use cases.

However, the Kubernetes ecosystem is huge, and advanced users often have uncommon use cases. We can't possibly provide first-class support for all of them, but we can empower users with the right tools to tailor Vector to their unique needs.

The main concern users have in the Kubernetes environment in relation to log events is enriching the events with relevant data from the Kubernetes state - things like the name of the Pod the event originates from. As mentioned above, this is already covered by the kubernetes_logs source - but only for events coming from that source.

So far, we have recognized a number of cases that we want to support, but don't want to include in the kubernetes_logs source:

  1. Sidecar deployments

    Deploying Vector as a sidecar (a secondary container within a Pod). This is usually done when an app doesn't write its logs to stdout and uses files instead. In this operation mode, Vector would typically be used with a file source, and would have to fetch the information about the Pod it runs in (and only that Pod!) from the kube-apiserver and annotate all the events from the file source with the Pod metadata.

    Refs:

  2. Cluster that uses journald for logs

    When using Docker as the Kubernetes container runtime, it is possible to configure Docker to use the journald log driver. With this configuration, logs won't be available as files on disk, and the kubernetes_logs source won't be usable. There are countless possible non-standard configurations like this, so we don't want to support them in the kubernetes_logs source - first of all to keep things simple for users on the standard use case, but also because it is virtually impossible to support all of the configurations, and adding that flexibility would significantly increase the required maintenance effort. In other words - supporting this use case via a transform makes the most sense.

    In this operation mode, Vector is deployed on each node (the recommended way is still to do it via the vector-agent Helm chart in this case), and a journald source is used in conjunction with an annotating transform. This way, an outcome similar to using the kubernetes_logs source can be achieved without actually using it.

    Refs:

Requirements

To be able to cover all of the use cases, we have to build a solution that is quite flexible.

To reduce the load, we should use the same state-sync architecture that we use in the kubernetes_logs source; however, it needs to be more user-configurable and thus more generic at the code level.
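The state-sync idea can be sketched roughly as follows: a local cache is kept up to date by applying watch events as they arrive, so annotation lookups never have to hit kube-apiserver. This is a minimal stdlib-only sketch; the event and cache types are hypothetical stand-ins, not the actual Vector internals.

```rust
use std::collections::HashMap;

// Hypothetical watch events, mirroring the Kubernetes watch API verbs.
enum WatchEvent {
    Added(String, String),    // (object name, metadata payload)
    Modified(String, String), // (object name, updated metadata payload)
    Deleted(String),          // (object name)
}

// A minimal in-memory state cache, kept in sync by applying watch
// events as they arrive. Lookups are local and cheap.
#[derive(Default)]
struct StateCache {
    objects: HashMap<String, String>,
}

impl StateCache {
    fn apply(&mut self, event: WatchEvent) {
        match event {
            WatchEvent::Added(name, meta) | WatchEvent::Modified(name, meta) => {
                self.objects.insert(name, meta);
            }
            WatchEvent::Deleted(name) => {
                self.objects.remove(&name);
            }
        }
    }

    fn lookup(&self, name: &str) -> Option<&String> {
        self.objects.get(name)
    }
}
```

Making this generic and user-configurable (which resource to watch, which fields to index) is the part that goes beyond what kubernetes_logs does today.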

We want to support:

What would be great to support eventually:

Design considerations

Configurability and defaults

The solution has to be very configurable, aimed at advanced users, and designed to cover edge use cases, which means there are very few defaults we could sanely apply. This is in contrast to the kubernetes_logs source, which was designed to work out of the box with minimal configuration and be a solid solution to the single most common use case.

We should still at least try to make the configuration as easy and intuitive as possible.

Use of generics

Due to the nature of the task, we'll likely have to build most of the code around generic primitives like k8s_openapi::Resource and serde::de::DeserializeOwned (not sure if the names are precise but you get the idea), rather than using concrete types like k8s_openapi::[...]::Pod.
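To illustrate the shape of this, here is a sketch written against a stand-in trait rather than the real k8s_openapi crate (the trait name and methods below are illustrative assumptions, not the actual crate API): generic code is bound by the trait, and concrete resources like Pod are just implementations plugged in later.

```rust
// Illustrative stand-in for the kind of trait bound the generic code
// would use (the real code would bound on k8s_openapi::Resource and
// serde::de::DeserializeOwned instead).
trait Resource {
    const KIND: &'static str; // e.g. "Pod", "Namespace"
    fn url_path() -> String;  // URL path segment for this resource
}

// Generic plumbing is written once, against the trait bound, rather
// than against a concrete type like Pod.
fn watch_url<R: Resource>(namespace: &str) -> String {
    format!("/api/v1/namespaces/{}/{}?watch=true", namespace, R::url_path())
}

// A concrete resource plugs in without changing the generic code.
struct Pod;
impl Resource for Pod {
    const KIND: &'static str = "Pod";
    fn url_path() -> String {
        "pods".into()
    }
}
```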

QA

To ensure proper quality, we'll have to cover the implementation with both E2E and unit tests.

The E2E tests can consist of just two cases as a start:

  1. A simple case of reading files via the file source and annotating the events, similar to the kubernetes_logs source.
  2. A test simulating the sidecar deployment, where Vector is configured to generate events and annotate them with the Kubernetes state of its own Pod.

Ideally, we'd want to have more test scenarios, but we can add more as we go.

Proposed implementation plan

  1. Implement a generic resource (AnyResource) to be able to work with arbitrary Kubernetes resources. It must implement k8s_openapi::Resource and serde::de::DeserializeOwned.
  2. Implement a generic, configurable watch request builder to build arbitrary watch requests to the Kubernetes API as configured by the user.
  3. Implement a state layer that would allow quick lookups by the user-configurable lookup fields (aka configurable indexer).
  4. Implement a configurable annotator to fill in arbitrary user-configured fields from an arbitrary resource into the event.
  5. Implement configurable event dropping: a mechanism that allows users to drop events based on predicate rules.
  6. Tie all this together in a transform.
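Steps 3 and 4 above (the configurable indexer and annotator) could look roughly like this. All names and the flat string-map event shape are hypothetical simplifications for illustration, not Vector's actual event model.

```rust
use std::collections::HashMap;

// Simplified stand-in for a log event: a flat map of string fields.
type Event = HashMap<String, String>;

// Configurable indexer: state lookups keyed by a user-chosen field.
struct Index {
    key_field: String, // e.g. "pod_name" - which event field to look up by
    // lookup key value -> metadata fields for that resource
    entries: HashMap<String, HashMap<String, String>>,
}

// Configurable annotator: copy the user-configured metadata fields
// into the event. Returns false when no matching state was found.
fn annotate(event: &mut Event, index: &Index, fields: &[&str]) -> bool {
    let key = match event.get(&index.key_field) {
        Some(k) => k.clone(),
        None => return false,
    };
    let meta = match index.entries.get(&key) {
        Some(m) => m,
        None => return false,
    };
    for f in fields {
        if let Some(v) = meta.get(*f) {
            event.insert(format!("kubernetes.{}", f), v.clone());
        }
    }
    true
}
```

The transform would wire this lookup-and-annotate step to the synced state cache, with the key field and the copied fields both coming from user configuration.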

Open questions

  1. Do we implement our own custom predicate logic for dropping events, or do we leave it out and ask users to use the reduce/remap transforms, etc.?
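If we did build custom predicate logic, one possible (purely hypothetical) rule shape is "drop when a field equals a value", which keeps the configuration declarative:

```rust
use std::collections::HashMap;

// Simplified stand-in for a log event: a flat map of string fields.
type Event = HashMap<String, String>;

// One hypothetical shape for a user-configured drop rule: drop the
// event when the named field equals the given value.
struct DropRule {
    field: String,
    equals: String,
}

// Drop if any rule matches the event.
fn should_drop(event: &Event, rules: &[DropRule]) -> bool {
    rules
        .iter()
        .any(|r| event.get(&r.field).map(|v| v == &r.equals).unwrap_or(false))
}
```

The trade-off is exactly the open question above: a rule engine like this duplicates what remap can already express, at the cost of maintaining a second predicate language.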
binarylogic commented 3 years ago

Closing, see https://github.com/timberio/vector/pull/5317#issuecomment-762377958.

jszwedko commented 5 months ago

Reopening this just to track additional use-cases / reports like https://github.com/vectordotdev/vector/issues/20366