vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

configuration: ability to populate variables from files #9941

Open hhromic opened 2 years ago

hhromic commented 2 years ago

Current Vector Version

vector 0.17.3 (x86_64-unknown-linux-gnu d72c6e7 2021-10-21)

Use-cases

Sometimes, Vector pipelines require the use of sensitive data (secrets), often in sink configurations. In containerised applications, orchestrators very frequently provide these secrets as files mounted in special locations for the application to read, e.g. /run/secrets/some-secret.

At the moment, Vector itself does not provide general native functionality for reading secrets from files. Instead, secrets must either be hard-coded into the pipeline configuration or supplied via environment variable interpolation (provided that these variables are pre-populated from files before starting Vector).

Note that some sinks DO provide support for reading credentials from files (for example, AWS- and GCP-based sinks). However, other sink configurations involve secrets that cannot be provided via files. Below are some examples.

Sinks using access keys:

Sinks using passwords:

Sinks using tokens:

Sinks using TLS:

Attempted Solutions

A very common solution is to use environment variables that are pre-populated from files before starting Vector. Then, these variables can be interpolated wherever they are necessary in the Vector pipeline configuration.

For example, the following script could be used as a Docker image entrypoint:

#!/usr/bin/env bash
set -euo pipefail

# Populate secret variables from files, if file paths were provided.
if [[ -n ${VECTOR_SINK_TOKEN_FILE:-} ]]; then
  export VECTOR_SINK_TOKEN=$(< "$VECTOR_SINK_TOKEN_FILE")
fi
if [[ -n ${VECTOR_SINK_TLS_KEY_PASS_FILE:-} ]]; then
  export VECTOR_SINK_TLS_KEY_PASS=$(< "$VECTOR_SINK_TLS_KEY_PASS_FILE")
fi

exec /path/to/vector "$@"

Afterwards, these variables can simply be used in the pipeline:

sinks:
  http:
    auth:
      token: ${VECTOR_SINK_TOKEN}
    tls:
      key_pass: ${VECTOR_SINK_TLS_KEY_PASS}

While this works, it is clearly not very scalable and, more importantly, it requires building a custom Docker image for Vector that includes a pre-population entrypoint script like the one above.

Proposal

One solution could be to implement configuration options in the relevant sinks that take filenames from which to read the secrets, much like some sinks already do for credential files.

However, I believe that a better (more scalable) solution would be for Vector to read configuration variables from files natively. For example, scan the environment for variables suffixed with _FILE and, for each one, read the file its value points to and populate a variable named without the suffix. In other words, if the environment contains SOME_TOKEN_FILE=/run/secrets/token, read that file and populate a variable named SOME_TOKEN (if it doesn't already exist). This variable can then be interpolated like any other variable in the pipeline configuration. This is essentially what the wrapper script shown above does.
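To illustrate the convention concretely, here is a minimal bash sketch of the scan-and-populate step described above. The helper name populate_from_files is hypothetical, not part of Vector; a native implementation would do this at config load time instead of in the shell:

```shell
#!/usr/bin/env bash
# Sketch of the proposed convention: for every exported variable ending in
# _FILE, read the file its value points to and export the contents under the
# suffix-less name, unless that name is already set.
populate_from_files() {
  local var target
  # compgen -e lists the names of all exported variables.
  for var in $(compgen -e); do
    [[ $var == *_FILE ]] || continue
    target=${var%_FILE}
    if [[ -z ${!target:-} && -f ${!var} && -r ${!var} ]]; then
      export "$target=$(< "${!var}")"
    fi
  done
}
```

For example, with SOME_TOKEN_FILE=/run/secrets/token in the environment, calling populate_from_files would export SOME_TOKEN with that file's contents.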

An alternative to the above approach could be to adopt instead the variable expansion approach from Grafana. In their approach, they have separated "environment" and "file" providers for variables:

Thus, a convention such as $__file{PATH_TO_FILE} could be adopted to easily and explicitly indicate where in the pipeline configuration the contents of a file should be used. Furthermore, if in the future Vector is interested in integrating with secret stores such as HashiCorp Vault (as Grafana already does), then this approach would be a good fit as well.
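For instance, the earlier sink configuration could hypothetically be written like this, with no pre-populated environment variables at all (the $__file{} syntax is borrowed from Grafana here, not something Vector currently supports, and the secret paths are made up for illustration):

```yaml
sinks:
  http:
    auth:
      token: $__file{/run/secrets/http_token}
    tls:
      key_pass: $__file{/run/secrets/tls_key_pass}
```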

References

vimalk78 commented 2 years ago

Using environment variables to populate access keys, passwords, and other such values is more of a security issue than a scalability issue.

I like Grafana's variable expansion approach using the file provider.

JeanMertz commented 2 years ago

requires building a custom Docker image for Vector which includes a pre-population entrypoint script like the above.

It might be worth tracking this in a separate issue. I've seen images that allow you to provide a custom command, such that you can mount an init script, run it, and have it run Vector itself.

vimalk78 commented 2 years ago

In Kubernetes environments, the container's environment can be created at deploy time by a Kubernetes operator, so we don't really need a custom image.

hhromic commented 2 years ago

Kubernetes is not the only orchestrator around. For instance, in our team we use Docker in Swarm mode which is perfect for our use cases. I strongly suggest that any feature related to this in Vector remains as independent of Kubernetes as possible.

hhromic commented 2 years ago

requires building a custom Docker image for Vector which includes a pre-population entrypoint script like the above.

It might be worth tracking this in a separate issue. I've seen images that allow you to provide a custom command, such that you can mount an init script, run it, and have it run Vector itself.

Implementing a wrapper script is just one solution that I wanted to use as example.

After giving this more thought over the weekend, I'm getting convinced that the Grafana approach (having "providers" for the values of variables) is actually very flexible and powerful, and relatively simple to implement, imho.

How about going into this direction?

hhromic commented 2 years ago


I would rather consider this issue to be more about usability than security. If a "providers" approach is decided upon, it will have the added benefit of providing secure means to pass secrets (e.g. vaults), which is nice. But in my opinion, the real added value of the feature is usability, i.e. making it easier to obtain values for variables from sources other than just the environment.

binarylogic commented 2 years ago

@hhromic thank you for this excellent issue. We are prioritizing this work and plan to have it done in March. We're going to start with the same approach that the Datadog Agent takes. It seems similar to what you've outlined here. Thoughts on that?

hhromic commented 2 years ago

@binarylogic thanks for making this happen! Our team is increasingly finding the need to read secrets from files, and this would come in very handy for us.

I took a look at the Datadog Agent secret management system you pointed out, and it looks quite neat and flexible! Very powerful indeed. I think the ability to run scripts/executables to retrieve secrets can cover practically any kind of setup.
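As I understand the Datadog Agent's protocol, the agent writes a JSON request such as {"version": "1.0", "secrets": ["handle1", ...]} to the executable's stdin and expects a JSON object mapping each handle to {"value": ..., "error": ...} on stdout. A rough bash sketch of such a backend, resolving each handle to a file under a directory, could look like this (the resolve_secrets name and the SECRETS_DIR layout are my own assumptions for illustration; requires jq):

```shell
#!/usr/bin/env bash
# Sketch of a Datadog-Agent-style secret backend executable: read a JSON
# request from stdin and answer with a JSON object of resolved secrets.
SECRETS_DIR=${SECRETS_DIR:-/run/secrets}

resolve_secrets() {
  local input result handle file
  input=$(cat)   # JSON request from stdin
  result='{}'
  while IFS= read -r handle; do
    file="$SECRETS_DIR/$handle"
    if [[ -f $file && -r $file ]]; then
      # Add {handle: {value: <file contents>, error: null}} to the result.
      result=$(jq --arg h "$handle" --arg v "$(< "$file")" \
        '. + {($h): {value: $v, error: null}}' <<< "$result")
    else
      result=$(jq --arg h "$handle" \
        '. + {($h): {value: null, error: "cannot read secret file"}}' <<< "$result")
    fi
  done < <(jq -r '.secrets[]' <<< "$input")
  echo "$result"
}
```

Something along these lines would let a single external executable serve file-based secrets to any part of the configuration, which is exactly the kind of flexibility this issue is after.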

It would be nice if the Vector Docker images could come ready with the same helper script for autodiscovery, namely:

Starting with version 7.32.0, the helper script is available in the Docker image as /readsecret_multiple_providers.sh, and you can use it to fetch secrets from files and Kubernetes secrets.

That script would be really helpful for many common scenarios.