influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Inconsistent behaviour in Kubernetes with uuids stored in secrets and loaded as environment variables #5186

Closed guilhemmarchand closed 5 years ago

guilhemmarchand commented 5 years ago

Relevant telegraf.conf:

  telegraf.conf: |+
    [global_tags]
      env = "$ENV"
    [agent]
      hostname = "$POD_NAME"
    [[outputs.http]]
      url = "https://$SPLUNK_HEC_URL/services/collector"
      insecure_skip_verify = true
      data_format = "splunkmetric"
      splunkmetric_hec_routing = true
      [outputs.http.headers]
        Content-Type = "application/json"
        Authorization = "Splunk $SPLUNK_HEC_TOKEN"
        X-Splunk-Request-Channel = "$SPLUNK_HEC_TOKEN"
    # Kafka JVM monitoring
    [[inputs.jolokia2_agent]]
      name_prefix = "kafka_"
      urls = ["http://$POD_NAME:8778/jolokia"]
    [[inputs.jolokia2_agent.metric]]
      name         = "controller"
      mbean        = "kafka.controller:name=*,type=*"
      field_prefix = "$1."

System info:

Kubernetes deployment with latest Telegraf container version 1.9.1

Steps to reproduce:

  1. Create a pair of Kubernetes secret values. In my example, a Splunk URL is stored as one secret data entry and a Splunk HTTP Event Collector token, which is a uuid, is stored as a second secret data entry.

Ex:

https://github.com/guilhemmarchand/splunk-guide-for-kafka-monitoring/tree/master/kubernetes-yaml-examples/kafka-brokers

Generate the base64 values:

echo "splunk_hec.mydomain.com" | base64
c3BsdW5rX2hlYy5teWRvbWFpbi5jb20K
echo -n '65735c4b-f277-4f69-87ca-ff2b738c69f9' | base64
NjU3MzVjNGItZjI3Ny00ZjY5LTg3Y2EtZmYyYjczOGM2OWY5
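
Note that the two commands above are not symmetric: the first uses plain echo, which appends a newline before the value is encoded, while the second uses echo -n. A minimal Python sketch reproduces both encoded values and makes the difference visible (the helper name encode_secret is hypothetical and not part of any tool used in this thread):

```python
import base64


def encode_secret(value: str, trailing_newline: bool = False) -> str:
    """Base64-encode value, optionally adding the newline that plain echo appends."""
    raw = value + ("\n" if trailing_newline else "")
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


url = "splunk_hec.mydomain.com"
token = "65735c4b-f277-4f69-87ca-ff2b738c69f9"

# Equivalent to: echo "splunk_hec.mydomain.com" | base64
print(encode_secret(url, trailing_newline=True))
# Equivalent to: echo -n '65735c4b-f277-4f69-87ca-ff2b738c69f9' | base64
print(encode_secret(token))
# The same URL without the newline produces a different encoding
print(encode_secret(url))
```

The first two printed values match the ones shown above, so the URL secret in this example carries a trailing newline while the token secret does not.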

Create and apply your secrets:

../../yaml_git_ignored/splunk_secrets.yml
apiVersion: v1
kind: Secret
metadata:
  name: splunk-secrets
  namespace: kafka
type: Opaque
data:
  splunk_hec_url: "c3BsdW5rX2hlYy5teWRvbWFpbi5jb20K"
  splunk_hec_token: "NjU3MzVjNGItZjI3Ny00ZjY5LTg3Y2EtZmYyYjczOGM2OWY5"

Create:

kubectl create -f ../../yaml_git_ignored/splunk_secrets.yml

  2. Create a pod that uses the secrets

ConfigMap:

https://github.com/guilhemmarchand/splunk-guide-for-kafka-monitoring/blob/master/kubernetes-yaml-examples/kafka-brokers/01-telegraf-config-kafka-brokers.yml

example:

kind: ConfigMap
metadata:
  name: telegraf-config-kafka-brokers
  namespace: kafka
apiVersion: v1
data:

  telegraf.conf: |+
    [global_tags]
      env = "$ENV"
    [agent]
      hostname = "$POD_NAME"
    [[outputs.http]]
      url = "https://$SPLUNK_HEC_URL/services/collector"
      insecure_skip_verify = true
      data_format = "splunkmetric"
      splunkmetric_hec_routing = true
      [outputs.http.headers]
        Content-Type = "application/json"
        Authorization = "Splunk $SPLUNK_HEC_TOKEN"
        X-Splunk-Request-Channel = "$SPLUNK_HEC_TOKEN"
    # Kafka JVM monitoring
    [[inputs.jolokia2_agent]]
      name_prefix = "kafka_"
      urls = ["http://$POD_NAME:8778/jolokia"]
    [[inputs.jolokia2_agent.metric]]
      name         = "controller"
      mbean        = "kafka.controller:name=*,type=*"
      field_prefix = "$1."

Pod definition:

https://github.com/guilhemmarchand/splunk-guide-for-kafka-monitoring/blob/master/kubernetes-yaml-examples/kafka-brokers/04-patch-kafka-brokers-statefulset.yml

# meant to be applied using
# kubectl --namespace kafka patch statefulset kafka --patch "$(cat 04-patch-kafka-brokers-statefulset.yml )"
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: confluent-oss-cp-kafka
  namespace: kafka
spec:
  template:
    spec:
      containers:
      - name: telegraf
        image: docker.io/telegraf:latest
        resources:
          requests:
            cpu: 10m
            memory: 60Mi
          limits:
            memory: 120Mi
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SPLUNK_HEC_URL
          valueFrom:
            secretKeyRef:
              name: splunk-secrets
              key: splunk_hec_url
        - name: SPLUNK_HEC_TOKEN
          valueFrom:
            secretKeyRef:
              name: splunk-secrets
              key: splunk_hec_token
        volumeMounts:
        - name: telegraf-config-kafka-brokers
          mountPath: /etc/telegraf
      volumes:
      - name: telegraf-config-kafka-brokers
        configMap:
          name: telegraf-config-kafka-brokers

Expected behavior:

The container should start normally and be able to use the values from the secrets.

Actual behavior:

The container fails to start because of a toml parsing error that happens ONLY with the uuid value, which in my example is stored in the environment variable $SPLUNK_HEC_TOKEN.

E! [telegraf] Error running agent: Error parsing /etc/telegraf/telegraf.conf, toml: line xx: parse error

The environment variables are available in the container, as they should be:

env | grep SPLUNK
SPLUNK_HEC_TOKEN=65735c4b-f277-4f69-87ca-ff2b738c69f9
SPLUNK_HEC_URL=xxxxxxxxx.compute.amazonaws.com:8088
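
Output like this cannot show whether a value ends in a newline or other whitespace: the terminal renders it invisibly, and grep drops the resulting empty line. One way to check, sketched here in Python under the assumption that an interpreter is available in the container (the helper name describe_env is hypothetical), is to print the repr of each value:

```python
import os


def describe_env(name: str) -> str:
    """Describe an environment variable in a way that exposes hidden whitespace."""
    value = os.environ.get(name)
    if value is None:
        return f"{name} is not set"
    # repr() renders a trailing newline as \n instead of hiding it
    has_trailing_ws = value != value.rstrip()
    return f"{name}={value!r} (trailing whitespace: {has_trailing_ws})"


if __name__ == "__main__":
    for name in ("SPLUNK_HEC_URL", "SPLUNK_HEC_TOKEN"):
        print(describe_env(name))
```

A value reported with trailing whitespace would be a candidate explanation for a parse error that only appears after substitution.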

Using $SPLUNK_HEC_URL in my example is not an issue, but as soon as I try to use the uuid stored in $SPLUNK_HEC_TOKEN, this results in a toml parsing error.

What is VERY strange is that if I re-export the environment variable:

export SPLUNK_HEC_TOKEN="65735c4b-f277-4f69-87ca-ff2b738c69f9"

And then run Telegraf against a temporary copy of telegraf.conf, it works correctly.

This issue is ONLY happening at container startup when the environment variable is defined automatically by Kubernetes AND when the environment variable contains a uuid value.

This can be reproduced by just loading the variable in the [global_tags] section.

THIS WORKS OK:

[global_tags]
  env = "$ENV"
  splunk_hec_url = "$SPLUNK_HEC_URL"
[agent]
  hostname = "$POD_NAME"
[[outputs.http]]
  ## URL is the address to send metrics to
  url = "https://xxxxxxxxxxxxxxx.compute.amazonaws.com:8088/services/collector"
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = true
  data_format = "splunkmetric"
  splunkmetric_hec_routing = true
  ## Additional HTTP headers
  [outputs.http.headers]
    Content-Type = "application/json"
    Authorization = "Splunk 205d43f1-2a31-4e60-a8b3-327eda49944a"
    X-Splunk-Request-Channel = "205d43f1-2a31-4e60-a8b3-327eda49944a"

  # zookeeper metrics
  [[inputs.zookeeper]]
    servers = ["$POD_NAME:2181"]

MANUAL RUN:

root@confluent-oss-cp-zookeeper-2:/# telegraf --config /tmp/telegraf.conf --test
2018-12-24T10:20:33Z I! Starting Telegraf 1.9.1
> zookeeper,env=$ENV,host=confluent-oss-cp-zookeeper-2,port=2181,server=confluent-oss-cp-zookeeper-2,splunk_hec_url=xxxxxxxxxxxxxxx-2.compute.amazonaws.com:8088,state=leader approximate_data_size=20081i,avg_latency=2i,ephemerals_count=4i,followers=2i,fsync_threshold_exceed_count=0i,last_proposal_size=-1i,max_file_descriptor_count=1048576i,max_latency=39i,max_proposal_size=-1i,min_latency=0i,min_proposal_size=-1i,num_alive_connections=2i,open_file_descriptor_count=124i,outstanding_requests=0i,packets_received=45i,packets_sent=44i,pending_syncs=0i,synced_followers=2i,version="3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03",watch_count=0i,znode_count=216i 1545646834000000000

THIS FAILS:

[global_tags]
  env = "$ENV"
  splunk_hec_token = "$SPLUNK_HEC_TOKEN"
[agent]
  hostname = "$POD_NAME"
[[outputs.http]]
  ## URL is the address to send metrics to
  url = "https://xxxxxxxxxxxxxxxx.compute.amazonaws.com:8088/services/collector"
  ## Use TLS but skip chain & host verification
  insecure_skip_verify = true
  data_format = "splunkmetric"
  splunkmetric_hec_routing = true
  ## Additional HTTP headers
  [outputs.http.headers]
    Content-Type = "application/json"
    Authorization = "Splunk 205d43f1-2a31-4e60-a8b3-327eda49944a"
    X-Splunk-Request-Channel = "205d43f1-2a31-4e60-a8b3-327eda49944a"

  # zookeeper metrics
  [[inputs.zookeeper]]
    servers = ["$POD_NAME:2181"]

MANUAL START:

root@confluent-oss-cp-zookeeper-2:/# telegraf --config /tmp/telegraf.conf --test
2018-12-24T10:21:45Z I! Starting Telegraf 1.9.1
2018-12-24T10:21:45Z E! [telegraf] Error running agent: Error parsing /tmp/telegraf.conf, toml: line 4: parse error

THIS WORKS IF VARIABLE IS MANUALLY RE-EXPORTED:

root@confluent-oss-cp-zookeeper-2:/# export SPLUNK_HEC_TOKEN=65735c4b-f277-4f69-87ca-ff2b738c69f9
root@confluent-oss-cp-zookeeper-2:/# env | grep SPLUNK
SPLUNK_HEC_TOKEN=65735c4b-f277-4f69-87ca-ff2b738c69f9
SPLUNK_HEC_URL=xxxxxxxxxxxxxx.compute.amazonaws.com:8088
root@confluent-oss-cp-zookeeper-2:/# telegraf --config /tmp/telegraf.conf --test
2018-12-24T10:22:37Z I! Starting Telegraf 1.9.1
> zookeeper,env=$ENV,host=confluent-oss-cp-zookeeper-2,port=2181,server=confluent-oss-cp-zookeeper-2,splunk_hec_token=65735c4b-f277-4f69-87ca-ff2b738c69f9,state=leader approximate_data_size=20081i,avg_latency=3i,ephemerals_count=4i,followers=2i,fsync_threshold_exceed_count=0i,last_proposal_size=-1i,max_file_descriptor_count=1048576i,max_latency=74i,max_proposal_size=-1i,min_latency=0i,min_proposal_size=-1i,num_alive_connections=2i,open_file_descriptor_count=124i,outstanding_requests=0i,packets_received=70i,packets_sent=69i,pending_syncs=0i,synced_followers=2i,version="3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03",watch_count=0i,znode_count=216i 1545646958000000000

Additional info:

glinton commented 5 years ago

Can you verify there aren't any newlines getting appended to the uuid when it's initially generated/set? Since the toml parse error is on line 4, it seems like the value is taking two lines.

Reproducible with the following:

[global_tags]
  env = "$ENV"
  splunk_hec_token = "1234-1234-1234-1234
"
[agent]
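
Why a trailing newline would break parsing: a TOML basic string delimited by single double-quote characters may not contain a literal line break, so a substituted value ending in a newline pushes the closing quote onto the next line. The following Python sketch imitates the substitution step. The regular expression and the helper name substitute are assumptions for illustration, not Telegraf's actual code:

```python
import re

# Assumption: a simple $NAME pattern approximating Telegraf's substitution
ENV_VAR_RE = re.compile(r"\$(\w+)")


def substitute(config: str, env: dict) -> str:
    """Replace $NAME with env[NAME]; leave unset variables untouched."""
    return ENV_VAR_RE.sub(lambda m: env.get(m.group(1), m.group(0)), config)


config = (
    "[global_tags]\n"
    '  env = "$ENV"\n'
    '  splunk_hec_token = "$SPLUNK_HEC_TOKEN"\n'
    "[agent]\n"
)

clean = substitute(config, {"SPLUNK_HEC_TOKEN": "1234-1234-1234-1234"})
dirty = substitute(config, {"SPLUNK_HEC_TOKEN": "1234-1234-1234-1234\n"})

print(len(clean.splitlines()))      # the string stays on line 3
print(len(dirty.splitlines()))      # one extra line after substitution
print(repr(dirty.splitlines()[3]))  # line 4 holds only the closing quote
```

With the newline present, line 4 of the substituted text contains only the closing quote, which is consistent with the line 4 reported in the parse error.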

guilhemmarchand commented 5 years ago

Hi @glinton

Thanks, that is very suspicious and your explanation would make sense, so I re-tested. Unless I am really missing something here, I still see the same issue.

If I move from secrets to a simple ConfigMap, then I have no issues at all, for example:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: kafka
  name: global-config
data:
  env: my-environment
  splunk_hec_url: xxxxxxxxx.eu-west-2.compute.amazonaws.com:8088
  splunk_hec_token: 205d43f1-2a31-4e60-a8b3-327eda49944a

Note: With or without an extra line at the end of the yaml file, I see no change at all.

Then in my pod definition:

# meant to be applied using
# kubectl --namespace kafka patch statefulset zookeeper --patch "$(cat 02-patch-zookeeper-statefulset.yml )"
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: confluent-oss-cp-zookeeper
  namespace: kafka
spec:
  template:
    spec:
      containers:
      - name: telegraf
        image: docker.io/telegraf:latest
        resources:
          requests:
            cpu: 10m
            memory: 60Mi
          limits:
            memory: 120Mi
        env:
        - name: ENV
          valueFrom:
            configMapKeyRef:
              name: global-config
              key: env
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: SPLUNK_HEC_URL
          valueFrom:
            configMapKeyRef:
              name: global-config
              key: splunk_hec_url
        - name: SPLUNK_HEC_TOKEN
          valueFrom:
            configMapKeyRef:
              name: global-config
              key: splunk_hec_token
        volumeMounts:
        - name: telegraf-config-zookeeper
          mountPath: /etc/telegraf
      volumes:
      - name: telegraf-config-zookeeper
        configMap:
          name: telegraf-config-zookeeper

This works perfectly fine.

If I only move my "splunk_hec_token" value into a secret:

apiVersion: v1
kind: Secret
metadata:
  name: splunk-secrets
  namespace: kafka
type: Opaque
data:
  splunk_hec_token: MjA1ZDQzZjEtMmEzMS00ZTYwLWE4YjMtMzI3ZWRhNDk5NDRhCg==

Note: With or without an extra line at the end of the yaml file, nothing changes.

Then the pod will fail and I get the toml parsing issue.

I have triple checked and I can't find an explanation.
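
One thing that can be checked directly from the Secret shown above is what its data value decodes to. A short Python sketch (the variable names are illustrative only):

```python
import base64

# splunk_hec_token value copied from the Secret manifest in this comment
encoded = "MjA1ZDQzZjEtMmEzMS00ZTYwLWE4YjMtMzI3ZWRhNDk5NDRhCg=="

decoded = base64.b64decode(encoded)
print(repr(decoded))  # repr() makes any trailing newline visible

# Re-encode without surrounding whitespace to get a value safe to paste back
cleaned = base64.b64encode(decoded.strip()).decode("ascii")
print(cleaned)
```

As written in this thread, the value decodes to the uuid followed by a newline character (the final Cg== group), unlike the plain ConfigMap value. Generating the value with echo -n, as in the original steps, avoids this. Whether this is what happened in the actual cluster is not confirmed in the thread.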

danielnelson commented 5 years ago

I think a build with some debug logging could help with understanding what is happening. @guilhemmarchand, are you able to compile Telegraf if I show you what to change?

guilhemmarchand commented 5 years ago

Hi @danielnelson

Sure thing, yes. If I can, I will be happy to help.

danielnelson commented 5 years ago

If you apply this patch then Telegraf will spit out the environment variables and each config file after env var replacement as they are parsed:

diff --git a/internal/config/config.go b/internal/config/config.go
index 469b80ad..596a1a17 100644
--- a/internal/config/config.go
+++ b/internal/config/config.go
@@ -785,12 +785,14 @@ func parseConfig(contents []byte) (*ast.Table, error) {
        env_vars := envVarRe.FindAll(contents, -1)
        for _, env_var := range env_vars {
                env_val, ok := os.LookupEnv(strings.TrimPrefix(string(env_var), "$"))
+               fmt.Printf("%s=%s (set: %t)\n", env_var, env_val, ok)
                if ok {
                        env_val = escapeEnv(env_val)
                        contents = bytes.Replace(contents, env_var, []byte(env_val), 1)
                }
        }

+       fmt.Println(string(contents))
        return toml.Parse(contents)
 }

danielnelson commented 5 years ago

Closing, but please let me know if you weren't able to debug this issue.