open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

metricstransform Regexp Config Causes Startup Failure in v0.107.0 #3262

Open · Zurvarian opened this issue 3 months ago

Zurvarian commented 3 months ago

Describe the bug

Same as https://github.com/open-telemetry/opentelemetry-collector/issues/10585, but I'm reproducing the issue with version v0.107.0, which supposedly includes a fix for it.

One thing to note is that I'm running the Collector as a sidecar created by the Operator; I'm not sure whether that affects it somehow.

Steps to reproduce

Create a sidecar with the Operator using a config that performs variable replacement in a regexp (a capture-group reference such as $${1}).

What did you expect to see?

The variable is replaced correctly

What did you see instead?

An error like cannot resolve the configuration: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

What version did you use?

v0.107.0

What config did you use?

processors:
  metricstransform:
    transforms:
      - include: (.*)$$
        match_type: regexp
        action: update
        new_name: knative.dev_eventing.$${1}
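
For reference, a minimal standalone Go sketch of what this transform is intended to do once the escaping is resolved, assuming the processor applies Go-style regexp capture-group expansion (the metric name "event_count" is a made-up example; this is not the metricstransform processor's actual code):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // After the escaping is resolved, the processor should effectively apply
    // the pattern `(.*)$` with the replacement `knative.dev_eventing.${1}`.
    re := regexp.MustCompile(`(.*)$`)
    oldName := "event_count" // hypothetical incoming metric name
    newName := re.ReplaceAllString(oldName, "knative.dev_eventing.${1}")
    fmt.Println(newName) // knative.dev_eventing.event_count
}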

Environment

Linux in GCP.

Additional context

TylerHelmuth commented 3 months ago

Please try v0.108.0

Zurvarian commented 3 months ago

Will do.

Zurvarian commented 3 months ago

Trying with version v0.108.0 of contrib fixed the issue, thanks!

Zurvarian commented 3 months ago

UPDATE: Sorry, something was misconfigured in my OTEL Collector and that's why it looked like it worked.

It is still failing

TylerHelmuth commented 3 months ago

@Zurvarian your configuration works for me with v0.108.0 when I try it. Please retry your test, ensuring that the version being used is v0.108.0. If you are building a custom collector binary using the OpenTelemetry Collector Builder (ocb), make sure you are using version 0.108.0 of that as well.

Zurvarian commented 3 months ago

Hello,

I've verified the versions and checked everything and I get the same results. To add more context:

This is the sidecar pod spec, where you can see the Collector distribution and version I'm using (otel/opentelemetry-collector-contrib:0.108.0):

 - args:
    - --feature-gates=-component.UseLocalHostAsDefaultHost
    - --config=env:OTEL_CONFIG
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OTEL_CONFIG
      value: |
        receivers:
          prometheus_simple:
            collection_interval: 10s
            endpoint: localhost:9090
            use_service_account: true
        exporters:
          otlp:
            endpoint: opentelemetry-collector.monitoring:4317
            tls:
              insecure: true
        processors:
          metricstransform:
            transforms:
              - action: update
                include: (.*)$$
                match_type: regexp
                new_name: knative.dev_eventing.$${1}
        service:
          pipelines:
            metrics:
              exporters:
                - otlp
              processors:
                - metricstransform
              receivers:
                - prometheus_simple
    - name: OTEL_RESOURCE_ATTRIBUTES_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: OTEL_RESOURCE_ATTRIBUTES_POD_UID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.uid
    - name: OTEL_RESOURCE_ATTRIBUTES_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: k8s.deployment.name=kafka-broker-dispatcher,k8s.deployment.uid=f5ef223e-4ca2-4264-bfe6-26d712629b9e,k8s.namespace.name=knative-eventing,k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=$(OTEL_RESOURCE_ATTRIBUTES_POD_UID),k8s.replicaset.name=kafka-broker-dispatcher-796dfff45d,k8s.replicaset.uid=960361fe-ae94-4edf-9dbd-c0e7eff29b72
    image: otel/opentelemetry-collector-contrib:0.108.0
    imagePullPolicy: IfNotPresent
    name: otc-container
    ports:
    - containerPort: 8888
      name: metrics
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-9x4sj
      readOnly: true

This is the log I'm getting back: otc-container-logs

And this is the Collector configuration that was applied:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"opentelemetry.io/v1beta1","kind":"OpenTelemetryCollector","metadata":{"annotations":{},"name":"knative-eventing-otel-collector","namespace":"knative-eventing"},"spec":{"config":{"exporters":{"otlp":{"endpoint":"opentelemetry-collector.monitoring:4317","tls":{"insecure":true}}},"processors":{"metricstransform":{"transforms":[{"action":"update","include":"(.*)37829","match_type":"regexp","new_name":"knative.dev_eventing.37829{1}"}]}},"receivers":{"prometheus_simple":{"collection_interval":"10s","endpoint":"localhost:9090","use_service_account":true}},"service":{"pipelines":{"metrics":{"exporters":["otlp"],"processors":["metricstransform"],"receivers":["prometheus_simple"]}}}},"mode":"sidecar"}}
  creationTimestamp: "2024-08-28T12:37:35Z"
  finalizers:
  - opentelemetrycollector.opentelemetry.io/finalizer
  generation: 2
  labels:
    app.kubernetes.io/managed-by: opentelemetry-operator
  name: knative-eventing-otel-collector
  namespace: knative-eventing
  resourceVersion: "686767636"
  uid: 24f3f157-4696-4dc3-857f-1a1d2b53ff09
spec:
  args:
    feature-gates: -component.UseLocalHostAsDefaultHost
  config:
    exporters:
      otlp:
        endpoint: opentelemetry-collector.monitoring:4317
        tls:
          insecure: true
    processors:
      metricstransform:
        transforms:
        - action: update
          include: (.*)$$
          match_type: regexp
          new_name: knative.dev_eventing.$${1}
    receivers:
      prometheus_simple:
        collection_interval: 10s
        endpoint: localhost:9090
        use_service_account: true
    service:
      pipelines:
        metrics:
          exporters:
          - otlp
          processors:
          - metricstransform
          receivers:
          - prometheus_simple
  configVersions: 3
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: sidecar
  observability:
    metrics: {}
  podDnsConfig: {}
  replicas: 1
  resources: {}
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      scrapeInterval: 30s
    resources: {}
  upgradeStrategy: automatic
status:
  version: 0.107.0

Note that the version here says 0.107.0. That's the OpenTelemetry Operator version, the latest as of yesterday. However, I'm forcing version 0.108.0 of the Collector in the Helm installation of the Operator.

Thanks in advance.

Zurvarian commented 3 months ago

Something important to remark is that I'm using the Contrib release, as opposed to the default one, because I need some of the extra components, like prometheus_simple above.

Could it be that the Contrib version lacks the fix for some reason? Should I open a ticket in the Contrib repository instead to investigate further?

Zurvarian commented 3 months ago

More testing here:

I forced v0.104.0 with the feature gate -confmap.unifyEnvVarExpansion, as recommended in the old ticket, and the issue was still there. Then I tried version v0.100.0 and it works.

Zurvarian commented 3 months ago

Adding even more context:

Something is off with the transform I'm using: with version v0.100.0 it does not fail, but looking at the debug logs of the collector I can see that the captured value is not appended. That means either my config is wrong somehow or the transform is unable to capture groups with the regexp :/

bart-braidwell commented 3 months ago

I can see the same issue:

Error: failed to get config: cannot resolve the configuration: environment variable "1" 
has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/09/02 10:38:25 collector server run finished with error: 
failed to get config: cannot resolve the configuration: environment variable "1" 
has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

with this config:

    metricstransform/test:
      transforms:
        - include: test_([^_]+)_(.*)$$
          match_type: regexp
          action: update
          new_name: test.$${1}_$${2}

Replacing $${1} or $${2} with any normal string works.

I'm using otel contrib v0.108.0. I tried to disable the feature gate, but at this version it's already stable, so it cannot be disabled.

TylerHelmuth commented 3 months ago

@Zurvarian @bart-braidwell What collector distribution are you using? If you are building your own, please confirm your version of ocb and share your manifest file.

Zurvarian commented 3 months ago

@TylerHelmuth I'm using the default contrib: otel/opentelemetry-collector-contrib@v0.108.0

GH Repo: https://github.com/open-telemetry/opentelemetry-collector-releases/tree/main/distributions/otelcol-contrib

TylerHelmuth commented 3 months ago

@Zurvarian Are you sure the collector version you tried with is 0.108.0? The operator hasn't been updated yet, so you need to set the image either via the OpenTelemetryCollector resource or via the operator startup flags.

Zurvarian commented 3 months ago

@TylerHelmuth let me share the values.yaml I'm using to install. Originally it used version 0.107.0, but when you asked me to try 0.108.0 I changed the installation to override the version that came with the template:

opentelemetry-operator:
  # Fullname override is required as otherwise the Operator Deployment will be named after the Helm release.
  fullnameOverride: cloudprogs-opentelemetry-operator
  manager:
    collectorImage:
      # Contrib collector includes all the community extensions, it is bigger, but we need some of the non-default ones.
      repository: otel/opentelemetry-collector-contrib
      # Default version for our current release is v0.107.0 but that one has a bug: https://github.com/open-telemetry/opentelemetry-operator/issues/3262
      tag: 0.108.0
    createRbacPermissions: true

And in the OpenTelemetryCollector SideCar container I can see image: otel/opentelemetry-collector-contrib:0.108.0

TylerHelmuth commented 2 months ago

@bart-braidwell are you using the OpenTelemetry Operator also?

TylerHelmuth commented 2 months ago

I was able to reproduce the bug with the Operator's collector, but not with a standard collector. I need to investigate more to understand the root cause.

TylerHelmuth commented 2 months ago

I was able to install a collector using the collector helm chart with the following config:

  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
  processors:
    metricstransform:
      transforms:
        - include: (.*)$$
          match_type: regexp
          action: update
          new_name: knative.dev_eventing.$${1}
  exporters:
    debug:
      verbosity: detailed
  service:
    pipelines:
      metrics:
        receivers: [otlp]
        processors: [metricstransform]
        exporters: [debug]

The same config and collector image used via an OpenTelemetryCollector resource from the operator failed.

The operator uses an env var to supply the config, but I tested that delivery method locally and had no issues.

The Operator is definitely doing something here, but I haven't found what it is yet. The operator has a release this week to bump to 0.108.0; I will keep investigating after that.

TylerHelmuth commented 2 months ago

I've been able to reproduce in the collector helm chart using this simplified (core components only) values.yaml:

mode: deployment

resources:
  limits:
    cpu: 200m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 256Mi

image:
  repository: otel/opentelemetry-collector-core
  tag: "0.108.0"

replicaCount: 1

configMap:
  create: false

command:
  extraArgs:
    - "--config=env:OTEL_CONFIG"

extraEnvs:
  - name: OTEL_CONFIG
    value: |
            receivers:
              otlp:
                protocols:
                  grpc:
                  http:
            exporters:
              otlp:
                endpoint: "api.honeycomb.io:443"
                headers:
                  "X-Honeycomb-Team": "$${1}"
            service:
              telemetry:
                logs:
                  level: debug
              pipelines:
                traces:
                  receivers:
                    - otlp
                  exporters:
                    - otlp

I still cannot reproduce the issue locally with an env var. The following works:

OTEL_CONFIG=$(cat ./bin/config.yaml) ./bin/otelcorecol_darwin_arm64 --config=env:OTEL_CONFIG

Where bin/config.yaml is

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      "X-Honeycomb-Team": "$${1}"
service:
  telemetry:
    logs:
      level: debug
  pipelines:
    traces:
      receivers:
        - otlp
      exporters:
        - otlp

codeboten commented 2 months ago

I've been able to reproduce in the collector helm chart using this simplified (core components only) values.yaml:

Should this be transferred to the helm-charts/operator repo then, @TylerHelmuth?

TylerHelmuth commented 2 months ago

Not yet.

I can see why the problem happens only in k8s: Kubernetes escapes $$ to $ in the env var value, so the string passed to the collector via the OTEL_CONFIG env var contains only ${1} instead of $${1}.

I was able to reproduce this behavior outside the collector with a basic app:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  selector:
    matchLabels:
      app: my-pod
  replicas: 1
  template:
    metadata:
      labels:
        app: my-pod
    spec:
      containers:
        - name: myapp
          image: alpine:latest
          args:
            - "env"
          env:
            - name: OTEL_CONFIG
              value: "$$"

When installed, it will produce logs similar to:

OTEL_CONFIG=$

TylerHelmuth commented 2 months ago

The bottom of this section of the Kubernetes documentation specifies that $, which is a special symbol in k8s env vars used to allow dependent env vars such as value: $(OTHER_ENV_VAR), can be escaped with $$.

So this is not an issue with confmap, but an issue with how our string interacts with Kubernetes env vars. I believe the fix is for the Operator to do some massaging of the input string when it creates the env var.
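
As a rough illustration of the two escaping layers, here is a simplified Go sketch; it is not the real kubelet expansion or confmap code, and it models both layers as simply collapsing $$ into a literal $:

package main

import (
    "fmt"
    "strings"
)

// kubeEnvCollapse models how Kubernetes treats $$ in a container env var value.
func kubeEnvCollapse(s string) string { return strings.ReplaceAll(s, "$$", "$") }

// confmapUnescape models how the collector's confmap resolver un-escapes $$.
func confmapUnescape(s string) string { return strings.ReplaceAll(s, "$$", "$") }

func main() {
    // What was written in the OpenTelemetryCollector resource.
    written := "knative.dev_eventing.$${1}"
    fmt.Println(kubeEnvCollapse(written)) // knative.dev_eventing.${1} -- confmap now sees an env var named "1" and fails

    // The workaround: escape once for Kubernetes and once more for the collector.
    workaround := "knative.dev_eventing.$$${1}"
    afterK8s := kubeEnvCollapse(workaround) // knative.dev_eventing.$${1}
    fmt.Println(confmapUnescape(afterK8s))  // knative.dev_eventing.${1}, now a literal value for the processor
}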

TylerHelmuth commented 2 months ago

@Zurvarian @bart-braidwell @cloudchristina the workaround for now is to do extra escaping in your OpenTelemetryCollector resource configs to account for both Kubernetes' escaping and the Collector's escaping. For example:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"opentelemetry.io/v1beta1","kind":"OpenTelemetryCollector","metadata":{"annotations":{},"name":"knative-eventing-otel-collector","namespace":"knative-eventing"},"spec":{"config":{"exporters":{"otlp":{"endpoint":"opentelemetry-collector.monitoring:4317","tls":{"insecure":true}}},"processors":{"metricstransform":{"transforms":[{"action":"update","include":"(.*)37829","match_type":"regexp","new_name":"knative.dev_eventing.37829{1}"}]}},"receivers":{"prometheus_simple":{"collection_interval":"10s","endpoint":"localhost:9090","use_service_account":true}},"service":{"pipelines":{"metrics":{"exporters":["otlp"],"processors":["metricstransform"],"receivers":["prometheus_simple"]}}}},"mode":"sidecar"}}
  creationTimestamp: "2024-08-28T12:37:35Z"
  finalizers:
  - opentelemetrycollector.opentelemetry.io/finalizer
  generation: 2
  labels:
    app.kubernetes.io/managed-by: opentelemetry-operator
  name: knative-eventing-otel-collector
  namespace: knative-eventing
  resourceVersion: "686767636"
  uid: 24f3f157-4696-4dc3-857f-1a1d2b53ff09
spec:
  args:
    feature-gates: -component.UseLocalHostAsDefaultHost
  config:
    exporters:
      otlp:
        endpoint: opentelemetry-collector.monitoring:4317
        tls:
          insecure: true
    processors:
      metricstransform:
        transforms:
        - action: update
          include: (.*)$$
          match_type: regexp
          new_name: knative.dev_eventing.$$${1}
    receivers:
      prometheus_simple:
        collection_interval: 10s
        endpoint: localhost:9090
        use_service_account: true
    service:
      pipelines:
        metrics:
          exporters:
          - otlp
          processors:
          - metricstransform
          receivers:
          - prometheus_simple
  configVersions: 3
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  mode: sidecar
  observability:
    metrics: {}
  podDnsConfig: {}
  replicas: 1
  resources: {}
  targetAllocator:
    allocationStrategy: consistent-hashing
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      scrapeInterval: 30s
    resources: {}
  upgradeStrategy: automatic
status:
  version: 0.107.0

@open-telemetry/operator-approvers we could add logic to https://github.com/open-telemetry/opentelemetry-operator/blob/ba2eba68196349a390d56904ff6afbfa79b38af4/internal/manifests/collector/config_replace.go#L87 to replace all instances of $$ with $$$. This would break users who have already figured out this double escaping hell and are already doing $$$.
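
A hypothetical sketch of that kind of massaging (the function name and placement are made up; this is not the operator's actual implementation):

package collector // hypothetical placement, for illustration only

import "strings"

// escapeConfigForKubernetesEnv re-escapes a rendered collector config before it
// is placed into the OTEL_CONFIG env var. Kubernetes collapses $$ to $ in env
// var values, so each $$ written for the collector's own escaping is emitted
// as $$$ so that the container still receives $$.
func escapeConfigForKubernetesEnv(renderedConfig string) string {
    return strings.ReplaceAll(renderedConfig, "$$", "$$$")
}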

Zurvarian commented 2 months ago

Amazing, thanks! @TylerHelmuth