open-telemetry / opentelemetry-collector-contrib

OpenTelemetry collector failed to boot up when passing in match group references (${1}, ${2}, ...) to Prometheus receiver #35733

Open chc5 opened 1 month ago

chc5 commented 1 month ago

Component(s)

receiver/prometheus

What happened?

Description

Starting with v0.105.0, the OpenTelemetry Collector no longer works with my configuration, which relies on appending the port number to the address in order to scrape metrics from other Kubernetes pods with the Prometheus receiver. The same configuration worked on v0.104.0 and below. I noticed changes such as confmap.strictlyTypedInput and confmap.unifyEnvVarExpansion that may have made my configuration incompatible, and after further research I could not find an alternative solution.

Steps to Reproduce

Create a Prometheus receiver that uses relabel_configs with match group references in replacement, which Prometheus substitutes with the matched values: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

For example, in the OpenTelemetry Collector I set this to $$1:$$2 to escape environment variable resolution (Reference):

- action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $$1:$$2
  source_labels:
  - __address__
  - __meta_kubernetes_pod_annotation_prometheus_io_port
  target_label: __address__
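
To illustrate what this rule is meant to do once the escaped `$$1:$$2` has become `$1:$2`, here is a minimal Python sketch of the `replace` action as described in the relabel_config documentation linked above. It is an approximation for illustration, not the Prometheus implementation; the function name `relabel_replace` and the sample label values are hypothetical.

```python
import re

def relabel_replace(labels, source_labels, regex, replacement, separator=";"):
    """Approximate the Prometheus `replace` action (illustrative only)."""
    # The source label values are joined with the separator (";" by default).
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # The regex must match the whole joined string.
    match = re.fullmatch(regex, joined)
    if match is None:
        return None  # no match: the target label would be left unchanged
    # Translate $1 / ${1} style references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)

# Hypothetical pod: the discovered address has one port, the annotation another.
labels = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
new_address = relabel_replace(
    labels,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
)
print(new_address)  # 10.0.0.5:9102
```

The first capture group keeps the host, the non-capturing group discards any existing port, and the second capture group takes the port from the annotation.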

Expected Result

The OpenTelemetry Collector should continue to support $$1:$$2, or provide an alternative that allows named variables to be passed in, such as $${__address__}:$${__meta_kubernetes_pod_annotation_prometheus_io_port}.

Actual Result

With $$1:$$2, the Collector fails to start with the following error:

Error: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/10/10 19:14:49 Failed to run the service: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
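
The regex quoted in the error only admits names that start with a letter or underscore, so a reference such as `${2}` can never be a valid environment variable name, while an escaped `$$2` should never reach the lookup at all. The sketch below is a simplified Python model of those two rules, written only to make the expectation concrete. It is not the collector's confmap code: the function name `expand` is made up, unset variables are assumed to expand to an empty string, and `${env:NAME}` and other providers are ignored.

```python
import re

# Name rule copied from the error message above.
ENV_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def expand(value, env):
    """Simplified model: `$$` yields a literal `$`, `${NAME}` looks up NAME."""
    def substitute(match):
        if match.group(0) == "$$":
            return "$"  # escaped dollar sign, no lookup happens
        name = match.group(1)
        if not ENV_NAME.match(name):
            raise ValueError(f'environment variable "{name}" has invalid name')
        return env.get(name, "")  # assumption: unset variables become ""
    return re.sub(r"\$\$|\$\{([^}]*)\}", substitute, value)

print(expand("$$1:$$2", {}))  # $1:$2
print(expand("spec.nodeName=${NODE_NAME}", {"NODE_NAME": "node-a"}))  # spec.nodeName=node-a
```

Under this model, `$$1:$$2` passes through untouched by the environment lookup, and only an unescaped `${2}` would raise the invalid-name error.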

Collector version

v0.104.0 works, but any version higher than 0.104.0 produces this bug.

Environment information

Environment

OS:
Compiler (if manually compiled): golang:1.22

OpenTelemetry Collector configuration

exporters:
  googlecloud:
    metric:
      endpoint: monitoring.googleapis.com:443
      instrumentation_library_labels: "false"
      prefix: custom.googleapis.com
      service_resource_labels: "false"
      skip_create_descriptor: "true"
    project: test-tenant-project-id
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
  filter/apps:
    metrics:
      include:
        match_type: regexp
        metric_names:
        - server_nio
  memory_limiter/prevent_oom:
    check_interval: 30s
    limit_percentage: 80
    spike_limit_percentage: 30
  metricstransform/apps:
    transforms:
    - action: update
      include: server_nio
      new_name: custom.googleapis.com/server/nio
      operations:
      - action: aggregate_labels
        aggregation_type: sum
        label_set:
        - state
      - action: toggle_scalar_data_type
  resource/container:
    attributes:
    - action: delete
      pattern: net.*
    - action: delete
      pattern: service.*
    - action: delete
      key: http.scheme
    - action: delete
      key: method
    - action: upsert
      key: cloud.region
      value: us-west1
    - action: upsert
      key: k8s.cluster.name
      value: test-cluster-name
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  extensions:
  - health_check
  pipelines:
    metrics/apps:
      exporters:
      - googlecloud
      processors:
      - memory_limiter/prevent_oom
      - batch
      - filter/apps
      - resource/container
      - metricstransform/apps
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091

Log output

Error: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/10/10 19:14:49 Failed to run the service: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

Additional context

https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/env-vars.md#issues-of-current-behavior
https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9984

github-actions[bot] commented 1 month ago

Pinging code owners:

dashpole commented 1 month ago

@mx-psi I haven't been following the configuration work closely enough to answer this. Do you know what prometheus users should do going forward?

mx-psi commented 1 month ago

I am unable to reproduce this. With the original file I get the following errors:

Error log with file provided in original post:

```
❯ ./otelcol-contrib --config config.yaml
2024-10-11T10:32:07.587-0400 warn envprovider@v1.17.0/provider.go:59 Configuration references unset environment variable {"name": "NODE_NAME"}
2024-10-11T10:32:07.587-0400 warn envprovider@v1.17.0/provider.go:59 Configuration references unset environment variable {"name": "POD_NAME"}
Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'exporters': error reading configuration for "googlecloud": decoding failed due to the following error(s):

'metric.skip_create_descriptor' expected type 'bool', got unconvertible type 'string', value: 'true'
'metric.instrumentation_library_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
'metric.service_resource_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
2024/10/11 10:32:07 collector server run finished with error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'exporters': error reading configuration for "googlecloud": decoding failed due to the following error(s):

'metric.skip_create_descriptor' expected type 'bool', got unconvertible type 'string', value: 'true'
'metric.instrumentation_library_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
'metric.service_resource_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
```
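
These decoding errors are consistent with strict typing: a quoted `"true"` in YAML is a string, not a boolean. As a rough Python model of the difference, assuming that weakly typed decoding would have coerced such strings (`decode_bool` is a hypothetical helper for illustration, not a collector API, and its error text only imitates the log above):

```python
def decode_bool(field, value, strict=True):
    """Illustrative only: accept real booleans, optionally coerce strings."""
    if isinstance(value, bool):
        return value
    # Assumed weak-typing behaviour: "true"/"false" strings are converted.
    if not strict and isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise TypeError(
        f"'{field}' expected type 'bool', got unconvertible type "
        f"'{type(value).__name__}', value: '{value}'"
    )

print(decode_bool("metric.skip_create_descriptor", True))                  # True
print(decode_bool("metric.skip_create_descriptor", "true", strict=False))  # True
```

With `strict=True`, passing the string `"true"` raises, which is why unquoting the three googlecloud booleans resolves this particular error.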

With a fixed file:

Fixed file:

```yaml
extensions:
  health_check:
exporters:
  googlecloud:
    metric:
      endpoint: monitoring.googleapis.com:443
      instrumentation_library_labels: false
      prefix: custom.googleapis.com
      service_resource_labels: false
      skip_create_descriptor: true
    project: test-tenant-project-id
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
  filter/apps:
    metrics:
      include:
        match_type: regexp
        metric_names:
        - server_nio
  memory_limiter/prevent_oom:
    check_interval: 30s
    limit_percentage: 80
    spike_limit_percentage: 30
  metricstransform/apps:
    transforms:
    - action: update
      include: server_nio
      new_name: custom.googleapis.com/server/nio
      operations:
      - action: aggregate_labels
        aggregation_type: sum
        label_set:
        - state
      - action: toggle_scalar_data_type
  resource/container:
    attributes:
    - action: delete
      pattern: net.*
    - action: delete
      pattern: service.*
    - action: delete
      key: http.scheme
    - action: delete
      key: method
    - action: upsert
      key: cloud.region
      value: us-west1
    - action: upsert
      key: k8s.cluster.name
      value: test-cluster-name
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  extensions:
  - health_check
  pipelines:
    metrics/apps:
      exporters:
      - googlecloud
      processors:
      - memory_limiter/prevent_oom
      - batch
      - filter/apps
      - resource/container
      - metricstransform/apps
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091
```

The config validates (I get a different error but it's just wrong setup):

Logs with fixed config:

```
❯ ./otelcol-contrib --config fixed-config.yaml
2024-10-11T10:33:05.084-0400 info service@v0.111.0/service.go:136 Setting up own telemetry...
2024-10-11T10:33:05.084-0400 warn service@v0.111.0/service.go:191 service::telemetry::metrics::address is being deprecated in favor of service::telemetry::metrics::readers
2024-10-11T10:33:05.084-0400 info telemetry/metrics.go:70 Serving metrics {"address": ":9091", "metrics level": "Normal"}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "exporter", "data_type": "metrics", "name": "googlecloud"}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "processor", "name": "metricstransform/apps", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "processor", "name": "resource/container", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Alpha component. May change in the future. {"kind": "processor", "name": "filter/apps", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400 info filterprocessor@v0.111.0/metrics.go:98 Metric filter configured {"kind": "processor", "name": "filter/apps", "pipeline": "metrics/apps", "include match_type": "regexp", "include expressions": [], "include metric names": ["server_nio"], "include metrics with resource attributes": null, "exclude match_type": "", "exclude expressions": [], "exclude metric names": [], "exclude metrics with resource attributes": null}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "processor", "name": "batch", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.086-0400 info memorylimiter/memorylimiter.go:151 Using percentage memory limiter {"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps", "total_memory_mib": 31765, "limit_percentage": 80, "spike_limit_percentage": 30}
2024-10-11T10:33:05.086-0400 info memorylimiter/memorylimiter.go:75 Memory limiter configured {"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps", "limit_mib": 25412, "spike_limit_mib": 9529, "check_interval": 30}
2024-10-11T10:33:05.086-0400 debug builders/builders.go:24 Beta component. May change in the future. {"kind": "receiver", "name": "prometheus/apps", "data_type": "metrics"}
2024-10-11T10:33:05.086-0400 debug builders/extension.go:48 Beta component. May change in the future. {"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.072-0400 warn envprovider@v1.17.0/provider.go:59 Configuration references unset environment variable {"name": "NODE_NAME"}
2024-10-11T10:33:05.072-0400 warn envprovider@v1.17.0/provider.go:59 Configuration references unset environment variable {"name": "POD_NAME"}
2024-10-11T10:33:05.087-0400 info service@v0.111.0/service.go:208 Starting otelcol-contrib... {"Version": "0.111.0", "NumCPU": 20}
2024-10-11T10:33:05.087-0400 info extensions/extensions.go:39 Starting extensions...
2024-10-11T10:33:05.087-0400 info extensions/extensions.go:42 Extension is starting... {"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.087-0400 info healthcheckextension@v0.111.0/healthcheckextension.go:33 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-10-11T10:33:05.088-0400 info extensions/extensions.go:59 Extension started. {"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.114-0400 error graph/graph.go:426 Failed to start component {"error": "error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information", "type": "Exporter", "id": "googlecloud"}
2024-10-11T10:33:05.114-0400 info service@v0.111.0/service.go:270 Starting shutdown...
2024-10-11T10:33:05.114-0400 info healthcheck/handler.go:132 Health Check state change {"kind": "extension", "name": "health_check", "status": "unavailable"}
2024-10-11T10:33:05.115-0400 info extensions/extensions.go:66 Stopping extensions...
2024-10-11T10:33:05.115-0400 info service@v0.111.0/service.go:284 Shutdown complete.
Error: cannot start pipelines: error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information; failed to shutdown pipelines: no existing monitoring routine is running
2024/10/11 10:33:05 collector server run finished with error: cannot start pipelines: error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information; failed to shutdown pipelines: no existing monitoring routine is running
```

@TylerHelmuth could you also take a look? Could this be operator-specific? (It is unclear what environment we are talking about here.)

chc5 commented 1 month ago

I believe the error that you're seeing could be related to googlecloudexporter not having the right credentials. I've trimmed down the config to only use prometheus receiver along with other basic processors and exporters. I hope this config works for you to reproduce the main error on your end:

Revised config:

```yaml
exporters:
  debug:
    verbosity: detailed
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  pipelines:
    metrics/apps:
      exporters:
      - debug
      processors:
      - batch
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091
```

mx-psi commented 1 month ago

After adding the health_check extension it works fine for me.

I tested this with the following steps (Linux amd64 machine):

❯ curl -L -o contrib0.111.tar.gz https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.111.0/otelcol-contrib_0.111.0_linux_amd64.tar.gz
❯ tar pfx contrib0.111.tar.gz 
❯ ./otelcol-contrib --config config.yaml 

and it seems to run fine. So again, I think this may be something specific to how you are running your Collector. Are you using the operator?