open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[Prometheus Receiver] Prometheus Receiver configuration for etcd, kube-scheduler and kube-controller in standard K8s cluster not working #34211

Closed: developer1622 closed this issue 2 weeks ago

developer1622 commented 1 month ago

Component(s)

receiver/prometheus

What happened?

Description

Please bear with me through the descriptive error message; it is actually short.

I am trying to scrape Prometheus metrics for etcd, kube-scheduler, and kube-controller-manager. However, it results in an error. I have tried multiple relabel configurations to get the final target URL correct, but it is still not working.

I exec'd into a pod and used curl against the respective pod IP targets; all of them worked, but with the scrape config it does not.

Steps to Reproduce

Put the following scrape config under the receivers section:


    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: etcd
              scheme: https
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels:
                    - __meta_kubernetes_namespace
                    - __meta_kubernetes_pod_name
                  separator: "/"
                  regex: "kube-system/etcd.+"

                # **This did not work:**
                # - source_labels:
                #     - __address__
                #   action: replace
                #   target_label: __address__
                #   regex: (.+?)(\\:\\d)?
                #   replacement: $1:2379

                # Specify the port
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  replacement: ${__meta_kubernetes_pod_ip}:2379
                  # **here below for the replacement I have tried multiple options, none worked**
                  # replacement: $1:2379
                  # replacement: ${1}:2379

              tls_config:
                insecure_skip_verify: true
                ca_file: /etc/etcd/ca.crt
                cert_file: /etc/etcd/server.crt
                key_file: /etc/etcd/server.key

            - job_name: kube-controller-manager
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-controller-manager;control-plane

                # Replace the address to use the pod IP with port 10257
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10257
                  # **here below for the replacement I have tried multiple options, none worked**
                  # replacement: ${1}:10257

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10259
                # - source_labels: [__meta_kubernetes_pod_ip]
                #   action: replace
                #   target_label: __address__
                #   # regex: (.*)
                #   regex: ^(.*)$
                #   replacement: ${__meta_kubernetes_pod_ip}:10259

                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: ^(.*)$  # Captures the entire IP address
                  replacement: ${__meta_kubernetes_pod_ip}:10259
                  # **here below for the replacement I have tried multiple options, none worked**
                  # replacement: ${1}:10259

Expected Result

I expect to see the metrics for each scraped target.

Actual Result

I have tried multiple configurations, so I got multiple errors; I will post all of them here.


2024-07-23T07:19:14.872Z        warn    expandconverter@v0.102.1/expand.go:107  Configuration references unset environment variable     {"name": "__meta_kubernetes_pod_ip"}
2024-07-23T07:19:14.872Z        warn    expandconverter@v0.102.1/expand.go:107  Configuration references unset environment variable     {"name": "__meta_kubernetes_pod_ip"}
Error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/07/23 07:19:14 collector server run finished with error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

Second

Error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/07/23 07:21:46 collector server run finished with error: failed to resolve config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "1" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

and third

It seems the complete URL is not being built, as the errors below show: the `instance` label is missing the host for all 3 components.

2024-07-23T07:24:25.931Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719465929, "target_labels": "{__name__=\"up\", instance=\":2379\", job=\"etcd\"}"}

2024-07-23T07:24:33.006Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719473005, "target_labels": "{__name__=\"up\", instance=\":10257\", job=\"kube-controller-manager\"}"}

2024-07-23T07:25:17.586Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719517584, "target_labels": "{__name__=\"up\", instance=\":10259\", job=\"kube-scheduler\"}"}

Collector version

The latest contrib image: otel/opentelemetry-collector-contrib:latest

Environment information

Environment

It is a multi-control-plane K8s cluster with 3 control plane nodes, so I have 3 etcd instances, 3 kube-controller-managers, and 3 kube-schedulers.

OpenTelemetry Collector configuration

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: otelcontribcol
  name: otelcontribcol
  namespace: default
data:
  config.yaml: |
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: etcd
              scheme: https
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  source_labels:
                    - __meta_kubernetes_namespace
                    - __meta_kubernetes_pod_name
                  separator: "/"
                  regex: "kube-system/etcd.+"

                # - source_labels:
                #     - __address__
                #   action: replace
                #   target_label: __address__
                #   regex: (.+?)(\\:\\d)?
                #   replacement: $1:2379

                # Specify the port
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:2379
                  replacement: ${__meta_kubernetes_pod_ip}:2379

              tls_config:
                insecure_skip_verify: true
                ca_file: /etc/etcd/ca.crt
                cert_file: /etc/etcd/server.crt
                key_file: /etc/etcd/server.key

            - job_name: kube-controller-manager
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-controller-manager;control-plane

                # Replace the address to use the pod IP with port 10257
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  # replacement: $1:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10257

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10259
                # - source_labels: [__meta_kubernetes_pod_ip]
                #   action: replace
                #   target_label: __address__
                #   # regex: (.*)
                #   regex: ^(.*)$
                # replacement: ${__meta_kubernetes_pod_ip}:10257

                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: ^(.*)$  # Captures the entire IP address
                  # replacement: ${1}:10257
                  replacement: ${__meta_kubernetes_pod_ip}:10259

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1000
        send_batch_max_size: 2000

    exporters:
      debug:
        verbosity: detailed

    service:
      telemetry:
        metrics:
          address: 0.0.0.0:8881
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [batch]
          exporters: [debug]

Log output

I have attached more logs above.

2024-07-23T07:24:25.931Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719465929, "target_labels": "{__name__=\"up\", instance=\":2379\", job=\"etcd\"}"}

2024-07-23T07:24:33.006Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719473005, "target_labels": "{__name__=\"up\", instance=\":10257\", job=\"kube-controller-manager\"}"}

2024-07-23T07:25:17.586Z        warn    internal/transaction.go:125     Failed to scrape Prometheus endpoint    {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1721719517584, "target_labels": "{__name__=\"up\", instance=\":10259\", job=\"kube-scheduler\"}"}

Additional context

It is a multi-control-plane K8s cluster.

Thank you. I have tried to get the targets built, but it seems they are not being built. If my scrape config is not correct, please share a correct scrape config for these 3 K8s components.

Here are my pods for all 3 components:

kube-scheduler-master01                    1/1     Running       8 (18d ago)   237d   component=kube-scheduler,tier=control-plane
kube-scheduler-master02                    1/1     Running       4 (43d ago)   237d   component=kube-scheduler,tier=control-plane
kube-scheduler-master03                    1/1     Running       6 (18d ago)   237d   component=kube-scheduler,tier=control-plane

kube-controller-manager-master01           1/1     Running       8 (18d ago)   237d   component=kube-controller-manager,tier=control-plane
kube-controller-manager-master02           1/1     Running       4 (43d ago)   237d   component=kube-controller-manager,tier=control-plane
kube-controller-manager-master03           1/1     Running       6 (18d ago)   237d   component=kube-controller-manager,tier=control-plane

etcd-master01                              1/1     Running       4 (27d ago)   237d   component=etcd,tier=control-plane
etcd-master02                              1/1     Running       2 (43d ago)   237d   component=etcd,tier=control-plane
etcd-master03                              1/1     Running       1 (27d ago)   237d   component=etcd,tier=control-plane

Thank you

github-actions[bot] commented 1 month ago

Pinging code owners:

developer1622 commented 1 month ago

With my standard Prometheus deployment (YAML attached below), I can see the targets being built, but the OTel Prometheus receiver throws errors.

Screenshot 2024-07-23 at 3 59 44 PM

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  labels:
    app: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: 2m
      evaluation_interval: 2m
    scrape_configs:
      - job_name: etcd
        scheme: https
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only etcd pods in the kube-system namespace
          - action: keep
            source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name]
            separator: /
            regex: "kube-system/etcd.+"

          # Replace the address to use the pod IP with port 2379
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:2379

        tls_config:
          insecure_skip_verify: true
          ca_file: /etc/etcd/ca.crt
          cert_file: /etc/etcd/server.crt
          key_file: /etc/etcd/server.key

      - job_name: kube-controller-manager
        honor_labels: true
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kube-system
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        scheme: https
        tls_config:
          insecure_skip_verify: true
        relabel_configs:
          # Keep pods with the specified labels
          - source_labels: [__meta_kubernetes_pod_label_component, __meta_kubernetes_pod_label_tier]
            action: keep
            regex: kube-controller-manager;control-plane

          # Replace the address to use the pod IP with port 10257
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:10257

      - job_name: kube-scheduler
        honor_labels: true
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - kube-system
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        scheme: https
        tls_config:
          insecure_skip_verify: true
        relabel_configs:
          # Keep pods with the specified labels
          - source_labels: [__meta_kubernetes_pod_label_component, __meta_kubernetes_pod_label_tier]
            action: keep
            regex: kube-scheduler;control-plane

          # Replace the address to use the pod IP with port 10259
          - source_labels: [__meta_kubernetes_pod_ip]
            action: replace
            target_label: __address__
            regex: (.*)
            replacement: $1:10259

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus-cont
          image: prom/prometheus
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus/prometheus.yml
              subPath: prometheus.yml
            - mountPath: /etc/etcd
              name: etcd-certs
          ports:
            - containerPort: 9090
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - configMap:
            name: etcd-certs
          name: etcd-certs
      hostNetwork: true
      serviceAccount: otelcontribcol
      serviceAccountName: otelcontribcol
---
kind: Service
apiVersion: v1
metadata:
  name: prometheus-service
spec:
  selector:
    app: prometheus
  ports:
    - name: promui
      nodePort: 30900
      protocol: TCP
      port: 9090
      targetPort: 9090
  type: NodePort

Thank you.

dashpole commented 1 month ago

I skimmed the issue, so apologies if I missed this. The OTel Collector interprets `$1` as an environment variable reference. You need to escape it as `$$1`.
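For example, the etcd address rewrite from the config above would look like this (a sketch; only the `replacement` value changes):

```yaml
# Sketch: the same relabel rule with the capture-group reference escaped.
# $$1 reaches the Prometheus relabeler as a literal $1; an unescaped $1
# is treated as an environment variable by the collector's config resolver.
- source_labels: [__meta_kubernetes_pod_ip]
  action: replace
  target_label: __address__
  regex: (.*)
  replacement: $$1:2379
```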

dashpole commented 1 month ago

LMK if that was your issue, or if I misread

developer1622 commented 1 month ago

Hi @dashpole.

Thank you very much for the response; it worked after using two dollar signs (`$$`). You really saved me.

So, whatever works in standard Prometheus needs tweaking to run in the OTel Prometheus receiver?

Is this deviation from the standard Prometheus scrape config something architecturally specific that end users need to know about?

Thank you.

dashpole commented 1 month ago

It exists because the Prometheus server config doesn't support environment variables, but the OTel Collector does.
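To illustrate the two behaviours side by side (a sketch; `MY_TOKEN` is a hypothetical environment variable, not from the original configs):

```yaml
# The collector's config resolver expands environment-variable references
# at load time, which plain Prometheus config does not do:
bearer_token: ${env:MY_TOKEN}

# A doubled dollar sign escapes to a literal $, so the Prometheus
# relabeler receives $1 as a capture-group reference, not an env var:
replacement: $$1:2379
```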

developer1622 commented 1 month ago

Hi @dashpole, I forgot to ask one question, thank you.

I have the below kube-scheduler scrape config, which is working fine:

            - job_name: kube-scheduler
              honor_labels: true
              kubernetes_sd_configs:
                - role: pod
                  namespaces:
                    names:
                      - kube-system
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              scheme: https
              tls_config:
                insecure_skip_verify: true
              relabel_configs:
                # Keep pods with the specified labels
                - source_labels:
                    [
                      __meta_kubernetes_pod_label_component,
                      __meta_kubernetes_pod_label_tier,
                    ]
                  action: keep
                  regex: kube-scheduler;control-plane

                # Replace the address to use the pod IP with port 10259
                - source_labels: [__meta_kubernetes_pod_ip]
                  action: replace
                  target_label: __address__
                  regex: (.*)
                  replacement: $$1:10259

So, with this configuration, I can see scheduler metrics from only one instance (I have 3 control plane nodes, which means 3 schedulers).

Is this expected behaviour in multi-control-plane (multi-master) K8s clusters? The scrapes for the other 2 control plane schedulers are failing, but one of them is successful.

Here is the sample log of 2 instances failing

Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1722538620170, "target_labels": "{__name__=\"up\", instance=\".11:10259\", job=\"kube-scheduler\"}"}

Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1722538620179, "target_labels": "{__name__=\"up\", instance=\".10:10259\", job=\"kube-scheduler\"}"}

Thank you.

dashpole commented 1 month ago

I would expect metrics from all 3 instances. Try raising the logging verbosity to DEBUG to see the detailed scrape failure reason.
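In the collector config, that can be done with something like this (a sketch of the service telemetry settings):

```yaml
# Raise the collector's own log level so scrape failures include details.
service:
  telemetry:
    logs:
      level: debug
```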