open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

Target allocator discovery ignores the __address__ in prometheus receiver config #3145

Closed. Tiremisu closed this issue 1 week ago.

Tiremisu commented 1 month ago

Component(s)

target allocator

What happened?

Description

Target allocator discovery ignores the __address__ relabeling in the Prometheus receiver config.

Steps to Reproduce

Configure the target allocator with the following:

config:
  global:
    scrape_interval: 30s  # durations need a unit
  scrape_configs: 
    ## scraping metrics basing on annotations:
    ##   - prometheus.io/scrape: true - to scrape metrics from the pod
    ##   - prometheus.io/path: /metrics - the path from which metrics should be scraped
    ##   - prometheus.io/port: 9113 - the port from which metrics should be scraped
    ## rel: https://github.com/prometheus-operator/kube-prometheus/pull/16#issuecomment-424318647
    - job_name: "pod-annotations"
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - source_labels: [__metrics_path__]
          separator: ;
          regex: (.*)
          target_label: endpoint
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: namespace
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_pod_name]
          separator: ;
          regex: (.*)
          target_label: pod
          replacement: $1
          action: replace

Expected Result

The TA should discover targets using the port from __meta_kubernetes_pod_annotation_prometheus_io_port (e.g. 9113) instead of targets with the ports exposed by the container (e.g. 8080).
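
For illustration, here is how that relabel rule rewrites the address; a worked example with assumed placeholder values for the pod IP and container port:

    # Worked example of the __address__ relabel rule above (values assumed):
    #   __address__                                         = "10.0.0.1:8080"
    #   __meta_kubernetes_pod_annotation_prometheus_io_port = "9113"
    # The source labels are joined with the default separator ";":
    #   "10.0.0.1:8080;9113"
    # The regex ([^:]+)(?::\d+)?;(\d+) captures $1 = "10.0.0.1" and $2 = "9113",
    # so the replacement $1:$2 sets:
    #   __address__ = "10.0.0.1:9113"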

Actual Result

It discovers all targets with the ports exposed by the container. For example, if the pod exposes 14250 and 4317 but the metrics port is 9113, the discovered targets are:

  1. pod_ip:14250
  2. pod_ip:4317

Kubernetes Version

1.28.7

Operator version

0.49.0

Collector version

adot-operator-targetallocator:0.94.1

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

Log output

A bunch of warning logs on the collector side:

Failed to scrape prometheus endpoint....

Additional context

We deployed two operators, one is the adot-operator and the other is the opentelemetry-operator; we are currently using the opentelemetry-operator with AWS images...

jaronoff97 commented 1 month ago

Can you provide some more details from the target allocator, including the response from the /jobs endpoint and logs? It would also be useful to have a reproducible example so we can confirm the bug/behavior.

Tiremisu commented 1 month ago

/jobs response: (screenshot)

targets discovered: (screenshot)

target allocator logs (no errors):

    ...successfully started a collector pod watcher
    ...successfully started a collector pod watcher
    ...successfully started a collector pod watcher



Collector logs:
![image](https://github.com/user-attachments/assets/01597ffc-6d8e-49ee-b82c-3e4d8f06029e)

jaronoff97 commented 1 month ago

This is expected: the TA does not return the relabel config in its response to the collector (though we may apply the relabel config internally), so the collector is expected to apply it itself. Are you seeing the collector fail to scrape the expected targets?
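
For context: when the target allocator is enabled, the collector's Prometheus receiver pulls its targets from the TA over HTTP instead of running service discovery itself. A minimal sketch of that wiring, with placeholder names (the operator normally injects this section automatically):

    receivers:
      prometheus:
        target_allocator:
          # Placeholder service name; the operator fills in the real endpoint.
          endpoint: http://my-targetallocator-service:80
          interval: 30s
          collector_id: ${POD_NAME}
        config:
          global:
            scrape_interval: 30s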

Tiremisu commented 1 month ago

@jaronoff97 I mean the TA does not discover the correct port/endpoint! In the screenshot above, the TA should discover the endpoint 10.12.195.160:9091 instead of 10.12.195.160:9900. We expose /metrics at 9091, so when the TA passes this wrong endpoint to the metrics collector, it fails to scrape the Prometheus endpoint.

Target discovery should find the correct Prometheus endpoint, right?

swiatekm commented 1 month ago

> @jaronoff97 I mean the TA does not discover the correct port/endpoint! In the screenshot above, the TA should discover the endpoint 10.12.195.160:9091 instead of 10.12.195.160:9900. We expose /metrics at 9091, so when the TA passes this wrong endpoint to the metrics collector, it fails to scrape the Prometheus endpoint.
>
> Target discovery should find the correct Prometheus endpoint, right?

It does discover the correct port and endpoint. You then want to modify what it discovered and use the port from the annotation instead of the port exposed by the container. That step is currently done not by the target allocator but by the collector: the collector gets the target labels as exposed by the target allocator, applies your relabelling steps, and then actually scrapes the target.

Unless the collector is not actually scraping the correct targets, this is working as intended.

As an aside, we would like to make the target allocator work the way you expect it to, but this is unfortunately more complicated than one would naively expect.
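
To make the flow concrete with the values from the screenshots above (a sketch; the label set is abbreviated):

    # 1. TA response to the collector (address uses the discovered container port):
    #      __address__: "10.12.195.160:9900"
    #      __meta_kubernetes_pod_annotation_prometheus_io_port: "9091"
    # 2. The collector applies the relabel_configs from the scrape config, so
    #      __address__ becomes "10.12.195.160:9091"
    # 3. The collector scrapes http://10.12.195.160:9091/metrics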

Tiremisu commented 1 month ago

@swiatekm Thanks for the explanation. I'll try to add the relabelling steps. Could you also link me to the guide?

jaronoff97 commented 1 month ago

The collector should already be applying the relabel config you have provided. I know you linked the target allocator config above, but could you also share the collector CRD you are using?

Tiremisu commented 1 month ago

> The collector should already be applying the relabel config you have provided. I know you linked the target allocator config above, but could you also share the collector CRD you are using?

I am using opentelemetry-operator 0.98.0

jaronoff97 commented 1 month ago

Okay, can you also share the collector configuration?

Tiremisu commented 1 month ago

    exporters:
      otlphttp:
        disable_keep_alives: true
        endpoint: http://${METADATA_METRICS_SVC}.${NAMESPACE}.svc.cluster.local.:4318
        sending_queue:
          num_consumers: 10
          queue_size: 10000
          storage: file_storage
    extensions:
      file_storage:
        compaction:
          directory: /tmp
          on_rebound: true
        directory: /var/lib/storage/otc
        timeout: 10s
      health_check: {}
      pprof: {}
    processors:
      batch:
        send_batch_max_size: 2000
        send_batch_size: 1000
        timeout: 1s
      filter/drop_stale_datapoints:
        metrics:
          datapoint:
          - flags == FLAG_NO_RECORDED_VALUE
      transform/drop_unnecessary_attributes:
        error_mode: ignore
        metric_statements:
        - context: resource
          statements:
          - delete_key(attributes, "http.scheme")
          - delete_key(attributes, "net.host.name")
          - delete_key(attributes, "net.host.port")
          - delete_key(attributes, "service.instance.id")
          - delete_matching_keys(attributes, "k8s.*")
      transform/extract_sum_count_from_histograms:
        error_mode: ignore
        metric_statements:
        - context: metric
          statements:
          - extract_sum_metric(true) where IsMatch(name, "^(apiserver_request_duration_seconds|coredns_dns_request_duration_seconds|kubelet_runtime_operations_duration_seconds)$")
          - extract_count_metric(true) where IsMatch(name, "^(apiserver_request_duration_seconds|coredns_dns_request_duration_seconds|kubelet_runtime_operations_duration_seconds)$")
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 30s
          scrape_configs:
          - job_name: pod-annotations
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              source_labels:
              - __address__
              - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: replace
              regex: (.*)
              replacement: $1
              separator: ;
              source_labels:
              - __metrics_path__
              target_label: endpoint
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: namespace
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              regex: (.*)
              replacement: $1
              separator: ;
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: pod
          - authorization:
              credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_labels: true
            job_name: kubelet
            kubernetes_sd_configs:
            - role: node
            metric_relabel_configs:
            - action: keep
              regex: (?:kubelet_docker_operations_errors(?:|_total)|kubelet_(?:docker|runtime)_operations_duration_seconds_(?:count|sum)|kubelet_running_(?:container|pod)(?:_count|s)|kubelet_(?:docker|runtime)_operations_latency_microseconds(?:|_count|_sum))
              source_labels:
              - __name__
            - action: labeldrop
              regex: id
            relabel_configs:
            - source_labels:
              - __meta_kubernetes_node_name
              target_label: node
            - replacement: https-metrics
              target_label: endpoint
            - action: replace
              source_labels:
              - __metrics_path__
              target_label: metrics_path
            - action: replace
              source_labels:
              - __address__
              target_label: instance
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - authorization:
              credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_labels: true
            job_name: cadvisor
            kubernetes_sd_configs:
            - role: node
            metric_relabel_configs:
            - action: replace
              regex: .*
              replacement: kubelet
              source_labels:
              - __name__
              target_label: job
            - action: keep
              regex: (?:container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_fs_usage_bytes|container_fs_limit_bytes|container_cpu_cfs_throttled_seconds_total|container_network_receive_bytes_total|container_network_transmit_bytes_total)
              source_labels:
              - __name__
            - action: drop
              regex: (?:container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_fs_usage_bytes|container_fs_limit_bytes);$
              source_labels:
              - __name__
              - container
            - action: labelmap
              regex: container_name
              replacement: container
            - action: drop
              regex: POD
              source_labels:
              - container
            - action: labeldrop
              regex: (id|name)
            metrics_path: /metrics/cadvisor
            relabel_configs:
            - replacement: https-metrics
              target_label: endpoint
            - action: replace
              source_labels:
              - __metrics_path__
              target_label: metrics_path
            - action: replace
              source_labels:
              - __address__
              target_label: instance
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
    service:
      extensions:
      - health_check
      - pprof
      - file_storage
      pipelines:
        metrics:
          exporters:
          - otlphttp
          processors:
          - batch
          - filter/drop_stale_datapoints
          - transform/extract_sum_count_from_histograms
          - transform/drop_unnecessary_attributes
          receivers:
          - prometheus
      telemetry:
        logs:
          level: info
        metrics:
          address: 0.0.0.0:8888

Tiremisu commented 1 month ago

> Okay, can you also share the collector configuration?

Will update tomorrow when I'm back in the office. I was a little mixed up about how the operator/TA passes the config to the metrics collector.

jaronoff97 commented 1 month ago

I think your issue is that you are not escaping the $ in your otel configuration ($ must be written as $$ in the collector config); please refer to the prometheusreceiver configuration here.
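
For reference, a sketch of the same relabel rule from the config above with the $ escaped as $$, so the collector's configuration expansion passes the literal $1:$2 through to Prometheus:

    relabel_configs:
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        # $$ escapes $ in collector configs; Prometheus then sees $1:$2
        replacement: $$1:$$2
        target_label: __address__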

Tiremisu commented 1 month ago

Using this collector config: (screenshot)

jaronoff97 commented 1 month ago

The configuration you linked above doesn't match the configuration in that screenshot. Also, if you can, please avoid sending screenshots of config; working with text is much easier for debugging and reproduction purposes.

astryia commented 4 weeks ago

I think this behavior of the target allocator causes "out of order sample" errors in prometheusremotewrite in my setup. I have a replica set of 3 collectors, and it looks like the allocator assigns the same target (with different ports) to different collectors (see the allocation-strategy sketch after the JSON examples below).

Here is an example of an istiod pod which exposes 3 ports: http://allocator:8080/jobs/kubernetes-pods/targets?collector_id=adot-amp-collector-1

[
  {
    "targets": [
      "10.4.29.97:8080"
    ],
    "labels": {
      "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_label_istio_io_rev": "default",
      "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
      "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
      "__meta_kubernetes_pod_container_port_number": "8080",
      "__meta_kubernetes_pod_container_port_name": "",
      "__meta_kubernetes_namespace": "istio-system",
      "__meta_kubernetes_pod_host_ip": "10.4.27.160",
      "__meta_kubernetes_pod_phase": "Running",
      "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
      "__meta_kubernetes_pod_ready": "true",
      "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
      "__meta_kubernetes_pod_container_port_protocol": "TCP",
      "__meta_kubernetes_pod_ip": "10.4.29.97",
      "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true",
      "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
      "__meta_kubernetes_pod_label_istio": "pilot",
      "__address__": "10.4.29.97:8080",
      "__meta_kubernetes_pod_label_app": "istiod",
      "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
      "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
      "__meta_kubernetes_pod_labelpresent_istio": "true",
      "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
      "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
      "__meta_kubernetes_pod_container_init": "false",
      "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
      "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_labelpresent_app": "true",
      "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_container_name": "discovery",
      "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
      "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
      "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
      "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
      "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
      "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true"
    }
  }
]

http://allocator:8080/jobs/kubernetes-pods/targets?collector_id=adot-amp-collector-2

[
  {
    "targets": [
      "10.4.29.97:15010"
    ],
    "labels": {
      "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
      "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
      "__meta_kubernetes_pod_container_port_number": "15010",
      "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
      "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_label_istio_io_rev": "default",
      "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true",
      "__meta_kubernetes_pod_container_name": "discovery",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_labelpresent_app": "true",
      "__meta_kubernetes_pod_ready": "true",
      "__meta_kubernetes_pod_label_istio": "pilot",
      "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
      "__meta_kubernetes_pod_phase": "Running",
      "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
      "__meta_kubernetes_pod_container_port_protocol": "TCP",
      "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
      "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
      "__meta_kubernetes_namespace": "istio-system",
      "__meta_kubernetes_pod_label_app": "istiod",
      "__meta_kubernetes_pod_container_init": "false",
      "__meta_kubernetes_pod_container_port_name": "",
      "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
      "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
      "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
      "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
      "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
      "__meta_kubernetes_pod_host_ip": "10.4.27.160",
      "__address__": "10.4.29.97:15010",
      "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
      "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
      "__meta_kubernetes_pod_labelpresent_istio": "true",
      "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
      "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
      "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
      "__meta_kubernetes_pod_ip": "10.4.29.97"
    }
  },
  {
    "targets": [
      "10.4.29.97:15017"
    ],
    "labels": {
      "__meta_kubernetes_pod_host_ip": "10.4.27.160",
      "__meta_kubernetes_pod_phase": "Running",
      "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
      "__meta_kubernetes_pod_container_port_name": "",
      "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
      "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
      "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
      "__meta_kubernetes_pod_container_port_protocol": "TCP",
      "__meta_kubernetes_pod_labelpresent_istio": "true",
      "__meta_kubernetes_pod_ready": "true",
      "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
      "__meta_kubernetes_pod_ip": "10.4.29.97",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
      "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
      "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
      "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
      "__meta_kubernetes_namespace": "istio-system",
      "__meta_kubernetes_pod_container_port_number": "15017",
      "__meta_kubernetes_pod_container_name": "discovery",
      "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
      "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
      "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
      "__meta_kubernetes_pod_labelpresent_app": "true",
      "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
      "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
      "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
      "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
      "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_container_init": "false",
      "__address__": "10.4.29.97:15017",
      "__meta_kubernetes_pod_label_istio": "pilot",
      "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_label_app": "istiod",
      "__meta_kubernetes_pod_label_istio_io_rev": "default",
      "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true"
    }
  }
]
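
For reference, how the TA spreads targets across collector replicas is configurable on the collector CR. A minimal sketch, with field values assumed from this thread's naming (note this is a knob to inspect, not by itself a fix for out-of-order samples):

    apiVersion: opentelemetry.io/v1beta1  # may be v1alpha1 on older operator versions
    kind: OpenTelemetryCollector
    metadata:
      name: adot-amp  # assumed name
    spec:
      targetAllocator:
        enabled: true
        # one of: least-weighted, consistent-hashing, per-node
        allocationStrategy: consistent-hashing
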
jaronoff97 commented 1 week ago

@astryia I don't think that would be causing an out-of-order error unless each target (ip:port combo) results in the exact same time series. If you can reliably reproduce this, I would recommend opening a new issue. I'm going to close this issue for now as it seems to have gone stale.