Can you provide some more details from the target allocator, including the response from the /jobs endpoint and its logs? It would also be useful to give a reproducible example so we can confirm the bug/behavior.
/jobs:

targets discovered:

Target allocator logs (no errors):

```
...successfully started a collector pod watcher
...successfully started a collector pod watcher
...successfully started a collector pod watcher
```
Collector logs:
![image](https://github.com/user-attachments/assets/01597ffc-6d8e-49ee-b82c-3e4d8f06029e)
This is expected: the TA does not return the relabel config in its response to the collector (though we may run the relabel config internally), so the collector is expected to apply it itself. Are you seeing the collector failing to scrape the expected targets?
@jaronoff97 I mean the TA does not discover the correct port/endpoint! In the screenshot above, the TA should discover the endpoint 10.12.195.160:9091 instead of 10.12.195.160:9900; we expose /metrics at 9091. So when the TA passes this wrong endpoint to the metrics collector, the collector fails to scrape the Prometheus endpoint.
Target discovery should find the correct Prometheus endpoint, right?
It does discover the correct port and endpoint. You later want to modify what it discovered, and use the port from the annotation instead of the port exposed by the container. This is currently not done by the target allocator, but rather the collector. The collector will get the target labels as exposed by the target allocator, apply your relabelling steps, and then actually scrape the target.
Unless the collector is not actually scraping the correct targets, this is working as intended.
As an aside, we would like to make the target allocator work the way you expect it to, but this is unfortunately more complicated than one would naively expect.
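To make that division of labor concrete, here is a minimal sketch (using the pod-annotation convention from the scrape config later in this thread, and the example ports from the screenshot above):

```yaml
# What the target allocator hands to the collector (raw discovery labels):
#   __address__: "10.12.195.160:9900"        # port exposed by the container
#   __meta_kubernetes_pod_annotation_prometheus_io_port: "9091"
#
# The collector then applies the relabel step itself before scraping:
relabel_configs:
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2   # inside an OpenTelemetryCollector CR, escape as $$1:$$2
    target_label: __address__
# Result: the scrape request goes to 10.12.195.160:9091.
```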
@swiatekm Thanks for the explanation; I will try to add the relabeling steps. Could you also link me to the guide?
The collector should already be applying the relabel config you provided. I know you linked the target allocator config above, but could you also link the collector CRD you are using?
I am using opentelemetry-operator 0.98.0
Okay, can you also link the collector configuration?
```yaml
exporters:
  otlphttp:
    disable_keep_alives: true
    endpoint: http://${METADATA_METRICS_SVC}.${NAMESPACE}.svc.cluster.local.:4318
    sending_queue:
      num_consumers: 10
      queue_size: 10000
      storage: file_storage
extensions:
  file_storage:
    compaction:
      directory: /tmp
      on_rebound: true
    directory: /var/lib/storage/otc
    timeout: 10s
  health_check: {}
  pprof: {}
processors:
  batch:
    send_batch_max_size: 2000
    send_batch_size: 1000
    timeout: 1s
  filter/drop_stale_datapoints:
    metrics:
      datapoint:
        - flags == FLAG_NO_RECORDED_VALUE
  transform/drop_unnecessary_attributes:
    error_mode: ignore
    metric_statements:
      - context: resource
        statements:
          - delete_key(attributes, "http.scheme")
          - delete_key(attributes, "net.host.name")
          - delete_key(attributes, "net.host.port")
          - delete_key(attributes, "service.instance.id")
          - delete_matching_keys(attributes, "k8s.*")
  transform/extract_sum_count_from_histograms:
    error_mode: ignore
    metric_statements:
      - context: metric
        statements:
          - extract_sum_metric(true) where IsMatch(name, "^(apiserver_request_duration_seconds|coredns_dns_request_duration_seconds|kubelet_runtime_operations_duration_seconds)$")
          - extract_count_metric(true) where IsMatch(name, "^(apiserver_request_duration_seconds|coredns_dns_request_duration_seconds|kubelet_runtime_operations_duration_seconds)$")
receivers:
  prometheus:
    config:
      global:
        scrape_interval: 30s
      scrape_configs:
        - job_name: pod-annotations
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: keep
              regex: true
              source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_scrape
            - action: replace
              regex: (.+)
              source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              source_labels:
                - __address__
                - __meta_kubernetes_pod_annotation_prometheus_io_port
              target_label: __address__
            - action: replace
              regex: (.*)
              replacement: $1
              separator: ;
              source_labels:
                - __metrics_path__
              target_label: endpoint
            - action: replace
              source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: replace
              regex: (.*)
              replacement: $1
              separator: ;
              source_labels:
                - __meta_kubernetes_pod_name
              target_label: pod
        - authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_labels: true
          job_name: kubelet
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: keep
              regex: (?:kubelet_docker_operations_errors(?:|_total)|kubelet_(?:docker|runtime)_operations_duration_seconds_(?:count|sum)|kubelet_running_(?:container|pod)(?:_count|s)|kubelet_(:?docker|runtime)_operations_latency_microseconds(?:|_count|_sum))
              source_labels:
                - __name__
            - action: labeldrop
              regex: id
          relabel_configs:
            - source_labels:
                - __meta_kubernetes_node_name
              target_label: node
            - replacement: https-metrics
              target_label: endpoint
            - action: replace
              source_labels:
                - __metrics_path__
              target_label: metrics_path
            - action: replace
              source_labels:
                - __address__
              target_label: instance
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
        - authorization:
            credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          honor_labels: true
          job_name: cadvisor
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - action: replace
              regex: .*
              replacement: kubelet
              source_labels:
                - __name__
              target_label: job
            - action: keep
              regex: (?:container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_fs_usage_bytes|container_fs_limit_bytes|container_cpu_cfs_throttled_seconds_total|container_network_receive_bytes_total|container_network_transmit_bytes_total)
              source_labels:
                - __name__
            - action: drop
              regex: (?:container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_fs_usage_bytes|container_fs_limit_bytes);$
              source_labels:
                - __name__
                - container
            - action: labelmap
              regex: container_name
              replacement: container
            - action: drop
              regex: POD
              source_labels:
                - container
            - action: labeldrop
              regex: (id|name)
          metrics_path: /metrics/cadvisor
          relabel_configs:
            - replacement: https-metrics
              target_label: endpoint
            - action: replace
              source_labels:
                - __metrics_path__
              target_label: metrics_path
            - action: replace
              source_labels:
                - __address__
              target_label: instance
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
service:
  extensions:
    - health_check
    - pprof
    - file_storage
  pipelines:
    metrics:
      exporters:
        - otlphttp
      processors:
        - batch
        - filter/drop_stale_datapoints
        - transform/extract_sum_count_from_histograms
        - transform/drop_unnecessary_attributes
      receivers:
        - prometheus
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
```
Will update tomorrow when I'm back in the office. There is some confusion on my side about how the operator/TA passes the config to the metrics collector.
I think your issue is that you are not escaping the $ in your OTel configuration; please refer to the prometheusreceiver configuration here.
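Concretely (a sketch of the escaping issue; the collector config supports environment variable substitution, so literal dollar signs in relabel replacements need to be doubled):

```yaml
# Unescaped: $1 and $2 are treated as variable references and expand to
# empty strings before the config reaches the Prometheus receiver:
replacement: $1:$2
# Escaped: a literal "$1:$2" survives rendering, so the relabeler sees it:
replacement: $$1:$$2
```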
using collector config:
The configuration you linked above doesn't match the configuration in that screenshot. Also, if you can, please avoid sending screenshots of config; working with text is much easier for debugging and reproduction purposes.
I think this behavior of the target allocator causes "out of order sample" errors on prometheusremotewrite in my setup. I have a replica set of 3 collectors, and it looks like the allocator assigns the same target (with different ports) to different collectors.
Here is an example of an istiod pod which exposes 3 ports:

http://allocator:8080/jobs/kubernetes-pods/targets?collector_id=adot-amp-collector-1
```json
{
  "targets": [
    "10.4.29.97:8080"
  ],
  "labels": {
    "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
    "__meta_kubernetes_pod_label_istio_io_rev": "default",
    "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
    "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
    "__meta_kubernetes_pod_container_port_number": "8080",
    "__meta_kubernetes_pod_container_port_name": "",
    "__meta_kubernetes_namespace": "istio-system",
    "__meta_kubernetes_pod_host_ip": "10.4.27.160",
    "__meta_kubernetes_pod_phase": "Running",
    "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
    "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
    "__meta_kubernetes_pod_ready": "true",
    "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
    "__meta_kubernetes_pod_container_port_protocol": "TCP",
    "__meta_kubernetes_pod_ip": "10.4.29.97",
    "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true",
    "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
    "__meta_kubernetes_pod_label_istio": "pilot",
    "__address__": "10.4.29.97:8080",
    "__meta_kubernetes_pod_label_app": "istiod",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
    "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
    "__meta_kubernetes_pod_labelpresent_istio": "true",
    "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
    "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
    "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
    "__meta_kubernetes_pod_container_init": "false",
    "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
    "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
    "__meta_kubernetes_pod_labelpresent_app": "true",
    "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
    "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_container_name": "discovery",
    "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
    "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
    "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
    "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
    "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true"
  }
}
```
http://allocator:8080/jobs/kubernetes-pods/targets?collector_id=adot-amp-collector-2
```json
[
  {
    "targets": [
      "10.4.29.97:15010"
    ],
    "labels": {
      "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
      "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
      "__meta_kubernetes_pod_container_port_number": "15010",
      "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
      "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_label_istio_io_rev": "default",
      "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true",
      "__meta_kubernetes_pod_container_name": "discovery",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_labelpresent_app": "true",
      "__meta_kubernetes_pod_ready": "true",
      "__meta_kubernetes_pod_label_istio": "pilot",
      "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
      "__meta_kubernetes_pod_phase": "Running",
      "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
      "__meta_kubernetes_pod_container_port_protocol": "TCP",
      "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
      "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
      "__meta_kubernetes_namespace": "istio-system",
      "__meta_kubernetes_pod_label_app": "istiod",
      "__meta_kubernetes_pod_container_init": "false",
      "__meta_kubernetes_pod_container_port_name": "",
      "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
      "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
      "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
      "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
      "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
      "__meta_kubernetes_pod_host_ip": "10.4.27.160",
      "__address__": "10.4.29.97:15010",
      "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
      "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
      "__meta_kubernetes_pod_labelpresent_istio": "true",
      "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
      "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
      "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
      "__meta_kubernetes_pod_ip": "10.4.29.97"
    }
  },
  {
    "targets": [
      "10.4.29.97:15017"
    ],
    "labels": {
      "__meta_kubernetes_pod_host_ip": "10.4.27.160",
      "__meta_kubernetes_pod_phase": "Running",
      "__meta_kubernetes_pod_labelpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_annotationpresent_kubernetes_io_psp": "true",
      "__meta_kubernetes_pod_container_port_name": "",
      "__meta_kubernetes_pod_name": "istiod-8497d4fb88-bltf8",
      "__meta_kubernetes_pod_node_name": "ip-10-4-27-160.ec2.internal",
      "__meta_kubernetes_pod_annotation_prometheus_io_port": "15014",
      "__meta_kubernetes_pod_container_port_protocol": "TCP",
      "__meta_kubernetes_pod_labelpresent_istio": "true",
      "__meta_kubernetes_pod_ready": "true",
      "__meta_kubernetes_pod_labelpresent_pod_template_hash": "true",
      "__meta_kubernetes_pod_ip": "10.4.29.97",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_label_pod_template_hash": "8497d4fb88",
      "__meta_kubernetes_pod_controller_name": "istiod-8497d4fb88",
      "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
      "__meta_kubernetes_pod_container_image": "docker.io/istio/pilot:1.16.2",
      "__meta_kubernetes_pod_annotationpresent_prometheus_io_port": "true",
      "__meta_kubernetes_pod_uid": "4df268f4-2a2f-4e44-9fad-187e764cbe2e",
      "__meta_kubernetes_namespace": "istio-system",
      "__meta_kubernetes_pod_container_port_number": "15017",
      "__meta_kubernetes_pod_container_name": "discovery",
      "__meta_kubernetes_pod_container_id": "containerd://9f74e6c936018d96394dc019febbe7104677b621c74d097ad0751ab373c75754",
      "__meta_kubernetes_pod_controller_kind": "ReplicaSet",
      "__meta_kubernetes_pod_annotation_kubernetes_io_psp": "eks.privileged",
      "__meta_kubernetes_pod_labelpresent_app": "true",
      "__meta_kubernetes_pod_labelpresent_install_operator_istio_io_owning_resource": "true",
      "__meta_kubernetes_pod_label_install_operator_istio_io_owning_resource": "unknown",
      "__meta_kubernetes_pod_label_operator_istio_io_component": "Pilot",
      "__meta_kubernetes_pod_labelpresent_operator_istio_io_component": "true",
      "__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_inject": "true",
      "__meta_kubernetes_pod_annotation_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_container_init": "false",
      "__address__": "10.4.29.97:15017",
      "__meta_kubernetes_pod_label_istio": "pilot",
      "__meta_kubernetes_pod_label_sidecar_istio_io_inject": "false",
      "__meta_kubernetes_pod_label_app": "istiod",
      "__meta_kubernetes_pod_label_istio_io_rev": "default",
      "__meta_kubernetes_pod_labelpresent_istio_io_rev": "true"
    }
  }
]
```
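If that is what is happening, the numbers above would line up as follows (a sketch, assuming each collector applies the usual prometheus.io/port relabel rule to the targets it was assigned):

```yaml
# Assigned (pre-relabel) targets, spread across the replica set:
#   adot-amp-collector-1: 10.4.29.97:8080
#   adot-amp-collector-2: 10.4.29.97:15010, 10.4.29.97:15017
#
# After each collector rewrites __address__ from the annotation
# prometheus.io/port: "15014":
#   adot-amp-collector-1 scrapes 10.4.29.97:15014
#   adot-amp-collector-2 scrapes 10.4.29.97:15014 (its two targets collapse)
#
# Two collectors now scrape the same endpoint and export identical series,
# which prometheusremotewrite can reject as out-of-order samples.
```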
@astryia I don't think that would be causing an out-of-order error unless each target (ip:port combo) results in the exact same time series. If you can reliably reproduce this, I would recommend opening a new issue. I'm going to close this issue for now as it seems to have gone stale.
Component(s)
target allocator
What happened?
Description
Target allocator discovery ignores the address relabeling in the Prometheus receiver config.
Steps to Reproduce
Configure the target allocator with the following:
Expected Result
The TA should discover the targets with the port from __meta_kubernetes_pod_annotation_prometheus_io_port (e.g., 9113) instead of discovering targets with the ports exposed by the container (e.g., 8080).
Actual Result
It discovers all targets with the ports exposed by the container. For example, if the pod exposes 14250 and 4317 but the metrics port is 9113, the discovered targets are:
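For illustration, the mismatch described above looks like this (hypothetical, abridged pod spec; only the ports and annotation come from the example in this report):

```yaml
# Hypothetical pod spec (abridged):
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9113"    # the actual metrics port
spec:
  containers:
    - ports:
        - containerPort: 14250
        - containerPort: 4317
# Expected discovered target: <pod-ip>:9113
# Actual discovered targets:  <pod-ip>:14250 and <pod-ip>:4317
```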
Kubernetes Version
1.28.7
Operator version
0.49.0/
Collector version
adot-operator-targetallocator:0.94.1
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
Log output
Additional context
We deployed two operators: one is adot-operator and the other is opentelemetry-operator. We are currently using opentelemetry-operator with AWS images...