influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Prometheus plugin - Kafka metrics is missing meta labels as tags #11537

Open tomklapka opened 2 years ago

tomklapka commented 2 years ago

Relevant telegraf.conf

[[inputs.prometheus]]
      monitor_kubernetes_pods = false
      response_timeout = "5s"
      metric_version = 2
      kubernetes_services = [
        "http://kafka-metrics.default:9308/metrics",
        "http://kafka-jmx-metrics.default:5556/metrics"
        ]
      bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
      insecure_skip_verify = true

Logs from Telegraf

2022-07-19T14:37:11Z I! Starting Telegraf 1.23.2
2022-07-19T14:37:11Z I! Loaded inputs: internal prometheus
2022-07-19T14:37:11Z I! Loaded aggregators: 
2022-07-19T14:37:11Z I! Loaded processors: 
2022-07-19T14:37:11Z I! Loaded outputs: influxdb_v2 (2x)
2022-07-19T14:37:11Z I! Tags enabled: host=telegraf-prometheus-kafka-77d6b657c-zvmxl
2022-07-19T14:37:11Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"telegraf-prometheus-kafka-77d6b657c-zvmxl", Flush Interval:10s
2022-07-19T14:37:11Z D! [agent] Initializing plugins
2022-07-19T14:37:11Z D! [agent] Connecting outputs
2022-07-19T14:37:11Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-07-19T14:37:11Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-07-19T14:37:11Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-07-19T14:37:11Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-07-19T14:37:11Z D! [agent] Starting service inputs
2022-07-19T14:37:11Z I! Config watcher started
2022-07-19T14:37:21Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:37:21Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:37:31Z D! [outputs.influxdb_v2] Wrote batch of 50 metrics in 122.566406ms
2022-07-19T14:37:31Z D! [outputs.influxdb_v2] Buffer fullness: 199 / 50000 metrics
2022-07-19T14:37:31Z D! [outputs.influxdb_v2] Wrote batch of 50 metrics in 453.601643ms
2022-07-19T14:37:31Z D! [outputs.influxdb_v2] Buffer fullness: 199 / 50000 metrics
2022-07-19T14:37:41Z D! [outputs.influxdb_v2] Wrote batch of 199 metrics in 29.847702ms
2022-07-19T14:37:41Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:37:41Z D! [outputs.influxdb_v2] Wrote batch of 199 metrics in 231.098462ms
2022-07-19T14:37:41Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:37:51Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:37:51Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:38:01Z D! [outputs.influxdb_v2] Wrote batch of 247 metrics in 13.764863ms
2022-07-19T14:38:01Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:38:01Z D! [outputs.influxdb_v2] Wrote batch of 247 metrics in 233.323289ms
2022-07-19T14:38:01Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics
2022-07-19T14:38:11Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 50000 metrics

System info

Telegraf 1.23.2, AWS EKS 1.20, Kafka installed via Bitnami Helm chart

Docker

No response

Steps to reproduce

Metric example I got from the Prometheus server (scraped via a ServiceMonitor):

kafka_controller_controllerchannelmanager_queuesize_value{broker_id="1", container="jmx-exporter", endpoint="http-metrics", instance="172.30.2.222:5556", job="kafka-jmx-metrics", namespace="default", pod="kafka-1", service="kafka-jmx-metrics"}

Prometheus metric example I got from the Kafka metrics exporter service endpoint:

# HELP kafka_controller_controllerchannelmanager_totalqueuesize_value Attribute exposed for management kafka.controller:name=TotalQueueSize,type=ControllerChannelManager,attribute=Value
# TYPE kafka_controller_controllerchannelmanager_totalqueuesize_value untyped
kafka_controller_controllerchannelmanager_totalqueuesize_value 0.0

InfluxDB line protocol example I got from Telegraf:

prometheus,address=10.100.4.63,host=telegraf-prometheus-kafka-77d6b657c-zvmxl,url=http://kafka-jmx-metrics.default:5556/metrics kafka_controller_controllerchannelmanager_totalqueuesize_value=0 1556813561098000000

Expected behavior

Additional tags like broker_id="1", container="jmx-exporter", endpoint="http-metrics", instance="172.30.2.222:5556", job="kafka-jmx-metrics", namespace="default", pod="kafka-1", service="kafka-jmx-metrics" should be present on each metric.

Actual behavior

I'm missing the additional meta tags like broker_id="1", container="jmx-exporter", endpoint="http-metrics", instance="172.30.2.222:5556", job="kafka-jmx-metrics", namespace="default", pod="kafka-1", service="kafka-jmx-metrics". With the current output I'm not able to distinguish between different Kafka brokers, and all broker metrics are mixed together.

Is it possible to add Prometheus meta labels as tags in the prometheus plugin? Maybe it can be done with some configuration option which I've missed.

Additional info

kafka-metrics-exporter-list.txt

powersj commented 2 years ago

Hi,

Hmm our readme says:

...tags are created for each label.

I went looking and it does look like our prometheus parser will read labels. If I dump a similar metric into a file:

kafka_controller_controllerchannelmanager_queuesize_value{broker_id="1", container="jmx-exporter", endpoint="http-metrics", instance="172.30.2.222:5556", job="kafka-jmx-metrics", namespace="default", pod="kafka-1", service="kafka-jmx-metrics"} 3

And then read the file using the Prometheus data format:

[agent]
  omit_hostname = true

[[outputs.file]]

[[inputs.file]]
  files = ["data.json"]
  data_format = "prometheus"

I get a metric that includes all the labels:

prometheus,broker_id=1,container=jmx-exporter,endpoint=http-metrics,instance=172.30.2.222:5556,job=kafka-jmx-metrics,namespace=default,pod=kafka-1,service=kafka-jmx-metrics kafka_controller_controllerchannelmanager_queuesize_value=3 1658436936000000000

I then hosted that file and used the Prometheus input plugin to read it:

[agent]
  omit_hostname = true

[[outputs.file]]

[[inputs.prometheus]]
  urls = ["http://localhost:8000/metrics.out"]
  metric_version = 2

And got a similar metric with those same tags plus the url tag:

prometheus,broker_id=1,container=jmx-exporter,endpoint=http-metrics,instance=172.30.2.222:5556,job=kafka-jmx-metrics,namespace=default,pod=kafka-1,service=kafka-jmx-metrics,url=http://localhost:8000/metrics.out kafka_controller_controllerchannelmanager_queuesize_value=3 1658437291000000000

The difference is that you are using kubernetes_services, which uses slightly different logic to get the URLs to scrape. However, all URLs eventually end up in the same slice that is collected from at each gather interval, and the scraping logic is the same for each.
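
One way to double-check that on your cluster would be to point urls directly at the same service address and see whether the labels come through. An untested sketch, reusing the settings from your config:

[[inputs.prometheus]]
  ## Same endpoint as in the kubernetes_services list, but via the plain
  ## urls path, to rule out the service-discovery code path
  urls = ["http://kafka-jmx-metrics.default:5556/metrics"]
  metric_version = 2
  response_timeout = "5s"
  bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
  insecure_skip_verify = true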

There is some difference in how tags are handled for pods, but that path is not used in your config.
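
For reference, the pod-monitoring path would look roughly like the sketch below. I did not run this as part of this triage, so treat the option names and the exact pod-level tags it adds as things to verify against the readme for your Telegraf version:

[[inputs.prometheus]]
  ## Discover pods annotated with prometheus.io/scrape=true and tag the
  ## resulting metrics with pod-level information such as namespace and pod name
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "default"
  metric_version = 2
  bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
  insecure_skip_verify = true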

Are you using any processors? Does your full config have any taginclude or tagexclude options?

edit: I also tried a few examples from your attached kafka-metrics-exporter-list.txt, like kafka_exporter_build_info and promhttp_metric_handler_requests_total=4853 and those reported correctly with the labels.

tomklapka commented 2 years ago

Hi Joshua, I looked into it more deeply and discovered that Prometheus uses service discovery meta labels for internal (re)labeling purposes. This explains the existence of the additional labels in Prometheus: they are not exposed on Kafka's exporter metrics endpoint and are therefore not consumed by the Telegraf plugin. It would be nice to have such a mechanism in the prometheus plugin, because without it, it can be impossible to distinguish between different metric sources (e.g. Kafka replicas) when scraping metrics from a single endpoint (e.g. a service).

[screenshot attached]

powersj commented 2 years ago

This explains the existence of the additional labels in Prometheus: they are not exposed on Kafka's exporter metrics endpoint and are therefore not consumed by the Telegraf plugin.

I am not sure I follow this statement, so please confirm whether I have it right:

It sounds like the source of your metrics is a Kafka exporter producing Prometheus metrics for consumption. You are using the Telegraf prometheus input with service discovery, so one URL in your Telegraf config can resolve to multiple Kafka exporters' URLs. These Kafka exporters do not have any labels in their metrics (I think that is what the screenshot shows?). When Telegraf grabs these metrics from multiple Kafka exporters, there is no clear way to determine which metric belongs to which exporter, because the only tag is the url where the service was originally discovered from?

After writing this out, this feels like a better fit for the kube_inventory input plugin to scrape the K8s metadata.
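
Something along these lines, as a rough sketch based on that plugin's sample config (option names are worth double-checking against the kube_inventory readme):

[[inputs.kube_inventory]]
  ## Kubernetes API endpoint; adjust for your cluster
  url = "https://kubernetes.default.svc"
  ## Limit collection to the namespace the Kafka pods run in
  namespace = "default"
  ## Same service account token as the prometheus input
  bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
  ## Only gather the resources needed to map pods and services
  resource_include = ["pods", "services"]
  insecure_skip_verify = true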

tomklapka commented 2 years ago

The source of the metrics is Kafka installed via the Bitnami Helm chart. It uses services to expose the metrics endpoints: one routes traffic to/from the Kafka exporter pod, the second to the JMX exporter running as a sidecar container. The Kafka exporter pod can have multiple broker/replica endpoints configured. When Telegraf grabs metrics from the exporter service, only three Telegraf/plugin-specific tags are assigned: url, host, and address. The challenge is to distinguish which broker/replica each measurement came from. Prometheus itself can do it.
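
A workaround I may try is to scrape each broker's JMX exporter sidecar directly through the chart's headless service, so that the url tag at least identifies the broker. The hostnames below are assumptions based on Bitnami chart naming conventions, not something I have verified:

[[inputs.prometheus]]
  metric_version = 2
  ## Hypothetical per-broker endpoints; "kafka-headless" and the pod names
  ## are assumed from the Bitnami chart defaults
  urls = [
    "http://kafka-0.kafka-headless.default.svc.cluster.local:5556/metrics",
    "http://kafka-1.kafka-headless.default.svc.cluster.local:5556/metrics"
  ]
  insecure_skip_verify = true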