influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[inputs.prometheus] Only the last instance of plugin is enabled since Telegraf 1.26 #13405

Closed · mhoyer closed this issue 1 year ago

mhoyer commented 1 year ago

Relevant telegraf.conf

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 200
  metric_buffer_limit = 200000
  collection_jitter = "1s"
  flush_interval = "20s"
  flush_jitter = "1s"
  precision = ""
  debug = true
  quiet = false
  logtarget = "stderr"
  hostname = ""
  omit_hostname = false

[global_tags]
  k8s_cluster = "playground-dev"

# → handler focusing on k8s resource states:
[[inputs.kube_inventory]]
  alias = "in.kube_inventory"
  url = "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT"
  namespace = ""
  tls_ca = "/run/secrets/kubernetes.io/serviceaccount/ca.crt"

  [inputs.kube_inventory.tags]
    target_influx_bucket = "k8s-cluster-info"

# → handler focusing on generic pod/container runtime metrics:
[[inputs.kubernetes]]
  alias = "in.kubernetes"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tls_ca = "/run/secrets/kubernetes.io/serviceaccount/ca.crt"

  [inputs.kubernetes.tags]
    target_influx_bucket = "k8s-cluster-info"

# → push handler (listener) for application specific metrics.
[[inputs.influxdb_v2_listener]]
  alias = "in.influxdb_v2_listener"
  service_address = ":8086"

  [inputs.influxdb_v2_listener.tags]
    target_influx_bucket = "k8s-playground"

# → The internal plugin collects metrics about the telegraf agent itself.
[[inputs.internal]]
  [inputs.internal.tags]
    target_influx_bucket = "k8s-playground"

# → poll handler (scraper) for application specific metrics.

## Namespace "default"
[[inputs.prometheus]]
  alias = "in.prometheus-v1@default"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "default"

  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  pod_scrape_interval = 30
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tagexclude = ["prometheus.io/*","vault.hashicorp.com/*","kubernetes.io/*"]

  [inputs.prometheus.tags]
    target_influx_bucket = "k8s-playground"

## Namespace "argocd"
[[inputs.prometheus]]
  alias = "in.prometheus-v1@argocd"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "argocd"

  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  pod_scrape_interval = 30
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tagexclude = ["prometheus.io/*","vault.hashicorp.com/*","kubernetes.io/*"]

  [inputs.prometheus.tags]
    target_influx_bucket = "k8s-playground"

## Namespace "kyverno"
[[inputs.prometheus]]
  alias = "in.prometheus-v1@kyverno"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "kyverno"

  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  pod_scrape_interval = 30
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  tagexclude = ["prometheus.io/*","vault.hashicorp.com/*","kubernetes.io/*"]

  [inputs.prometheus.tags]
    target_influx_bucket = "k8s-playground"

[[outputs.influxdb_v2]]
  alias = "out.iflx_v2"
  urls = [ "http://influx-common-dev:8086" ]
  token = "${INFLUX_TOKEN}"
  organization = ""
  bucket_tag = "target_influx_bucket"
  exclude_bucket_tag = true
  timeout = "10s" # default is 5s
  tagexclude = ["address","host","pod-template-hash","url"]

Logs from Telegraf

2023-06-08T07:09:18Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-08T07:09:18Z I! Starting Telegraf 1.26.3
2023-06-08T07:09:18Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-06-08T07:09:18Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T07:09:18Z I! Loaded aggregators:
2023-06-08T07:09:18Z I! Loaded processors:
2023-06-08T07:09:18Z I! Loaded secretstores:
2023-06-08T07:09:18Z I! Loaded outputs: influxdb_v2
2023-06-08T07:09:18Z I! Tags enabled: host=telegraf-c69b69f47-5djkk k8s_cluster=playground-dev
2023-06-08T07:09:18Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-c69b69f47-5djkk", Flush Interval:20s
2023-06-08T07:09:18Z D! [agent] Initializing plugins
2023-06-08T07:09:18Z D! [agent] Connecting outputs
2023-06-08T07:09:18Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T07:09:18Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T07:09:18Z D! [agent] Starting service inputs
2023-06-08T07:09:18Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T07:09:19Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"
2023-06-08T07:09:20Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 31.845185ms
2023-06-08T07:09:20Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 167 / 200000 metrics

System info

Telegraf 1.26.3

Docker

No response

Steps to reproduce

  1. Run Telegraf 1.26.3 with the config above (including several instances of the inputs.prometheus plugin).
  2. Check the logs (and the metrics actually written to the target output) to see whether all prometheus scraping inputs are active.

Expected behavior

All inputs.prometheus instances should scrape.

Actual behavior

To me it looks like only the last instance of the inputs.prometheus plugin in the configuration is actually active.

With --debug enabled this is easy to see in the logs by looking for "will scrape metrics from" lines: there is only a single entry, belonging to the inputs.prometheus plugin aliased in.prometheus-v1@kyverno:

D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"

Switching back to Telegraf 1.25.3 with the exact same telegraf.conf produces the expected log entries (and, of course, metrics written to our InfluxDB):

D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.40:8080/metrics"
D! [inputs.prometheus::in.prometheus-v1@argocd] will scrape metrics from "http://10.180.139.14:8083/metrics"
D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"
D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.32:8080/metrics"
D! [inputs.prometheus::in.prometheus-v1@argocd] will scrape metrics from "http://10.180.139.26:8082/metrics"
D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.2:8080/metrics"

↑ See full log of 1.25.3 below. ↓

Additional info

Here, for comparison, are the logs of a Telegraf 1.25.3 instance with the exact same telegraf.conf:

2023-06-08T07:31:47Z I! Starting Telegraf 1.25.3
2023-06-08T07:31:47Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-06-08T07:31:47Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T07:31:47Z I! Loaded aggregators:
2023-06-08T07:31:47Z I! Loaded processors:
2023-06-08T07:31:47Z I! Loaded secretstores:
2023-06-08T07:31:47Z I! Loaded outputs: influxdb_v2
2023-06-08T07:31:47Z I! Tags enabled: host=telegraf-85c4856d84-62mfb k8s_cluster=playground-dev
2023-06-08T07:31:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-85c4856d84-62mfb", Flush Interval:20s
2023-06-08T07:31:47Z D! [agent] Initializing plugins
2023-06-08T07:31:47Z D! [agent] Connecting outputs
2023-06-08T07:31:47Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T07:31:47Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T07:31:47Z D! [agent] Starting service inputs
2023-06-08T07:31:47Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.40:8080/metrics"
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@argocd] will scrape metrics from "http://10.180.139.14:8083/metrics"
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.32:8080/metrics"
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@argocd] will scrape metrics from "http://10.180.139.26:8082/metrics"
2023-06-08T07:31:48Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.2:8080/metrics"
2023-06-08T07:31:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 46.169943ms
2023-06-08T07:31:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 183 / 200000 metrics

mhoyer commented 1 year ago

I am not sure whether this is a general issue with multiple aliased instances of this (or even any) input plugin, or whether it is specific to the Kubernetes-style usage of the plugin instances via monitor_kubernetes_pods.

redbaron commented 1 year ago

I know it may seem odd, but does setting the GOMAXPROCS=1 env var change anything for you?

Or better, could you please check with the nightly build (https://github.com/influxdata/telegraf/blob/master/docs/NIGHTLIES.md)? There is an unreleased change that might fix the issue you see.

mhoyer commented 1 year ago

quay.io/influxdb/telegraf-nightly:alpine (@sha256:d04031) did not help. On top of that, I now also see other issues popping up:

time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."

and

2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope

Full log with this version:

2023-06-08T11:53:24Z I! Loading config: /etc/telegraf/telegraf.conf
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
2023-06-08T11:53:24Z I! Starting Telegraf 1.27.0-ebe34610
2023-06-08T11:53:24Z I! Available plugins: 236 inputs, 9 aggregators, 28 processors, 23 parsers, 58 outputs, 4 secret-stores
2023-06-08T11:53:24Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T11:53:24Z I! Loaded aggregators:
2023-06-08T11:53:24Z I! Loaded processors:
2023-06-08T11:53:24Z I! Loaded secretstores:
2023-06-08T11:53:24Z I! Loaded outputs: influxdb_v2
2023-06-08T11:53:24Z I! Tags enabled: host=telegraf-5d6855587d-xqhjt k8s_cluster=playground-dev
2023-06-08T11:53:24Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-5d6855587d-xqhjt", Flush Interval:20s
2023-06-08T11:53:24Z D! [agent] Initializing plugins
2023-06-08T11:53:24Z D! [agent] Connecting outputs
2023-06-08T11:53:24Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T11:53:24Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T11:53:24Z D! [agent] Starting service inputs
2023-06-08T11:53:24Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.2:8080/metrics"
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.40:8080/metrics"
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.32:8080/metrics"
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope
2023-06-08T11:53:30Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 41.414384ms
2023-06-08T11:53:30Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 295 / 200000 metrics
2023-06-08T11:53:33Z E! [inputs.prometheus::in.prometheus-v1@default] Error in plugin: error making HTTP request to "http://10.177.104.2:8080/metrics": Get "http://10.177.104.2:8080/metrics": dial tcp 10.177.104.2:8080: connect: no route to host
2023-06-08T11:53:40Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:40Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope
2023-06-08T11:53:40Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 15.64543ms
2023-06-08T11:53:40Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 71 / 200000 metrics

The logs still show only one of the three inputs.prometheus plugins scraping anything. However, now it is not the in.prometheus-v1@kyverno aliased instance but the in.prometheus-v1@default one.

I will give GOMAXPROCS a try now.

mhoyer commented 1 year ago

↑ With the non-alpine quay.io/influxdb/telegraf-nightly:latest (@sha256:f49d9a) the logs look very similar.

redbaron commented 1 year ago

I wasn't able to reproduce it with 1.26.3 and we use 7 prometheus plugins.

Could you try adding monitor_kubernetes_pods_method = "settings+annotations"? This will make telegraf print a log message during plugin initialization like the following:

2023-06-08T12:51:13Z I! [inputs.prometheus::kube-state-metrics] Using the label selector: app=kube-state-metrics and field selector: 

for each plugin instance. This way we'd know that each plugin is at least initialized.

Also, could you take a goroutine dump sometime after the collection interval has passed (that's 10 seconds in your case)? To do that, send SIGQUIT to the telegraf process, for example:

kubectl exec telegraf-78d786f54-x5zt7 -- kill -QUIT 1
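
The goroutine dump itself is a built-in Go runtime feature, not anything telegraf-specific. As a minimal illustration (a throwaway example program, not telegraf code): sending SIGQUIT to any Go process makes the runtime print the stack traces of all goroutines to stderr and terminate.

package main

import "time"

// Run this, then `kill -QUIT <pid>` from another shell: the Go runtime
// prints the stack traces of all goroutines to stderr and exits, which is
// exactly what the kubectl command above triggers inside the telegraf pod.
func main() {
	go func() { // an extra goroutine so the dump shows more than main
		for {
			time.Sleep(time.Second)
		}
	}()
	time.Sleep(time.Hour)
}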

mhoyer commented 1 year ago

I now also tried the GOMAXPROCS=1 environment variable → no success.

Next attempt is your suggestion with monitor_kubernetes_pods_method = "settings+annotations".

redbaron commented 1 year ago

Next attempt is your suggestion with monitor_kubernetes_pods_method = "settings+annotations".

It shouldn't change the behaviour; it just confirms that the plugins are initialized.

mhoyer commented 1 year ago

This is the log output with Telegraf 1.26.3 and monitor_kubernetes_pods_method = "settings+annotations" enabled:

2023-06-08T14:50:47Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-08T14:50:47Z I! Starting Telegraf 1.26.3
2023-06-08T14:50:47Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-06-08T14:50:47Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T14:50:47Z I! Loaded aggregators:
2023-06-08T14:50:47Z I! Loaded processors:
2023-06-08T14:50:47Z I! Loaded secretstores:
2023-06-08T14:50:47Z I! Loaded outputs: influxdb_v2
2023-06-08T14:50:47Z I! Tags enabled: host=telegraf-6f79bf567f-rxm9b k8s_cluster=playground-dev
2023-06-08T14:50:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-6f79bf567f-rxm9b", Flush Interval:20s
2023-06-08T14:50:47Z D! [agent] Initializing plugins
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@default] Using the label selector:  and field selector: status.phase=Running
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@argocd] Using the label selector:  and field selector: status.phase=Running
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@kyverno] Using the label selector:  and field selector: status.phase=Running
2023-06-08T14:50:47Z D! [agent] Connecting outputs
2023-06-08T14:50:47Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T14:50:47Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T14:50:47Z D! [agent] Starting service inputs
2023-06-08T14:50:47Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T14:50:48Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"
2023-06-08T14:50:48Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.6:9102/metrics"
2023-06-08T14:50:50Z E! [inputs.prometheus::in.prometheus-v1@kyverno] Error in plugin: error making HTTP request to "http://10.182.136.6:9102/metrics": Get "http://10.182.136.6:9102/metrics": dial tcp 10.182.136.6:9102: connect: connection refused
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 34.90943ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 49 / 200000 metrics
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 35.43929ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 100 / 200000 metrics
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 44.891682ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 41 / 200000 metrics
2023-06-08T14:51:00Z E! [inputs.prometheus::in.prometheus-v1@kyverno] Error in plugin: error making HTTP request to "http://10.182.136.6:9102/metrics": Get "http://10.182.136.6:9102/metrics": dial tcp 10.182.136.6:9102: connect: connection refused
2023-06-08T14:51:00Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 67.004594ms
2023-06-08T14:51:00Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 198 / 200000 metrics

mhoyer commented 1 year ago

I now also ran kill -QUIT 1 and dumped the output to the attached file. The Telegraf 1.26.3 instance had been running for around 2 minutes.

telegraf.godump.txt

Could it be related to the alpine image (which we use) vs. the non-alpine one?

PS: don't be confused by the four instances of inputs.prometheus in the dump; a fourth namespace was being monitored in this example.

mhoyer commented 1 year ago

I wasn't able to reproduce it with 1.26.3 and we use 7 prometheus plugins.

Strange that you cannot reproduce it. Do you need any further information from my side? Would it be an option to add more logging output to the plugin?

In the meantime I also tried to reduce the complexity of our configuration. But even with this simplistic configuration, only one plugin scrapes metrics. I also noticed that slight changes to the config sometimes make a different plugin become the active one; it seems to be kind of "random" whether the default or the kyverno instance is "picked":

[agent]
  debug = true
  quiet = false
  logtarget = "stderr"
  omit_hostname = true

[[inputs.prometheus]]
  alias = "in.prometheus-v1@default"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "default"

  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

[[inputs.prometheus]]
  alias = "in.prometheus-v1@kyverno"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "kyverno"

  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

[[outputs.file]]

These are the concrete kubernetes Deployment and ConfigMap files I played with just now:

---
kind: ConfigMap
apiVersion: v1
metadata:
  name: telegraf-issue
  namespace: default
data:
  telegraf.conf: |
    [agent]
      debug = true
      quiet = false
      logtarget = "stderr"
      omit_hostname = true

    [[inputs.prometheus]]
      alias = "in.prometheus-v1@default"
      monitor_kubernetes_pods = true
      monitor_kubernetes_pods_namespace = "default"

      node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

    [[inputs.prometheus]]
      alias = "in.prometheus-v1@kyverno"
      monitor_kubernetes_pods = true
      monitor_kubernetes_pods_namespace = "kyverno"

      node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

    [[outputs.file]]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    reloader.stakater.com/auto: "true"
  name: telegraf-issue
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf-issue
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf-issue
    spec:
      containers:
      - command: [ "telegraf", "--config", "/etc/telegraf/telegraf.conf" ]
        image: telegraf:1.26.3-alpine
        imagePullPolicy: Always
        name: telegraf
        resources:
          requests: { cpu: 10m, memory: 100Mi }
          limits: { memory: 100Mi }
        volumeMounts:
          - mountPath: /etc/telegraf
            name: telegraf-config
      restartPolicy: Always
      volumes:
      - configMap:
          defaultMode: 420
          name: telegraf-issue
        name: telegraf-config

Our kubernetes cluster is based on TKGI and (still) runs v1.23.17.

redbaron commented 1 year ago

Thanks for the effort to create a minimal reproducer. One difference between our setups is that I don't have a case where monitor_kubernetes_pods_namespace differs between instances. I'll prepare something to test.

In the meantime, could you try running telegraf without monitor_kubernetes_pods_namespace? You'll need the following in each config:

monitor_kubernetes_pods_method = "settings+annotations"  # to make field selector work
kubernetes_field_selector = "metadata.namespace=kyverno" # to filter by namespace, alter for other plugin instances

redbaron commented 1 year ago

That must be it. I am certain the behaviour you see occurs because https://github.com/influxdata/telegraf/pull/13063 doesn't account for multiple namespaces: it limits all shared informers to the namespace of whichever plugin instance happened to initialize the global InformerFactory first.
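
For illustration, here is a minimal Go sketch of that pattern (hypothetical names, not the actual telegraf code): a package-level factory guarded by sync.Once keeps the namespace of whichever plugin instance initializes it first, and every later instance silently reuses it.

package main

import (
	"fmt"
	"sync"
)

// informerFactory stands in for the shared Kubernetes informer factory,
// which is scoped to a single namespace when it is created.
type informerFactory struct {
	namespace string
}

var (
	once    sync.Once
	factory *informerFactory
)

// getFactory mimics a global, lazily initialized shared factory: the
// namespace argument is only honored on the very first call.
func getFactory(namespace string) *informerFactory {
	once.Do(func() {
		factory = &informerFactory{namespace: namespace}
	})
	return factory
}

func main() {
	// Three plugin instances, each configured for its own namespace.
	for _, ns := range []string{"default", "argocd", "kyverno"} {
		f := getFactory(ns)
		fmt.Printf("plugin for %q actually watches %q\n", ns, f.namespace)
	}
	// All three end up watching whichever namespace won the race to
	// initialize the factory first ("default" in this sequential run).
}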

redbaron commented 1 year ago

I think the correct way to handle this is to be correct first, that is, respect the namespace per plugin instance, and optimized second, that is, use shared informers.

To do that I suggest the following changes:

mhoffm-aiven commented 1 year ago

FWIW: I just tested locally with 2 podman containers running prometheus, 2 prometheus input plugins, and version 1.27 of telegraf, and it collected from both instances; I think this issue is related to something kubernetes-y.

Edit: just noticed @redbaron's comment; that shared informer looks racy as it's only initialized once with a namespace, right? https://github.com/influxdata/telegraf/blob/36906fcee3adc8aab7b9d622d4eab97a7c6ef6c9/plugins/inputs/prometheus/kubernetes.go#L144
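
Purely as an illustrative sketch of the direction discussed above (hypothetical names; not the actual telegraf code and not necessarily the eventual fix), sharing could be keyed per namespace, so that instances watching the same namespace still share an informer while instances for different namespaces stay independent:

package main

import (
	"fmt"
	"sync"
)

type informerFactory struct {
	namespace string
}

// factoryPool hands out one shared factory per namespace: plugin instances
// watching the same namespace share an informer, while instances for
// different namespaces no longer clobber each other.
type factoryPool struct {
	mu        sync.Mutex
	factories map[string]*informerFactory
}

func newFactoryPool() *factoryPool {
	return &factoryPool{factories: make(map[string]*informerFactory)}
}

func (p *factoryPool) get(namespace string) *informerFactory {
	p.mu.Lock()
	defer p.mu.Unlock()
	if f, ok := p.factories[namespace]; ok {
		return f
	}
	f := &informerFactory{namespace: namespace}
	p.factories[namespace] = f
	return f
}

func main() {
	pool := newFactoryPool()
	for _, ns := range []string{"default", "argocd", "kyverno", "kyverno"} {
		f := pool.get(ns)
		fmt.Printf("plugin for %q watches %q\n", ns, f.namespace)
	}
}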