I am not sure whether it's a general issue with multiple aliased instances of this (or even all) input plugins, or whether it's specific to the Kubernetes-style usage of the plugin instances via monitor_kubernetes_pods.
I know it may be odd, but does setting "GOMAXPROCS=1" env var change anything for you?
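If it helps, a minimal way to try that in the Deployment used in this thread (a sketch only, names taken from the manifests further below) would be adding an env entry to the telegraf container:
env:
  - name: GOMAXPROCS
    value: "1"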
Or better, could you check with the nightly build, please: https://github.com/influxdata/telegraf/blob/master/docs/NIGHTLIES.md ? There is an unreleased change which might fix the issue you see.
quay.io/influxdb/telegraf-nightly:alpine (@sha256:d04031) did not work. Instead, I now also see other issues popping up:
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
and
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope
Full log with this version:
2023-06-08T11:53:24Z I! Loading config: /etc/telegraf/telegraf.conf
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
time="2023-06-08T11:53:24Z" level=warning msg="The \"NODE_IP\" variable is not set. Defaulting to a blank string."
2023-06-08T11:53:24Z I! Starting Telegraf 1.27.0-ebe34610
2023-06-08T11:53:24Z I! Available plugins: 236 inputs, 9 aggregators, 28 processors, 23 parsers, 58 outputs, 4 secret-stores
2023-06-08T11:53:24Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T11:53:24Z I! Loaded aggregators:
2023-06-08T11:53:24Z I! Loaded processors:
2023-06-08T11:53:24Z I! Loaded secretstores:
2023-06-08T11:53:24Z I! Loaded outputs: influxdb_v2
2023-06-08T11:53:24Z I! Tags enabled: host=telegraf-5d6855587d-xqhjt k8s_cluster=playground-dev
2023-06-08T11:53:24Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-5d6855587d-xqhjt", Flush Interval:20s
2023-06-08T11:53:24Z D! [agent] Initializing plugins
2023-06-08T11:53:24Z D! [agent] Connecting outputs
2023-06-08T11:53:24Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T11:53:24Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T11:53:24Z D! [agent] Starting service inputs
2023-06-08T11:53:24Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.2:8080/metrics"
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.40:8080/metrics"
2023-06-08T11:53:24Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.32:8080/metrics"
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:30Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope
2023-06-08T11:53:30Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 41.414384ms
2023-06-08T11:53:30Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 295 / 200000 metrics
2023-06-08T11:53:33Z E! [inputs.prometheus::in.prometheus-v1@default] Error in plugin: error making HTTP request to "http://10.177.104.2:8080/metrics": Get "http://10.177.104.2:8080/metrics": dial tcp 10.177.104.2:8080: connect: no route to host
2023-06-08T11:53:40Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: resourcequotas is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "resourcequotas" in API group "" at the cluster scope
2023-06-08T11:53:40Z E! [inputs.kube_inventory::in.kube_inventory] Error in plugin: secrets is forbidden: User "system:serviceaccount:xxx:telegraf" cannot list resource "secrets" in API group "" at the cluster scope
2023-06-08T11:53:40Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 15.64543ms
2023-06-08T11:53:40Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 71 / 200000 metrics
Still, you can see in the logs that only one of the three inputs.prometheus plugins starts to scrape anything. Though now it's not the in.prometheus-v1@kyverno aliased one, but the in.prometheus-v1@default instance.
I will give GOMAXPROCS a try now.
↑ With the non-alpine quay.io/influxdb/telegraf-nightly:latest (@sha256:f49d9a) it looks very similar in the logs.
I wasn't able to reproduce it with 1.26.3 and we use 7 prometheus plugins.
Could you try adding monitor_kubernetes_pods_method = "settings+annotations"? This will make Telegraf print a log message during plugin initialization like the following:
2023-06-08T12:51:13Z I! [inputs.prometheus::kube-state-metrics] Using the label selector: app=kube-state-metrics and field selector:
for each plugin instance. This way we'd know that each plugin is at least initialized.
Also, could you make a goroutine dump sometime after the interval (that's 10 seconds in your case)? To do that, send SIGQUIT to the Telegraf process, for example:
kubectl exec telegraf-78d786f54-x5zt7 -- kill -QUIT 1
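The goroutine dump is written to the container's stderr, so it should be retrievable from the container logs afterwards, for example (add --previous if the container restarted after the signal):
kubectl logs telegraf-78d786f54-x5zt7 > goroutine-dump.txt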
I now also tried the GOMAXPROCS=1 environment variable → no success.
Next attempt is your suggestion with monitor_kubernetes_pods_method = "settings+annotations".
Next attempt is your suggestion with monitor_kubernetes_pods_method = "settings+annotations".
It shouldn't change behaviour; it just confirms that the plugins are initialized.
This is the log output with Telegraf 1.26.3 and monitor_kubernetes_pods_method = "settings+annotations" enabled:
2023-06-08T14:50:47Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-08T14:50:47Z I! Starting Telegraf 1.26.3
2023-06-08T14:50:47Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-06-08T14:50:47Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (3x)
2023-06-08T14:50:47Z I! Loaded aggregators:
2023-06-08T14:50:47Z I! Loaded processors:
2023-06-08T14:50:47Z I! Loaded secretstores:
2023-06-08T14:50:47Z I! Loaded outputs: influxdb_v2
2023-06-08T14:50:47Z I! Tags enabled: host=telegraf-6f79bf567f-rxm9b k8s_cluster=playground-dev
2023-06-08T14:50:47Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-6f79bf567f-rxm9b", Flush Interval:20s
2023-06-08T14:50:47Z D! [agent] Initializing plugins
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@default] Using the label selector: and field selector: status.phase=Running
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@argocd] Using the label selector: and field selector: status.phase=Running
2023-06-08T14:50:47Z I! [inputs.prometheus::in.prometheus-v1@kyverno] Using the label selector: and field selector: status.phase=Running
2023-06-08T14:50:47Z D! [agent] Connecting outputs
2023-06-08T14:50:47Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2]
2023-06-08T14:50:47Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2
2023-06-08T14:50:47Z D! [agent] Starting service inputs
2023-06-08T14:50:47Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-06-08T14:50:48Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.3:8000/metrics"
2023-06-08T14:50:48Z D! [inputs.prometheus::in.prometheus-v1@kyverno] will scrape metrics from "http://10.182.136.6:9102/metrics"
2023-06-08T14:50:50Z E! [inputs.prometheus::in.prometheus-v1@kyverno] Error in plugin: error making HTTP request to "http://10.182.136.6:9102/metrics": Get "http://10.182.136.6:9102/metrics": dial tcp 10.182.136.6:9102: connect: connection refused
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 34.90943ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 49 / 200000 metrics
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 35.43929ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 100 / 200000 metrics
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 44.891682ms
2023-06-08T14:50:50Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 41 / 200000 metrics
2023-06-08T14:51:00Z E! [inputs.prometheus::in.prometheus-v1@kyverno] Error in plugin: error making HTTP request to "http://10.182.136.6:9102/metrics": Get "http://10.182.136.6:9102/metrics": dial tcp 10.182.136.6:9102: connect: connection refused
2023-06-08T14:51:00Z D! [outputs.influxdb_v2::out.iflx_v2] Wrote batch of 200 metrics in 67.004594ms
2023-06-08T14:51:00Z D! [outputs.influxdb_v2::out.iflx_v2] Buffer fullness: 198 / 200000 metrics
I now also ran kill -QUIT 1 and dumped the output to the attached file. The Telegraf 1.26.3 instance had been running for around 2 minutes.
Could it be related to alpine (which we use) vs. non-alpine images?
PS: don't be confused by the four instances of inputs.prometheus in the dump. It's just a fourth namespace that was monitored in this example.
I wasn't able to reproduce it with 1.26.3 and we use 7 prometheus plugins.
Strange that you cannot reproduce it. Do you need any further information from my side? Would it be an option to add more logging output to the plugin?
In the meantime I also tried to reduce the complexity of our configuration. But even with this minimal configuration it only scrapes metrics from one plugin. I also noticed that slight changes to the config sometimes make one or the other plugin become the chosen one. It seems to be kind of "random" whether the default or the kyverno instance is "picked":
[agent]
  debug = true
  quiet = false
  logtarget = "stderr"
  omit_hostname = true

[[inputs.prometheus]]
  alias = "in.prometheus-v1@default"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "default"
  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

[[inputs.prometheus]]
  alias = "in.prometheus-v1@kyverno"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_namespace = "kyverno"
  node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

[[outputs.file]]
These are the concrete Kubernetes Deployment and ConfigMap files I played with just now:
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: telegraf-issue
  namespace: default
data:
  telegraf.conf: |
    [agent]
      debug = true
      quiet = false
      logtarget = "stderr"
      omit_hostname = true
    [[inputs.prometheus]]
      alias = "in.prometheus-v1@default"
      monitor_kubernetes_pods = true
      monitor_kubernetes_pods_namespace = "default"
      node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    [[inputs.prometheus]]
      alias = "in.prometheus-v1@kyverno"
      monitor_kubernetes_pods = true
      monitor_kubernetes_pods_namespace = "kyverno"
      node_ip = "$NODE_IP" # to avoid INFO logs "The config node_ip is empty or invalid. Using NODE_IP env var as default."
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
    [[outputs.file]]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    reloader.stakater.com/auto: "true"
  name: telegraf-issue
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf-issue
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf-issue
    spec:
      containers:
        - command: [ "telegraf", "--config", "/etc/telegraf/telegraf.conf" ]
          image: telegraf:1.26.3-alpine
          imagePullPolicy: Always
          name: telegraf
          resources:
            requests: { cpu: 10m, memory: 100Mi }
            limits: { memory: 100Mi }
          volumeMounts:
            - mountPath: /etc/telegraf
              name: telegraf-config
      restartPolicy: Always
      volumes:
        - configMap:
            defaultMode: 420
            name: telegraf-issue
          name: telegraf-config
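For anyone wanting to run this reproducer, applying the manifests and tailing the logs would look roughly like this (the file name telegraf-issue.yaml is hypothetical):
kubectl apply -f telegraf-issue.yaml
kubectl logs deployment/telegraf-issue -f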
Our Kubernetes cluster is based on TKGI and (still) runs v1.23.17.
Thanks for the effort to create a minimal reproducer. One difference between our setups is that I don't have a case where monitor_kubernetes_pods_namespace differs between instances. I'll prepare something to test.
In the meantime, could you try running Telegraf without monitor_kubernetes_pods_namespace? You'll need the following in each config:
monitor_kubernetes_pods_method = "settings+annotations" # to make field selector work
kubernetes_field_selector = "metadata.namespace=kyverno" # to filter by namespace, alter for other plugin instances
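For the two instances from the reproducer above, that would look roughly like this (a sketch only; the selector values are assumptions based on the namespaces used in this thread):
[[inputs.prometheus]]
  alias = "in.prometheus-v1@default"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_method = "settings+annotations" # required for the field selector to take effect
  kubernetes_field_selector = "metadata.namespace=default"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

[[inputs.prometheus]]
  alias = "in.prometheus-v1@kyverno"
  monitor_kubernetes_pods = true
  monitor_kubernetes_pods_method = "settings+annotations"
  kubernetes_field_selector = "metadata.namespace=kyverno"
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"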
That must be it. I am certain the behaviour you see is because https://github.com/influxdata/telegraf/pull/13063 doesn't account for multiple namespaces. It limits all shared informers to the namespace of the plugin which happened to initialize the global InformerFactory first.
I think the correct way to handle this is to be correct first, that is, respect the namespace per plugin instance, and optimized second, that is, use shared informers. To do that I suggest the following changes:
- namespace = "" key in the global shared informers map
Fwiw: I just tested locally with 2 podman containers running prometheus, 2 prometheus input plugins, and Telegraf 1.27, and it collected from both instances; I think this issue is related to something Kubernetes-y.
Edit: just noticed @redbaron's comment; that shared informer looks racy as it's only initialized once with a namespace, right? https://github.com/influxdata/telegraf/blob/36906fcee3adc8aab7b9d622d4eab97a7c6ef6c9/plugins/inputs/prometheus/kubernetes.go#L144
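To illustrate the suspected mechanism (a simplified sketch only, not the actual Telegraf code): a package-level value guarded by sync.Once keeps whatever namespace the first plugin instance supplies, so every later instance ends up watching that same namespace.
package main

import (
	"fmt"
	"sync"
)

// Sketch of the suspected pattern: the namespace used by the shared informer
// factory is captured exactly once, by whichever plugin instance runs first.
var (
	once            sync.Once
	sharedNamespace string
)

func informerNamespaceFor(ns string) string {
	once.Do(func() {
		sharedNamespace = ns // first caller wins; later namespaces are ignored
	})
	return sharedNamespace
}

func main() {
	for _, ns := range []string{"default", "kyverno"} {
		fmt.Printf("plugin configured for %q actually watches %q\n", ns, informerNamespaceFor(ns))
	}
	// prints:
	//   plugin configured for "default" actually watches "default"
	//   plugin configured for "kyverno" actually watches "default"
}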
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.26.3
Docker
No response
Steps to reproduce
… inputs.prometheus plugins)
Expected behavior
All inputs.prometheus instances should scrape.
Actual behavior
To me it looks like only the most recent instance of the inputs.prometheus plugin in the list of configuration items is actually active.
With --debug enabled it's easy to see this in the logs when looking for the "will scrape metrics from" lines. Hence, there's only a single entry for the in.prometheus-v1@kyverno aliased inputs.prometheus plugin:
Switching back to Telegraf 1.25.3 with the exact same telegraf.conf produces more log entries (and of course metrics written to our InfluxDB) as expected:
↑ See full log of 1.25.3 below. ↓
Additional info
Here, in comparison, are the logs of a 1.25.3 instance of Telegraf with the exact same telegraf.conf: