Let's take a look at each log line you provided:
dial tcp 10.15.218.113:9102: connect: connection refused
That is not a Telegraf issue. Your endpoint is not accepting connections. What would you expect Telegraf to do about this?
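As a sanity check outside Telegraf, you can verify whether the endpoint accepts TCP connections at all. A minimal sketch in Python (host and port taken from the log line above):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # "connection refused" and connect timeouts both land here
        return False

# e.g. the target from the log line:
# can_connect("10.15.218.113", 9102)
```

If this returns False, the scrape target itself is down or unreachable, and no Telegraf change will help.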
"http://10.15.218.96:15020/stats/prometheus": reading text format failed: text format parsing error in line 757: invalid metric name
You need to load that file up, go to line 757, read what the metric name is, and figure out why it is invalid. Probably not a Telegraf issue.
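One way to do that is to save the scrape body to a file and pull out the exact line the parser complains about. A small sketch (the saved filename is hypothetical):

```python
def nth_line(path: str, n: int) -> str:
    """Return line n (1-based) of a text file, counted the way the parser counts."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if i == n:
                return line.rstrip("\n")
    raise IndexError(f"file has fewer than {n} lines")

# e.g. after saving the scrape with
#   curl -s http://10.15.218.96:15020/stats/prometheus > scrape.txt
# print(nth_line("scrape.txt", 757))
```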
context deadline exceeded (Client.Timeout exceeded while awaiting headers)
This covers the rest of the log lines. We have a comment about this in our FAQ: namely, something in your networking is having issues. You went and upgraded your networking, so I have a good idea of where you might go looking.
Nothing from the above points to a required change in Telegraf. If you do think something needs an update, can you please:
Thanks @wanlonghenry for raising this. We (well the developers in our business that I support) are experiencing this issue but let me clarify a few things and provide some more context:
[[inputs.prometheus]]
urls = ["http://127.0.0.1:8090/"]
[[outputs.prometheus_client]]
listen = ":8080"
path = "/metrics"
collectors_exclude = ["gocollector","process"]
When we see an error similar to "text format parsing error in line 807: invalid metric name", we can curl the endpoint and see that this line just has "# HELP process_start_time_seconds Telegraf Collected metric".
Example metrics attached here: prom-metrics-merge.txt
As far as I can tell there are no special characters on that line.
Prior to this line there are other help lines, but for some reason it has an error with this particular one. If we pipe the output to "promtool check metrics", it is able to parse the metrics. It does have some warnings about some of the Istio metrics not having help text, but I don't think those would cause an invalid metric name error; also, all the metrics it warns about are Istio ones, which are being scraped.
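One thing worth ruling out is an invisible character (a BOM or stray control byte) that the terminal hides. A quick sketch to make any non-printable or non-ASCII characters on a line visible:

```python
def reveal(line: str) -> str:
    """Return the line with non-printable/non-ASCII characters shown as escapes."""
    return line.encode("unicode_escape").decode("ascii")

# A UTF-8 BOM in front of a HELP line is easy to miss in terminal output:
# reveal("\ufeff# HELP x ...")  ->  "\\ufeff# HELP x ..."
```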
We did try changing the Telegraf config for the workload to use metrics version 1 / 2, but that doesn't seem to help. To be honest, it's really hard to figure out what the issue is: whether it's Istio, the Telegraf sidecar, the AMA Telegraf, or a combination of these and the settings associated with them.
The problem is the error "text format parsing error in line 757: invalid metric name". Example metrics attached here: prom-metrics-merge.txt
Thank you for clarifying what the issue is and providing the prometheus metrics. Is the line number and corresponding metric name the same for every deployment? Or does it vary? Is it always the last line of the file?
Prior to this line there are other help lines but for some reason it has an error with this particular one. If we pipe the output to "promtool check metrics" - it is able to parse the metrics.
Telegraf uses the upstream Prometheus library to parse the data. In this case, "invalid metric name" comes from github.com/prometheus/common. A valid metric name is required to match [a-zA-Z_:][a-zA-Z0-9_:]*.
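For quick local checks, the same rule can be applied to any suspect name. A sketch using the pattern quoted above:

```python
import re

# The pattern the upstream Prometheus text parser requires metric names to match.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def is_valid_metric_name(name: str) -> bool:
    """True if the whole name matches the Prometheus metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None

# e.g. is_valid_metric_name("process_start_time_seconds") -> True
#      is_valid_metric_name("istio-requests")             -> False (hyphen)
```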
It does have some warnings about some of the Istio metrics not having help text - but I don't think those would cause an invalid metric name error, also all the metrics it has warnings for are Istio ones which are being scraped.
Agreed, not having help text would not stop Telegraf from reading the metrics. You can try dropping the help lines and should still see the metrics.
If I use the following config:
[agent]
debug = true
omit_hostname = true
[[inputs.prometheus]]
urls = ["http://127.0.0.1:8000/prom-metrics-merge.txt"]
[[outputs.file]]
I have tried parsing the metrics you provided with both v1.28.5 as well as master and neither produce any errors or warnings. Using your same config with the Prometheus Client output also resolves the lines as expected:
[agent]
debug = true
omit_hostname = true
[[inputs.prometheus]]
urls = ["http://127.0.0.1:8000/prom-metrics-merge.txt"]
[[outputs.prometheus_client]]
listen = ":8080"
path = "/metrics"
collectors_exclude = ["gocollector","process"]
$ ../telegraf-builds/telegraf-v1.28.5 --config config.toml
2024-06-06T14:04:49Z I! Loading config: config.toml
2024-06-06T14:04:49Z I! Starting Telegraf 1.28.5 brought to you by InfluxData the makers of InfluxDB
2024-06-06T14:04:49Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2024-06-06T14:04:49Z I! Loaded inputs: prometheus
2024-06-06T14:04:49Z I! Loaded aggregators:
2024-06-06T14:04:49Z I! Loaded processors:
2024-06-06T14:04:49Z I! Loaded secretstores:
2024-06-06T14:04:49Z I! Loaded outputs: prometheus_client
2024-06-06T14:04:49Z I! Tags enabled:
2024-06-06T14:04:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-06-06T14:04:49Z D! [agent] Initializing plugins
2024-06-06T14:04:49Z I! [inputs.prometheus] Using the label selector: and field selector:
2024-06-06T14:04:49Z D! [agent] Connecting outputs
2024-06-06T14:04:49Z D! [agent] Attempting connection to [outputs.prometheus_client]
2024-06-06T14:04:49Z I! [outputs.prometheus_client] Listening on http://[::]:8080/metrics
2024-06-06T14:04:49Z D! [agent] Successfully connected to outputs.prometheus_client
2024-06-06T14:04:49Z D! [agent] Starting service inputs
2024-06-06T14:04:59Z D! [outputs.prometheus_client] Wrote batch of 278 metrics in 1.984879ms
2024-06-06T14:04:59Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2024-06-06T14:05:09Z D! [outputs.prometheus_client] Wrote batch of 278 metrics in 1.671008ms
2024-06-06T14:05:09Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
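For reference, the file URL in the configs above assumes the saved metrics file is being served by a local static HTTP server from the directory containing prom-metrics-merge.txt. A minimal sketch of such a server (the port matches the urls entry above):

```python
import http.server
import socketserver

def serve_current_dir(port: int = 8000) -> socketserver.TCPServer:
    """Create a static file server for the current directory.

    With port 8000 and prom-metrics-merge.txt in the current directory,
    http://127.0.0.1:8000/prom-metrics-merge.txt resolves to that file.
    """
    return socketserver.TCPServer(("", port), http.server.SimpleHTTPRequestHandler)

# e.g.
# httpd = serve_current_dir(8000)
# httpd.serve_forever()  # blocks; Ctrl-C to stop
```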
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem; if not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.28.5, Istio 1.21
Docker
No response
Steps to reproduce
Expected behavior
All metrics are collected when using Istio and Prometheus merge.
Actual behavior
With Prometheus merge enabled and Istio version 1.21+, some metrics were missing. Telegraf has compatibility issues with Istio 1.21+ that generate errors, including formatting and parsing errors.
Additional info
No response