internal metrics mixed together if no alias defined on output plugins

shric commented 3 months ago

Relevant telegraf.conf

[[outputs.http]]
url="http://one.example.com/write"

[[outputs.http]]
url="http://two.example.com/write"

Logs from Telegraf

Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z D! [agent] Attempting connection to [outputs.http]
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z D! [agent] Successfully connected to outputs.http
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z D! [agent] Attempting connection to [outputs.prometheus_client]
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z I! [outputs.prometheus_client] Listening on https://0.0.0.0:9273/metrics
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z D! [agent] Successfully connected to outputs.prometheus_client
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z D! [agent] Starting service inputs
Aug 23 05:05:35 foo.example.com telegraf[796261]: 2024-08-23T10:05:35Z I! [inputs.socket_listener] Listening on tcp://0.0.0.0:8094
Aug 23 05:05:42 foo.example.com telegraf[796261]: 2024-08-23T10:05:42Z D! [outputs.http] Wrote batch of 8 metrics in 666.983697ms
Aug 23 05:05:42 foo.example.com telegraf[796261]: 2024-08-23T10:05:42Z D! [outputs.http] Buffer fullness: 0 / 100000 metrics
Aug 23 05:05:45 foo.example.com telegraf[796261]: 2024-08-23T10:05:45Z D! [outputs.http] Buffer fullness: 8 / 100000 metrics
Aug 23 05:05:45 foo.example.com telegraf[796261]: 2024-08-23T10:05:45Z E! [agent] Error writing to outputs.http: Post "https://two.example.com/write": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Aug 23 05:05:47 foo.example.com telegraf[796261]: 2024-08-23T10:05:47Z D! [outputs.http] Buffer fullness: 0 / 100000 metrics
Aug 23 05:05:51 foo.example.com telegraf[796261]: 2024-08-23T10:05:51Z D! [outputs.http] Buffer fullness: 8 / 100000 metrics
Aug 23 05:05:51 foo.example.com telegraf[796261]: 2024-08-23T10:05:51Z E! [agent] Error writing to outputs.http: Post "https://two.example.com/write": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

System info

telegraf-1.31.3-1.x86_64, Alma Linux 8.8

Docker

No response

Steps to reproduce

Add more than one http output, as in the telegraf.conf above.
Enable telegraf internal metrics.
Make one of the outputs an unreachable endpoint, so that the internal buffer fills on one output and not the other (to illustrate the bug in the metrics).
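
For the second step, a minimal sketch of the extra configuration, assuming "internal metrics" refers to the stock inputs.internal plugin with its default settings. The reporter's full config is not shown in this issue, so this is an illustration rather than their exact setup:

```toml
# Assumption: the internal plugin is enabled with default options.
# It collects Telegraf's own statistics, including the per-output
# write buffer size discussed in this issue.
[[inputs.internal]]
```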

Expected behavior

Separate internal_write_buffer_size metrics for each outputs.http instance. An additional url tag, for example, could disambiguate.

Actual behavior

Metrics such as internal_write_buffer_size{output="http"} report, seemingly at random, either 0 (for the reachable output) or a nonzero value (for the unreachable output) under the same metric.

Additional info

If you set, for example, alias="one" and alias="two" on the outputs in the above config, then you do get two unique metrics. However, it seems like a bug for telegraf to produce a single useless metric that flaps ambiguously between multiple values when aliases aren't defined.
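
The alias workaround can be sketched as follows, reusing the two outputs from the config at the top of this issue. The alias values "one" and "two" are just the examples mentioned here; any distinct strings should work:

```toml
# First http output, given its own alias.
[[outputs.http]]
alias = "one"
url = "http://one.example.com/write"

# Second http output, with a different alias so its internal
# stats are not merged with the first instance's.
[[outputs.http]]
alias = "two"
url = "http://two.example.com/write"
```

With distinct aliases, each instance should report its own internal_write_buffer_size metric instead of sharing one.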

srebhan commented 2 months ago

@shric the plugin has no way to propagate tags to the stats, which are collected at the model level, so defining an alias is the only way to disambiguate the two instances... Is there any reason not to use the alias mechanism?

shric commented 2 months ago

There is no reason not to use the alias mechanism. The bug is that the produced metrics without alias are useless. A number of solutions spring to mind:

Do not produce this metric if there are no aliases defined and multiple identical output plugins.
Require aliases to be defined if there are multiple identical output plugins and internal metrics are enabled.
Document clearly that if you don't define an alias and have multiple outputs then your metrics will be useless.
Anything else conceivable other than producing "random" metrics.

I consider this to be a bug, so I filed it as an FYI. We worked around it with aliases, but that doesn't mean it's not a bug. If you don't consider it a bug, that is fine. Thanks for taking a look.

srebhan commented 2 months ago

@shric I agree with you that this might be useless for multiple instances without an alias, as the metrics might also collide occasionally... However, we cannot change the current layout, as this might break things for existing users; otherwise I would have added the unique, auto-generated plugin ID to the metrics...

How about adding a clear statement to inputs.internal recommending an alias, and preparing to pass through the plugin ID but keeping that disabled for now?

shric commented 2 months ago

Hi @srebhan, that sounds like a good idea, thanks for giving this some thought. It isn't causing any problems for us, just thought I'd point out the issue.
