Metric tracking within external plugins

Wimmelwaldi commented 1 month ago

Relevant telegraf.conf

[agent]
  interval = "10s"
  debug = true
  logtarget = "stderr"
  logfile = ""

[[inputs.execd]]
  command = ["/etc/telegraf/extern_plugin/mqtt_consumer", "-config", "/etc/telegraf/extern_plugin/plugin.conf"]
  signal = "none"
  data_format = "influx"
  name_override = "device"
  tags = {device = "default_device"}

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "xxx"
  organization = "yyy"
  bucket = "zzz"
  tagexclude = ["device", "asset"]

  data_format = "json"

  [outputs.influxdb_v2.tagpass]
    device = ["default_device"]

Logs from Telegraf

telegraf | 2024-06-25T12:36:32Z D! [outputs.influxdb_v2] Wrote batch of 2 metrics in 13.371321ms
telegraf | 2024-06-25T12:36:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telegraf | 2024-06-25T12:36:42Z D! [outputs.influxdb_v2] Wrote batch of 2 metrics in 5.369446ms
telegraf | 2024-06-25T12:36:42Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telegraf | 2024-06-25T12:36:52Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telegraf | 2024-06-25T12:37:02Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telegraf | 2024-06-25T12:37:12Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telegraf | 2024-06-25T12:37:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
...

System info

Telegraf 1.31, Windows 10

Docker

No response

Steps to reproduce

Probably the easiest way to reproduce this:

Convert the input plugin mqtt_consumer into an external plugin

Expected behavior

Receive feedback when a metric has been successfully written to stdout so that the plugin does not get stuck.

Actual behavior

The maximum amount of processed messages will be equal to max_undelivered_messages because the metrics are never getting acknowledged.

Additional info

I had already created a post about this in the community forum: https://community.influxdata.com/t/metric-tracking-with-external-plugin/34710

I think it is a serious bug that the methods for metric tracking in an external plugin do not work at all, because the accumulator is provided by shim and is processed by it as well. The metrics were simply neglected to be acknowledged under certain circumstances.

Hence my suggestion, to use the two methods m.Drop() and m.Reject() after the call in this line https://github.com/influxdata/telegraf/blob/v1.31.0/plugins/common/shim/goshim.go#L121

Yes, the functionality of metric tracking when used in an external plugin would then differ somewhat from the usual application, but it would still provide useful functionality under these circumstances.

powersj commented 1 month ago

Hi,

I haven't responded to the forums as I need to dig in and understand the implications of making this change to existing users. In general, no one has ever mentioned tracking metrics with the shim before, so it is unlikely it would impact anyone, but we need to ensure that how we acknowledge the metrics and when is done correctly.

Let me talk to folks on Monday and get back to you.

powersj commented 2 weeks ago

Hi,

My apologies for the delay, I got sick last week and then we had some holiday time.

After chatting with the team, I think we agree that it makes sense to make a change to Telegraf to call m.Accept on the metric. This way the metrics are always considered accepted, internal tracking metrics will show a correct count, rather than a large number of dropped metrics, and it ensures that the max_undelivered_messages will correctly decrement as well.

An additional clear statement to the readme would be needed to ensure that users understand that the plugin calls accepted no matter what.

Are you willing to put up a PR to make this change?

Thanks!

influxdata / telegraf