influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

Weird behavior of json_v2 parser with multiple input lines and lots of decimal places #13805

Closed agiilit-admin closed 11 months ago

agiilit-admin commented 1 year ago

Relevant telegraf.conf

[[inputs.tail]]
  files = ["./test-input.txt"]
  from_beginning = true
  data_format = "json_v2"

  [[inputs.tail.json_v2]]
    [[inputs.tail.json_v2.object]]
      path = "{fields}"

    [[inputs.tail.json_v2.object.field]]
      path = "fields.kwh_import"
      rename = "o1"

    [[inputs.tail.json_v2.object.field]]
      path = "fields.kwh_export"
      rename = "o2"

    [[inputs.tail.json_v2.object.field]]
      path = "fields.freq"
      rename = "o3"

[[outputs.file]]
  files = ["./test-output.txt"]

Logs from Telegraf

$ telegraf --debug --config test-config.conf
2023-08-21T16:47:49Z I! Loading config: test-config.conf
2023-08-21T16:47:49Z I! Starting Telegraf 1.27.3
2023-08-21T16:47:49Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-08-21T16:47:49Z I! Loaded inputs: tail
2023-08-21T16:47:49Z I! Loaded aggregators:
2023-08-21T16:47:49Z I! Loaded processors:
2023-08-21T16:47:49Z I! Loaded secretstores:
2023-08-21T16:47:49Z I! Loaded outputs: file
2023-08-21T16:47:49Z I! Tags enabled: host=ai-2
2023-08-21T16:47:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ai-2", Flush Interval:10s
2023-08-21T16:47:49Z D! [agent] Initializing plugins
2023-08-21T16:47:49Z D! [agent] Connecting outputs
2023-08-21T16:47:49Z D! [agent] Attempting connection to [outputs.file]
2023-08-21T16:47:49Z D! [agent] Successfully connected to outputs.file
2023-08-21T16:47:49Z D! [agent] Starting service inputs
2023-08-21T16:47:49Z D! [inputs.tail] Tail added for "test-input.txt"
2023-08-21T16:47:59Z D! [outputs.file] Wrote batch of 3 metrics in 89.699µs
2023-08-21T16:47:59Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2023-08-21T16:48:05Z D! [agent] Stopping service inputs
2023-08-21T16:48:05Z D! [inputs.tail] Tail removed for "test-input.txt"
2023-08-21T16:48:05Z D! [agent] Input channel closed
2023-08-21T16:48:05Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-08-21T16:48:05Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2023-08-21T16:48:05Z I! [agent] Stopping running outputs
2023-08-21T16:48:05Z D! [agent] Stopped Successfully

System info

Telegraf 1.27.3, Debian 12

Docker

No response

Steps to reproduce

  1. Prepare input file ./test-input.txt as follows:
    {"fields":{"freq":50.02000045776367,"kwh_export":0,"kwh_import":568.5}}
    {"fields":{"freq":50.0099983215332,"kwh_export":0,"kwh_import":568.5}}
    {"fields":{"freq":50,"kwh_export":0,"kwh_import":568.5}}
  2. Run telegraf --debug --config test-config.conf
  3. Observe ./test-output.txt

Expected behavior

Each line of output should contain fields o1, o2, o3:

tail,host=ai-2,path=test-input.txt o3=50.02000045776367,o2=0,o1=568.5 1692636469843257713
tail,host=ai-2,path=test-input.txt o3=50.0099983215332,o2=0,o1=568.5 1692636469843312200
tail,host=ai-2,path=test-input.txt o3=50,o2=0,o1=568.5 1692636469843327461

Actual behavior

Last line of output is missing field o1:

tail,host=ai-2,path=test-input.txt o3=50.02000045776367,o2=0,o1=568.5 1692636469843257713
tail,host=ai-2,path=test-input.txt o3=50.0099983215332,o2=0,o1=568.5 1692636469843312200
tail,host=ai-2,path=test-input.txt o3=50,o2=0 1692636469843327461

Additional info

Number of decimals in freq field seems to have significance. If input is changed to:

{"fields":{"freq":50.02,"kwh_export":0,"kwh_import":568.5}}
{"fields":{"freq":50.01,"kwh_export":0,"kwh_import":568.5}}
{"fields":{"freq":50,"kwh_export":0,"kwh_import":568.5}}

The output is as expected.

Also, there need to be multiple lines before "freq":50 for the anomaly to occur.

Another workaround is to increase decimals for the last freq:

{"fields":{"freq":50.02000045776367,"kwh_export":0,"kwh_import":568.5}}
{"fields":{"freq":50.0099983215332,"kwh_export":0,"kwh_import":568.5}}
{"fields":{"freq":50.00,"kwh_export":0,"kwh_import":568.5}}

..producing:

tail,host=ai-2,path=test-input.txt o3=50.02000045776367,o2=0,o1=568.5 1692638034737406013
tail,host=ai-2,path=test-input.txt o3=50.0099983215332,o2=0,o1=568.5 1692638034737459133
tail,host=ai-2,path=test-input.txt o3=50,o2=0,o1=568.5 1692638034737474051

The anomaly was first observed with AMQP, but it seems to be reproducible with the tail input plugin.

powersj commented 1 year ago

The anomaly was first observed with AMQP, but it seems to be reproducible with the tail input plugin.

This is not an issue with any input plugin, rather with the JSON v2 parser itself. The first version of telegraf I was able to reproduce with was v1.21.0, which introduce the object field/tags in #9449.

Adding some debugging it appears that the parser is currently adding the kwh_export/o2 value twice. First, correctly as 0 and again as the kwh_import/o1 568.5 value. The parent node index that is used is somehow wrong. That results in the wrong rename getting used and the o2 value getting updated again and no o1 value.

I believe this is all related to the indexing and lookups that happen, somewhere long the way they get off because of the previous values. Specifically, the call to p.existsInpathResults(parentIndex) looks up the wrong index and returns o2 value. Those index values come from combineObject() here

agiilit-admin commented 1 year ago

Related: there have been incidents where simultaneously with errors in fields also tag values exctracted from the JSON have ended up with wrong tag names, but I am unable to repeat the behavior at the moment. The config involved both object field and object tag. Btw, there were two name objects in the JSON document, one at the top level and one as tags.name.

That problem (with tags) went away when replacing

[[inputs.tail.json_v2.object]]
  path = "@this"

with

[[inputs.tail.json_v2.object]]
  path = "{fields,tags}"
agiilit-admin commented 1 year ago

If an extra "name":"foo" is added to each JSON document:

{"fields":{"freq":50.02000045776367,"kwh_export":0,"kwh_import":568.5},"name":"foo"}
{"fields":{"freq":50.0099983215332,"kwh_export":0,"kwh_import":568.5},"name":"foo"}
{"fields":{"freq":50,"kwh_export":0,"kwh_import":568.5},"name":"foo"}

and included with json_v2.object path:

[[inputs.tail.json_v2.object]]
  path = "{fields,name}"

it ends up as kwh_import/o1 value in last line of output:

tail,host=ai-2,path=test-input.txt o3=50.02000045776367,o2=0,o1=568.5 1692690843010670721
tail,host=ai-2,path=test-input.txt o3=50.0099983215332,o2=0,o1=568.5 1692690843010756312
tail,host=ai-2,path=test-input.txt o3=50,o2=0,o1="foo" 1692690843010788352
agiilit-admin commented 11 months ago

The problem here seems to be that parse results accumulate in p.subPathResults over multiple calls of parseCriticalPath(). When length of JSON elements vary a lot (as in the provided examples), probability of erroneous index match increases. Sooner or later existsInpathResults() starts misbehaving, and produces the results above.

Setting p.subPathResults = nil in parseCriticalPath() makes this particular issue go away. Having no experience in telegraf internals (or go, for that matter), I have no suggestions on the correct way to fix it.

Could not replicate above behaviour with json_v2 tests with multiple inputs. Have seen it on amqp_consumer, though.