influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[gNMI] Incorrect metric processing #15792

Open greenfox878 opened 2 weeks ago

greenfox878 commented 2 weeks ago

Relevant telegraf.conf

[[inputs.gnmi]]
  addresses = ["gnmi_dut:xxx"]

  ## define credentials
  username = "xxxxx"
  password = "xxxxx"

  encoding = "proto"
  tagexclude = ["path"]
  max_msg_size = "4MB"
  dump_responses = true

  [inputs.gnmi.tags]
    test_tag = "test"

  [[inputs.gnmi.subscription]]
    name = "cisco_xr_stats_qos_in"
    origin = "Cisco-IOS-XR-qos-ma-oper"
    path = "qos/interface-table/interface/input/service-policy-names/service-policy-instance/statistics"
    subscription_mode = "sample"
    sample_interval = "10s"
    suppress_redundant = false

  [[inputs.gnmi.subscription]]
    name = "cisco_xr_stats_qos_out"
    origin = "Cisco-IOS-XR-qos-ma-oper"
    path = "qos/interface-table/interface/output/service-policy-names/service-policy-instance/statistics"
    subscription_mode = "sample"
    sample_interval = "10s"
    suppress_redundant = false

Logs from Telegraf

No error messages. A metric dump is attached in the comment below.

System info

Telegraf 1.30.3, Ubuntu 22.04.3 LTS (latest updates)

Docker

No response

Steps to reproduce

A simple Starlark script:

[[processors.starlark]]
  order = 200
  namepass = ["cisco_xr_stats_qos_in", "cisco_xr_stats_qos_out"]
  ## Inline script; set either "source" or "script", not both.
  source = '''
load("logging.star", "log")

def apply(metric):
    # Log the number of fields, then every field of the metric.
    log.warn("!!!!!>>>>>> start processing new metric!")
    log.warn("!!!!!>>>>>> metric len: {}".format(str(len(metric.fields.keys()))))
    for path, value in metric.fields.items():
        log.warn("!!!!!>>>>>>metric name: {}, path:{}, value: {}".format(metric.name, str(path), str(value)))

    # Log every tag of the metric.
    log.warn("!!!!!>>>>>> start processing tags")
    for k, value in metric.tags.items():
        log.warn("!!!!!>>>>>>metric name: {}, path:{}, value: {}".format(metric.name, str(k), str(value)))

    # Pass the metric on unchanged; without a return value it is dropped.
    return metric
'''

Expected behavior

The Starlark script should iterate over all elements in the array; the dump contains 901 elements.

In [21]: len(d['update']['update'])
Out[21]: 901

Actual behavior

Starlark iterates only over the last elements of the array:

2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>> start processing new metric!
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>> metric len: 37
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:policy_name, value: EXTERNAL-EGRESS-QUEUING
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:state, value: active
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/counter_validity_bitmask, value: 7864320
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/class_name, value: class-default
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/cac_state, value: unknown
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/transmit_packets, value: 1212927705
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/transmit_bytes, value: 211143580313
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/total_drop_packets, value: 3662217
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/total_drop_bytes, value: 369239845
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/total_drop_rate, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/match_data_rate, value: 66
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/total_transmit_rate, value: 66
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/pre_policy_matched_packets, value: 1216589922
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/general_stats/pre_policy_matched_bytes, value: 211512820158
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_id, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/tail_drop_packets, value: 3662217
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/tail_drop_bytes, value: 369239845
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_instance_length/value, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_instance_length/unit, value: policy-param-unit-ms
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_average_length/value, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_average_length/unit, value: policy-param-unit-ms
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_max_length/value, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_max_length/unit, value: policy-param-unit-ms
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/queue_drop_threshold, value: 122368
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/forced_wred_stats_display, value: False
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/random_drop_packets, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/random_drop_bytes, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/max_threshold_packets, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/max_threshold_bytes, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/conform_packets, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/conform_bytes, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/exceed_packets, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/exceed_bytes, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/conform_rate, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:class_stats/queue_stats_array/exceed_rate, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:satid, value: 0
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:policy_timestamp, value: 1724934940835
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>> start processing tags
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:interface_name, value: Bundle-Ether70
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:service_policy_name, value: EXTERNAL-EGRESS-QUEUING
2024-08-29T12:35:45Z W! [processors.starlark] !!!!!>>>>>>metric name: cisco_xr_stats_qos_out, path:source, value: xxxxxxxxx

Additional info

No response

greenfox878 commented 2 weeks ago

update_dump.txt

srebhan commented 2 weeks ago

There is something strange with the gNMI message: the same path is sent multiple times with different values! For example, class-stats/class-name appears 8 times with different values. Because the path determines the field name, those 8 entries collide and end up as a single field. That's why you don't get the expected number of fields.
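A toy illustration of the collision described above, written in plain Python rather than Telegraf's Go code. The sample paths and values are made up, and `group_fields` is a hypothetical stand-in for the plugin's grouping step, assuming a later update with the same path overwrites the earlier one:

```python
# Hypothetical flattened updates in arrival order (values are made up).
updates = [
    ("class_stats/class_name", "tc-1"),
    ("class_stats/general_stats/total_drop_packets", 10),
    ("class_stats/class_name", "tc-2"),
    ("class_stats/general_stats/total_drop_packets", 20),
]


def group_fields(updates):
    """Collect updates into one field map keyed by path (assumed last write wins)."""
    fields = {}
    for path, value in updates:
        fields[path] = value  # a repeated path replaces the earlier value
    return fields


fields = group_fields(updates)
print(len(updates), len(fields))  # prints "4 2"
```

Under that assumption only the values of the last list entry survive, which matches the log output in the report, where only class-default shows up.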

How would you interpret the data?

greenfox878 commented 2 weeks ago

Yes, I noticed it. It looks like a "feature" of Cisco's native YANG model. The device returns an unkeyed list like: [ class_name: tc-1, drops: xxx, ... class_name: tc-2, drops: xxx, .... ] I'm going to create multiple metrics from this list with a Starlark script.

srebhan commented 2 weeks ago

The problem is that you cannot get the fields because they overwrite each other. Currently, all fields are collected into one metric if the metric name and tags are identical. We could add a "workaround" flag, say output_metric_per_field, and then avoid grouping the fields into a single metric.

greenfox878 commented 1 week ago

It would be great to have such an option, i.e. disable field grouping and let the user do whatever is needed with Starlark. I think that's more flexible than handling broken vendor models in Go code. I can add an example Starlark script for this particular case.

srebhan commented 1 week ago

Do you think "one metric per update-path/field" would be OK?

greenfox878 commented 1 week ago

I mean introducing an option, e.g. "disable_grouping_fields":

  [[inputs.gnmi.subscription]]
    ....
    disable_grouping_fields = true

If the user sets the option, they should consider using Starlark to process the metric. In general, the logic should be as follows: if the option is true, pass the metric "as is" to the processor block without field grouping; if no processor logic is applied to the metric, create one metric per field or drop the metric (the latter is better IMO).

srebhan commented 1 week ago

We cannot pass the metric as-is because the field names collide! There are two options to solve this: simply output one metric per "field" (aka update path), or append suffixes such as _1, _2 etc. to the field names.

The former is easy as we just need to skip the grouping step, but once the metrics are output there is no way to know which ones belong together, as they are all named the same. So your only bet is to use the metric order for grouping in Starlark.

The latter is more difficult on the gnmi-plugin side as we need to figure out whether the group already contains a field, so we probably need to adapt the metrics-grouper... However, this format is much simpler to process in Starlark as all fields belonging together will have the same suffix...
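The two options can be compared on a toy example. This is a plain-Python sketch, not the plugin's code: the sample updates are made up and both helper names are hypothetical. `split_by_order` assumes a new list entry begins whenever a path repeats. `add_suffixes` assumes the first occurrence keeps its bare name and later ones get _1, _2 and so on; the thread does not fix the exact numbering.

```python
# Hypothetical flattened updates in arrival order (values are made up).
updates = [
    ("class_stats/class_name", "tc-1"),
    ("class_stats/general_stats/total_drop_packets", 10),
    ("class_stats/class_name", "tc-2"),
    ("class_stats/general_stats/total_drop_packets", 20),
]


def split_by_order(updates):
    """Option 1: rely on ordering; start a new record when a path repeats."""
    records, current = [], {}
    for path, value in updates:
        if path in current:
            records.append(current)
            current = {}
        current[path] = value
    if current:
        records.append(current)
    return records


def add_suffixes(updates):
    """Option 2: keep one field map, renaming repeated paths with _1, _2, ..."""
    fields, seen = {}, {}
    for path, value in updates:
        count = seen.get(path, 0)
        key = path if count == 0 else "{}_{}".format(path, count)
        fields[key] = value
        seen[path] = count + 1
    return fields


print(split_by_order(updates))
print(add_suffixes(updates))
```

The order-based variant breaks down if updates arrive out of order or if a list is nested inside another repeated list, whereas the suffix variant keeps every value addressable by name.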

I need your opinion to judge which way to go. :-)

greenfox878 commented 1 week ago

Thank you for the detailed explanation! I vote for appending suffixes (_1, _2, etc) to the field names because it gives more predictable output.
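With suffixed field names, regrouping on the user side could look roughly like the sketch below. It is plain Python for illustration, not a ready-made Starlark script (for one, the `re` module used here is a Python standard-library module). The suffix scheme is an assumption: the bare name is the first list entry, and _1, _2 and so on are the later ones. `regroup` is a hypothetical helper and the field values are made up.

```python
import re

# Matches a trailing numeric suffix such as "_1" or "_12".
SUFFIX = re.compile(r"^(.*)_(\d+)$")


def regroup(fields):
    """Split one suffixed field map into one record per list entry."""
    records = {}
    for key, value in fields.items():
        match = SUFFIX.match(key)
        if match:
            name, index = match.group(1), int(match.group(2))
        else:
            name, index = key, 0  # bare name: assumed to be the first entry
        records.setdefault(index, {})[name] = value
    return [records[i] for i in sorted(records)]


fields = {
    "class_stats/class_name": "tc-1",
    "class_stats/general_stats/total_drop_packets": 10,
    "class_stats/class_name_1": "tc-2",
    "class_stats/general_stats/total_drop_packets_1": 20,
}
records = regroup(fields)
print(len(records))  # prints "2"
```

Fields that occur only once, such as policy_name, would land in the first record only, so a real script would have to copy them into every record. A field whose original name already ends in an underscore and digits would also be misread as suffixed.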