Closed: sebastianw closed this issue 6 months ago
I just found the `dump_responses` option for the gnmi plugin and can now confirm that this seems to be a problem with multiple updates.
Checking the output file in prometheus format I see this:
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.478126592e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.478450176e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.479629824e+09
gnmi_sys_memory_used{host="tanagra",source="arista-switch"} 3.479527424e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.479810048e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.479003136e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.478552576e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.477671936e+09
gnmi_sys_memory_used{host="tanagra",path="/system/memory/state",source="arista-switch"} 3.47871232e+09
This corresponds to this debug output from telegraf:
2024-03-06T15:08:53Z D! [inputs.gnmi] Got update_1709737723565746678: {"update":{"timestamp":"1709737723565746678","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6357975040"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3478126592"}}]}}
2024-03-06T15:09:03Z D! [inputs.gnmi] Got update_1709737733566450753: {"update":{"timestamp":"1709737733566450753","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6358298624"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3478450176"}}]}}
2024-03-06T15:09:13Z D! [inputs.gnmi] Got update_1709737743568119333: {"update":{"timestamp":"1709737743568119333","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6359478272"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3479629824"}}]}}
2024-03-06T15:09:23Z D! [inputs.gnmi] Got update_1709737753565697718: {"update":{"timestamp":"1709737753565697718","update":[{"path":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"},{"name":"used"}]},"val":{"uintVal":"3479527424"}}]}}
2024-03-06T15:09:33Z D! [inputs.gnmi] Got update_1709737763565722269: {"update":{"timestamp":"1709737763565722269","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6359670784"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3479810048"}}]}}
2024-03-06T15:09:43Z D! [inputs.gnmi] Got update_1709737773566721549: {"update":{"timestamp":"1709737773566721549","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6358863872"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3479003136"}}]}}
2024-03-06T15:09:53Z D! [inputs.gnmi] Got update_1709737783565848683: {"update":{"timestamp":"1709737783565848683","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6358413312"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3478552576"}}]}}
2024-03-06T15:10:03Z D! [inputs.gnmi] Got update_1709737793566147217: {"update":{"timestamp":"1709737793566147217","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6357532672"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3477671936"}}]}}
2024-03-06T15:10:13Z D! [inputs.gnmi] Got update_1709737803565776294: {"update":{"timestamp":"1709737803565776294","prefix":{"elem":[{"name":"system"},{"name":"memory"},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"reserved"}]},"val":{"uintVal":"6358564864"}},{"path":{"elem":[{"name":"used"}]},"val":{"uintVal":"3478712320"}}]}}
@sebastianw thanks for the debugging effort, I will take a look this week...
@sebastianw can you please add `guess_path_tag = true` to your `[[inputs.gnmi]]` section (not to the subscriptions) and check if this fixes the issue?
@srebhan It does not fix the issue. I now see two different paths instead of one metric with a path and one without. Again an example; this should be one measurement:
curl -s http://localhost:9273/metrics | grep '^gnmi_sys_cpu_hardware_interrupt_min_time{.*index="ALL"'
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="ALL",path="/system/cpus/cpu/state",source="arista-switch"} 1.7098053835690132e+18
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="ALL",path="/system/cpus/cpu/state/hardware-interrupt",source="arista-switch"} 1.709805483568305e+18
And here are two of the measurements from the telegraf dump log:
2024-03-07T09:55:41Z D! [inputs.gnmi] Got update_1709805333566280930: {"update":{"timestamp":"1709805333566280930","update":[{"path":{"elem":[{"name":"system"},{"name":"cpus"},{"name":"cpu","key":{"index":"ALL"}},{"name":"state"},{"name":"hardware-interrupt"},{"name":"min-time"}]},"val":{"uintVal":"1709805333568034887"}}]}}
2024-03-07T09:55:51Z D! [inputs.gnmi] Got update_1709805343565718902: {"update":{"timestamp":"1709805343565718902","prefix":{"elem":[{"name":"system"},{"name":"cpus"},{"name":"cpu","key":{"index":"ALL"}},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"hardware-interrupt"},{"name":"min-time"}]},"val":{"uintVal":"1709805343567684412"}},{"path":{"elem":[{"name":"idle"},{"name":"avg"}]},"val":{"uintVal":"89"}},{"path":{"elem":[{"name":"idle"},{"name":"instant"}]},"val":{"uintVal":"90"}}]}}
@sebastianw Telegraf cannot know that this should be one measurement. Look at the data you receive from your device:
The first line has one element
prefix:
path: /system/cpus/cpu/state/hardware-interrupt/min-time
value: 1709805333568034887
while the second one has three elements
prefix: /system/cpus/cpu/state
path: hardware-interrupt/min-time
value: 1709805343567684412
and
prefix: /system/cpus/cpu/state
path: idle/avg
value: 89
and
prefix: /system/cpus/cpu/state
path: idle/instant
value: 90
For the path we usually use the `prefix` element, but as you can see your first data point doesn't have that. We have no information on what part of the path the "prefix" is, so we have to guess the "common" part by leaving out the last path element. For the other three elements of the second update we get explicit information about the "common" part in the prefix.
...
The only way to fix this would be to get information on what the common part is. I guess you want the "subscription" path. Is this correct?
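The guessing described above can be illustrated with a small sketch. This is not Telegraf's actual code; `guess_path_tag` is a hypothetical helper, paths are plain lists of element names, and list keys such as `index=ALL` are ignored for brevity. It only shows why the same leaf ends up with two different path tags.

```python
def guess_path_tag(prefix, path):
    """Return the value used for the path tag (hypothetical helper).

    prefix and path are lists of element names. An empty prefix means
    the device sent the full path inside the update itself.
    """
    if prefix:
        # The device told us the common part explicitly.
        return "/" + "/".join(prefix)
    # No prefix: guess the common part by dropping the last path element.
    return "/" + "/".join(path[:-1])


# Single-value update: no prefix, full path in the update.
single = guess_path_tag(
    [], ["system", "cpus", "cpu", "state", "hardware-interrupt", "min-time"]
)
# Multi-value update: prefix plus relative path.
batched = guess_path_tag(
    ["system", "cpus", "cpu", "state"], ["hardware-interrupt", "min-time"]
)

print(single)   # /system/cpus/cpu/state/hardware-interrupt
print(batched)  # /system/cpus/cpu/state
```

The two printed values match the two `path` labels seen in the Prometheus output earlier in this thread.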
@sebastianw can you please test the binary in #14951, available once CI has finished the tests successfully? Please use
[[inputs.gnmi]]
addresses = ["arista-switch:6030"]
username = "HIDDEN"
password = "HIDDEN"
redial = "60s"
tls_enable = true
insecure_skip_verify = true
path_guessing_strategy = "subscription"
[[inputs.gnmi.subscription]]
...
in your config to use the subscription path as the `path` tag if it is missing. Let me know if this fixes your issue!
@srebhan This is a standard Arista gNMI export. As I see it the "full path" in the two updates is the same for my measurement in question:
First update:
prefix("") + path("/system/cpus/cpu/state/hardware-interrupt/min-time") = /system/cpus/cpu/state/hardware-interrupt/min-time
Second update:
prefix("/system/cpus/cpu/state") + path("hardware-interrupt/min-time") = /system/cpus/cpu/state/hardware-interrupt/min-time
Plus the two additional measurements in the second update:
prefix("/system/cpus/cpu/state") + path("idle/avg") = /system/cpus/cpu/state/idle/avg
prefix("/system/cpus/cpu/state") + path("idle/instant") = /system/cpus/cpu/state/idle/instant
So I would think the path tag is always the "full" path, in this case these three for the different measurements:
/system/cpus/cpu/state/hardware-interrupt/min-time
/system/cpus/cpu/state/idle/avg
/system/cpus/cpu/state/idle/instant
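As a rough check of the concatenation above, the following sketch parses the two dumped updates from this thread and joins prefix and path elements. `full_paths` is a hypothetical helper written for this illustration, not part of Telegraf or pygnmi, and it drops list keys such as `index=ALL` just like the hand-written paths above.

```python
import json


def full_paths(notification):
    """Return [(absolute_path, value), ...] for one dumped gNMI notification."""
    # The prefix is optional; treat a missing prefix as an empty element list.
    prefix = [e["name"] for e in notification.get("prefix", {}).get("elem", [])]
    result = []
    for upd in notification["update"]:
        elems = prefix + [e["name"] for e in upd["path"]["elem"]]
        result.append(("/" + "/".join(elems), int(upd["val"]["uintVal"])))
    return result


# The two messages from the Telegraf dump log above.
first = json.loads('{"update":{"timestamp":"1709805333566280930","update":[{"path":{"elem":[{"name":"system"},{"name":"cpus"},{"name":"cpu","key":{"index":"ALL"}},{"name":"state"},{"name":"hardware-interrupt"},{"name":"min-time"}]},"val":{"uintVal":"1709805333568034887"}}]}}')["update"]
second = json.loads('{"update":{"timestamp":"1709805343565718902","prefix":{"elem":[{"name":"system"},{"name":"cpus"},{"name":"cpu","key":{"index":"ALL"}},{"name":"state"}]},"update":[{"path":{"elem":[{"name":"hardware-interrupt"},{"name":"min-time"}]},"val":{"uintVal":"1709805343567684412"}},{"path":{"elem":[{"name":"idle"},{"name":"avg"}]},"val":{"uintVal":"89"}},{"path":{"elem":[{"name":"idle"},{"name":"instant"}]},"val":{"uintVal":"90"}}]}}')["update"]

for path, value in full_paths(first) + full_paths(second):
    print(path, value)
```

Both messages yield `/system/cpus/cpu/state/hardware-interrupt/min-time` for the measurement in question, regardless of whether a prefix was sent.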
@sebastianw this might be true for Prometheus outputs, but other outputs (e.g. InfluxDB) support multiple fields per metric, so the path cannot be the full path. You can use the `canonical_field_names` option to get the full path as field names if this is what you are after.
@sebastianw can you please test the binary in #14951, available once CI has finished the tests successfully?
@srebhan I just tested this, and with `path_guessing_strategy = "subscription"` in the config the problem indeed seems to be gone. I now get full paths all the time:
curl -s http://localhost:9273/metrics | grep '^gnmi_sys_cpu_hardware_interrupt_min_time{.*'
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="0",path="/system/cpus/cpu/state",source="arista-switch"} 1.709815703566674e+18
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="1",path="/system/cpus/cpu/state",source="arista-switch"} 1.70981570356715e+18
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="2",path="/system/cpus/cpu/state",source="arista-switch"} 1.7098157035676063e+18
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="3",path="/system/cpus/cpu/state",source="arista-switch"} 1.7098157035680916e+18
gnmi_sys_cpu_hardware_interrupt_min_time{host="tanagra",index="ALL",path="/system/cpus/cpu/state",source="arista-switch"} 1.7098157035685952e+18
I also tested with `canonical_field_names` enabled, but that gives me no metrics at all, only errors like these:
2024-03-07T12:44:38Z E! [inputs.gnmi] Invalid empty path "/system/memory/state/reserved" with alias "/system/memory/state"
2024-03-07T12:44:38Z E! [inputs.gnmi] Invalid empty path "/system/memory/state/used" with alias "/system/memory/state"
So it seems your fix works, even though I don't think I understand the logic completely tbh. :)
@srebhan Okay, I think I got it now: the measurement "name" (for example `gnmi_sys_cpu_idle_avg` for Prometheus and `gnmi_sys_cpu,... idle/avg=90i` for InfluxDB) was always the "subscription name" as prefix and the full path minus the subscription path as suffix for Prometheus and measurement name for InfluxDB. The only difference with the new option is that the path tag is now filled with the subscription path?
I looked at the gNMI specification for Paths and it seems Arista is using it in the expected way.
In a number of messages, a prefix can be specified to reduce the lengths of path fields within the message. In this case, a prefix field is specified within a message - comprising of a valid path encoded according to Section 2.2.2. In the case that a prefix is specified, the absolute path is comprised of the concatenation of the list of path elements representing the prefix and the list of path elements in the path field.
So the prefix is there to reduce message length, and I think from a gNMI standpoint a measurement is uniquely identified by the full path (optional prefix + path elements). If we were only interested in Prometheus, I would suggest putting this full path in the path tag, but as you already said this would mess up InfluxDB and maybe other formats.
I wonder if there is a way (and value) to preserve the full path for Prometheus instead of just the subscription path, but at least for our specific application the proposed fix works fine.
@sebastianw thanks for testing! Regarding the logic: with the new option, the path is set to the `prefix` of the message if it exists; otherwise it will use the subscription path (with the option you set).
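The rule described here can be sketched in a few lines. This is only an illustration of the described behaviour, not the code from the PR; `path_tag` is a hypothetical helper, and the subscription path `/system/cpus/cpu/state` is an assumption chosen to match the output posted above.

```python
def path_tag(prefix_elems, subscription_path):
    """Pick the path tag: the message prefix if present, else the subscription path."""
    if prefix_elems:
        return "/" + "/".join(prefix_elems)
    # No prefix in the message: fall back to the configured subscription path.
    return subscription_path


# Assumed subscription path, not taken from the reporter's config.
SUBSCRIPTION_PATH = "/system/cpus/cpu/state"

single = path_tag([], SUBSCRIPTION_PATH)
batched = path_tag(["system", "cpus", "cpu", "state"], SUBSCRIPTION_PATH)

print(single)             # /system/cpus/cpu/state
print(single == batched)  # True
```

Under this assumption both update shapes get the same tag. If a device sent a prefix that differs from the subscription path, the tags could presumably still differ.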
The issue is not that the path is invalid, it's only that it's not consistent across updates. I think it was a bad choice to add the `path` tag at all, but that's a totally different thing...
For the invalid canonical field names error here I would love to get a response dump of one or two failing messages...
@sebastianw it would be nice if you could also test #14953 for the canonical-field-names. Please note, the PR there will not provide the paths until both PRs are merged...
@srebhan The canonical field name setting made all messages fail. I had 0 measurements, so you could test with any of the messages from the dumps posted in this issue. If you need more, let me know. With the binary from #14953 I get measurements with names like `gnmi_sys_cpu__system_cpus_cpu_state_wait_avg` and no failures.
Regarding the path tag, I wouldn't mind dropping it completely (maybe via a config option) in the input plugin. We tried to do that anyway in Prometheus post-processing but this confuses Prometheus when we have two measurements that only differ in the path name and clash after removal of the path (hence this issue here).
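To illustrate the clash mentioned here, the sketch below takes the two series from the earlier Prometheus output and removes the `path` label. The helpers `drop_label` and `duplicates` are hypothetical and only model label sets as dictionaries; they do not reflect how Prometheus relabeling is implemented.

```python
from collections import Counter

NAME = "gnmi_sys_cpu_hardware_interrupt_min_time"

# The two series from the Prometheus output earlier in this thread.
series = [
    (NAME, {"host": "tanagra", "index": "ALL",
            "path": "/system/cpus/cpu/state", "source": "arista-switch"}),
    (NAME, {"host": "tanagra", "index": "ALL",
            "path": "/system/cpus/cpu/state/hardware-interrupt",
            "source": "arista-switch"}),
]


def drop_label(series, label):
    """Return the series with one label removed from every label set."""
    return [(name, {k: v for k, v in labels.items() if k != label})
            for name, labels in series]


def duplicates(series):
    """Return the (name, labels) identities that occur more than once."""
    counts = Counter((name, tuple(sorted(labels.items())))
                     for name, labels in series)
    return [key for key, n in counts.items() if n > 1]


print(len(duplicates(series)))                      # 0
print(len(duplicates(drop_label(series, "path"))))  # 1
```

Before the label is dropped the series are distinct; afterwards they share one identity, which is the collision described above.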
Ok, so to summarize: the first PR fixes the `path` tag issue and you were able to confirm this. The second PR fixes the `canonical_field_names` issue and you were able to confirm this as well.
Thanks for testing!
What you can do is use canonical-field-names, drop the `path` tag via `tagexclude` (see metric filtering), and then use the regex processor to construct the metric and field names as you need them...
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.29.5, macOS Homebrew
Docker
No response
Steps to reproduce
Expected behavior
Measurements have consistent path attributes.
Actual behavior
Measurements sometimes are missing the path attribute. This leads for example to duplicate measurements in the prometheus exporter. The following should be one measurement, not two:
Checking the output file in prometheus format we can see that this changes from time to time for the same measurement:
Checking the same metric in the influxdb output file, it seems that the path only gets recorded when there are multiple measurements in one update (in this case `user/min` and `user/min_time`):
Additional info
Testing the same gNMI output from a Python script with pygnmi shows consistent measurements coming from the switch. The only difference is that updates with multiple measurements have a prefix and relative paths, while single-measurement updates have a full path: