Closed aned closed 1 month ago
I think you have a typo in the config: it's append-subscription-name and strings-as-labels, not append_subscription_name and strings_as_labels.
The output you shared is not from gNMIc, that's another tool.
Oh yeah, my bad, LLMs hallucinating big time :(.
With strings-as-labels: true it works as expected.
I'm still seeing subscription_name="sub1" in metrics even though append-subscription-name: false.
Are there any good config examples documented somewhere for networking boxes?
append-subscription-name appends the subscription name to the metric name, not as a label.
The subscription-name label is added by default. If you want to drop it, you can use an event-delete processor:
processors:
  delete-processor: # <- processor name
    event-delete: # <- processor type
      tag-names:
        - "subscription-name" # <- delete tags called `subscription-name`
outputs:
  prometheus:
    type: prometheus
    listen: :9273
    strings-as-labels: true
    event-processors:
      - delete-processor # <- processor name
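To make the name-versus-label distinction concrete, here is a rough Python sketch. It is not gNMIc code. It assumes the naming rule as I understand it from the Prometheus output docs (optional prefix, optional subscription name, then the path, with non-alphanumeric characters turned into underscores), and the helper names metric_name and drop_tags are made up for illustration.

```python
import re

def metric_name(value_path, subscription_name, append_subscription_name=False, prefix=""):
    """Approximate the metric name: prefix, optional subscription name, then the path."""
    parts = [prefix]
    if append_subscription_name:
        parts.append(subscription_name)
    parts.append(value_path)
    joined = "_".join(p.strip("/") for p in parts if p)
    # every character that is not alphanumeric or "_" becomes an underscore
    return re.sub(r"[^a-zA-Z0-9_]", "_", joined)

def drop_tags(tags, patterns):
    """Mimic an event-delete processor: remove tags whose name matches any pattern."""
    return {k: v for k, v in tags.items() if not any(re.search(p, k) for p in patterns)}

tags = {"source": "blah.com:6030", "subscription-name": "sub1", "interface_name": "Ethernet36/1"}
path = "/interfaces/interface/state/counters/in-octets"

name_plain = metric_name(path, "sub1", append_subscription_name=False)
name_appended = metric_name(path, "sub1", append_subscription_name=True)
remaining = drop_tags(tags, ["subscription-name"])
```

The flag only changes the metric name (name_plain versus name_appended); the subscription-name tag stays in the tag set until a processor removes it.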
Understood, thank you @karimra!
Sorry for bugging you again @karimra, would you have some pointers as to why I can't add the interface description as a label here? This is the raw output:
{
"openconfig-interfaces:admin-status": "DOWN",
"openconfig-interfaces:counters": {
"carrier-transitions": "1",
"in-broadcast-pkts": "0",
"in-discards": "0",
"in-errors": "0",
"in-fcs-errors": "0",
"in-multicast-pkts": "0",
"in-octets": "0",
"in-pkts": "0",
"in-unicast-pkts": "0",
"out-broadcast-pkts": "0",
"out-discards": "0",
"out-errors": "0",
"out-multicast-pkts": "0",
"out-octets": "0",
"out-pkts": "0",
"out-unicast-pkts": "0"
},
"openconfig-interfaces:description": "[AVAILABLE]",
"openconfig-interfaces:enabled": false,
"openconfig-platform-port:hardware-port": "Ethernet36-Port",
"openconfig-interfaces:ifindex": 36001,
"arista-intf-augments:inactive": false,
"openconfig-interfaces:last-change": "1710978986275730133",
"openconfig-interfaces:management": false,
"openconfig-interfaces:mtu": 0,
"openconfig-interfaces:name": "Ethernet36/1",
"openconfig-interfaces:oper-status": "NOT_PRESENT",
"openconfig-platform-transceiver:transceiver": "Ethernet36",
"openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}
I'm trying to add the interface description to all related counters via this config:
outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false
    strings-as-labels: true # Converts string values to labels
    timeout: 30s
    event-processors:
      - add-int-description
processors:
  add-int-description:
    event-value-tag:
      value-name: "/interfaces/interface/state/description"
      tag-name: "ifAlias"
      consume: false
      debug: false
but none of the related counter metrics get the label ifAlias.
Figured out that setting the cache expiration to -1 "fixes" it, but I'm not sure if it's the correct approach; I see a lot of stale warnings in debug mode.
2024/07/24 06:12:07.070613 /home/runner/work/gnmic/gnmic/pkg/cache/oc_cache.go:155: [cache:oc] failed to update gNMI cache: update is stale
Let's say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node?
How can I get the up metric per target, similar to what prometheus generates? Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?
targets:
  blah.com:
    address: blah.com:6030
subscriptions:
  sub1:
    paths:
      - "qos/interfaces/interface/output/queues/queue/state"
      - "/interfaces/interface[name=*]/state"
    stream-mode: sample
    sample-interval: 10s
    heartbeat-interval: 60s
    updates-only: false
outputs:
  prometheus:
    type: prometheus
    listen: :9273
    append-subscription-name: false
    strings-as-labels: true # Converts string values to labels
    timeout: 30s
    cache:
      type: oc
      expiration: -1
    event-processors:
      - rename-labels
      - add-int-description
      - delete-labels
      - rename-metrics
      - drop-metrics
processors:
  add-int-description:
    event-value-tag:
      value-name: "/interfaces/interface/state/description"
      tag-name: "ifAlias"
      consume: false # if true, remove value from original event when copying
      debug: false
  delete-labels:
    event-delete:
      tag-names:
        - "subscription-name"
        - "target"
  drop-metrics:
    event-drop:
      value-names:
        - ".*/state/(type|transceiver|physical.*|mtu|management|inactive|ifindex|hardware.*|enabled)"
  rename-metrics:
    event-strings:
      value-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "interfaces/interface/.*/description"
            new: "ifAlias"
  rename-labels:
    event-strings:
      tag-names:
        - ".*"
      transforms:
        - replace:
            apply-on: "name"
            old: "source"
            new: "alias"
        - replace:
            apply-on: "name"
            old: "interface_name"
            new: "ifName"
        - replace:
            apply-on: "name"
            old: ".*interface-id"
            new: "ifName"
I see a lot of stale warnings in the debug mode.
Your router is sending notifications with old timestamps or the router and the machine where gNMIc is running are not in sync.
Let's say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node?
Vague question, how many more metrics, how many paths, with which encoding and which sample interval? how "beefy" is a "beefy" node ? Only testing within your env will tell. It's probably easier and more manageable to run a cluster of gNMIc instances if you are going to subscribe to 1000s of routers, check here
How can I get the up metric per target, similar to what prometheus generates?
You can't.
Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?
gNMIc generates some internal metrics that can be scraped by Prometheus, you will be able to tell how many gNMI clients are connected, how many subscriptions there are per client, how many messages were received per client/per subscription.
Enable them under api-server:
api-server:
  address: :7890
  enable-metrics: true
The endpoint is /metrics
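One way to turn those internal counters into a rough "is this target still sending?" check is to compare two scrapes of /metrics and flag targets whose message counter did not move. The sketch below is only an illustration: the metric name example_received_messages_total and the label name source are placeholders, not gNMIc's real names, so check the actual /metrics output before reusing it.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def per_target(text, metric, label="source"):
    """Sum the samples of `metric` per value of `label` (placeholder label name)."""
    totals = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL.findall(m.group("labels")))
        if label in labels:
            target = labels[label]
            totals[target] = totals.get(target, 0.0) + float(m.group("value"))
    return totals

def silent_targets(before, after):
    """Targets whose counter did not increase between two scrapes."""
    return sorted(t for t in before if after.get(t, 0.0) <= before[t])

METRIC = "example_received_messages_total"  # hypothetical metric name
scrape1 = '''# HELP example_received_messages_total placeholder
example_received_messages_total{source="a.example:6030"} 10
example_received_messages_total{source="b.example:6030"} 5
'''
scrape2 = '''example_received_messages_total{source="a.example:6030"} 25
example_received_messages_total{source="b.example:6030"} 5
'''

silent = silent_targets(per_target(scrape1, METRIC), per_target(scrape2, METRIC))
```

In practice the same comparison is usually written as a Prometheus rate or increase expression over the scraped counter; the Python version just spells out the logic.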
And again, this is not a raw output:
{
"openconfig-interfaces:admin-status": "DOWN",
"openconfig-interfaces:counters": {
"carrier-transitions": "1",
"in-broadcast-pkts": "0",
"in-discards": "0",
"in-errors": "0",
"in-fcs-errors": "0",
"in-multicast-pkts": "0",
"in-octets": "0",
"in-pkts": "0",
"in-unicast-pkts": "0",
"out-broadcast-pkts": "0",
"out-discards": "0",
"out-errors": "0",
"out-multicast-pkts": "0",
"out-octets": "0",
"out-pkts": "0",
"out-unicast-pkts": "0"
},
"openconfig-interfaces:description": "[AVAILABLE]",
"openconfig-interfaces:enabled": false,
"openconfig-platform-port:hardware-port": "Ethernet36-Port",
"openconfig-interfaces:ifindex": 36001,
"arista-intf-augments:inactive": false,
"openconfig-interfaces:last-change": "1710978986275730133",
"openconfig-interfaces:management": false,
"openconfig-interfaces:mtu": 0,
"openconfig-interfaces:name": "Ethernet36/1",
"openconfig-interfaces:oper-status": "NOT_PRESENT",
"openconfig-platform-transceiver:transceiver": "Ethernet36",
"openconfig-interfaces:type": "iana-if-type:ethernetCsmacd"
}
I'm not sure where you are getting it from but it does not tell me what gNMI messages the router is sending. It looks like you are running a get RPC to print the output, a subscribe RPC may send data in separate notifications. That's why the processor couldn't add the interface description as a label, it received the stats and the description in separate notifications. Enabling caching is probably an overkill solution for that. Check this processor instead: https://gnmic.openconfig.net/user_guide/event_processors/event_starlark/#set-an-interface-description-as-a-tag
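The separate-notifications point can be simulated in a few lines of plain Python. This is a toy model, not gNMIc or Starlark code: events are plain dicts, and add_alias is a made-up helper. It shows that a processor that only sees one batch at a time cannot tag the counters, while one that remembers the last description per (source, interface) can, which is the idea behind the Starlark example linked above.

```python
DESC = "/interfaces/interface/state/description"

def add_alias(batch, seen, tag_name="ifAlias"):
    """Tag every event in the batch with the last description known for its interface."""
    def key(e):
        return (e["tags"].get("source"), e["tags"].get("interface_name"))
    # first pass: record any description carried by this batch
    for e in batch:
        if DESC in e["values"]:
            seen[key(e)] = e["values"][DESC]
    # second pass: copy the known description onto each event as a tag
    out = []
    for e in batch:
        tags = dict(e["tags"])
        if key(e) in seen:
            tags[tag_name] = seen[key(e)]
        out.append({"tags": tags, "values": e["values"]})
    return out

base = {"source": "blah.com:6030", "interface_name": "Ethernet36/1"}
desc_batch = [{"tags": base, "values": {DESC: "[AVAILABLE]"}}]
counter_batch = [{"tags": base, "values": {"/interfaces/interface/state/counters/in-octets": 0}}]

# Stateless: each batch starts with an empty memory, so the counters never see the description.
stateless = add_alias(desc_batch, {}) + add_alias(counter_batch, {})

# Stateful: one dict shared across batches, so the description outlives its notification.
memory = {}
stateful = add_alias(desc_batch, memory) + add_alias(counter_batch, memory)
```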
delete-labels:
  event-delete:
    tag-names:
      - "subscription-name"
      - "target"
If you remove the 'target' you won't be able to tell which target the metrics are for?
I see a lot of stale warnings in the debug mode.
Your router is sending notifications with old timestamps or the router and the machine where gNMIc is running are not in sync.
Without cache expiration: -1 I don't get any stale warnings.
Let's say this is my testing config (with some more metrics added eventually), would it scale for 100s or 1000s of targets running gnmic from a single beefy node?
Vague question, how many more metrics, how many paths, with which encoding and which sample interval? how "beefy" is a "beefy" node ? Only testing within your env will tell. It's probably easier and more manageable to run a cluster of gNMIc instances if you are going to subscribe to 1000s of routers, check here
Probably ~2k total metrics per device, ~10 paths, 30s scraping interval (how do I check which encoding is being used and how does it play a role?). 256G, 12cpu amd node. Thanks for the HA reference, will check it out, I'm still in testing stage.
How can I get the up metric per target, similar to what prometheus generates?
You can't.
Since this gnmic would be a single endpoint generating metrics for 100s of boxes, how would I know if any of the targets are down?
gNMIc generates some internal metrics that can be scraped by Prometheus, you will be able to tell how many gNMI clients are connected, how many subscriptions there are per client, how many messages were received per client/per subscription.
Enable them under api-server:
api-server:
  address: :7890
  enable-metrics: true
The endpoint is /metrics
Thank you!
And again, this is not a raw output:
{ "openconfig-interfaces:admin-status": "DOWN", "openconfig-interfaces:counters": { "carrier-transitions": "1", "in-broadcast-pkts": "0", "in-discards": "0", "in-errors": "0", "in-fcs-errors": "0", "in-multicast-pkts": "0", "in-octets": "0", "in-pkts": "0", "in-unicast-pkts": "0", "out-broadcast-pkts": "0", "out-discards": "0", "out-errors": "0", "out-multicast-pkts": "0", "out-octets": "0", "out-pkts": "0", "out-unicast-pkts": "0" }, "openconfig-interfaces:description": "[AVAILABLE]", "openconfig-interfaces:enabled": false, "openconfig-platform-port:hardware-port": "Ethernet36-Port", "openconfig-interfaces:ifindex": 36001, "arista-intf-augments:inactive": false, "openconfig-interfaces:last-change": "1710978986275730133", "openconfig-interfaces:management": false, "openconfig-interfaces:mtu": 0, "openconfig-interfaces:name": "Ethernet36/1", "openconfig-interfaces:oper-status": "NOT_PRESENT", "openconfig-platform-transceiver:transceiver": "Ethernet36", "openconfig-interfaces:type": "iana-if-type:ethernetCsmacd" }
I'm not sure where you are getting it from but it does not tell me what gNMI messages the router is sending. It looks like you are running a get RPC to print the output, a subscribe RPC may send data in separate notifications. That's why the processor couldn't add the interface description as a label, it received the stats and the description in separate notifications. Enabling caching is probably an overkill solution for that. Check this processor instead: https://gnmic.openconfig.net/user_guide/event_processors/event_starlark/#set-an-interface-description-as-a-tag
Yes, that was the output of the gnmi get CLI; I thought it would be the same as subscribe.
Will definitely check out event_starlark, caching seems to eat up way more resources.
delete-labels:
  event-delete:
    tag-names:
      - "subscription-name"
      - "target"
If you remove the 'target' you won't be able to tell which target the metrics are for?
Some metrics had the same label value in source and target, which seemed like a duplicate; I will pay attention to whether it causes any issues.
@aned in your pastings you had
./gnmi -addr blah.com:6030 get "/interfaces/interface[name=*]/openconfig-interfaces:config"
Note that this tool, ./gnmi, is not gnmic. Outputs from this tool don't help to correlate what you get with it and gnmic outputs.
I suggest you use gnmic solely.
Without cache expiration: -1 i don't get any stale warnings.
Do you mean With? -1 disables cache expiration, so it disables timestamp checking; it doesn't mean the stale issue is gone. By setting expiration: -1 you will get old values at every scrape (e.g. deleted interfaces).
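A toy model of that effect, in Python. This is not gNMIc's cache implementation; ValueCache is a made-up class that only illustrates why an entry that is never refreshed keeps showing up at every scrape once expiration is disabled.

```python
class ValueCache:
    """Minimal cache keyed by path; a negative expiration means entries never expire."""

    def __init__(self, expiration):
        self.expiration = expiration  # seconds
        self.entries = {}             # key -> (value, time of last update)

    def update(self, key, value, now):
        self.entries[key] = (value, now)

    def snapshot(self, now):
        """What a scrape at time `now` would return."""
        if self.expiration < 0:
            return {k: v for k, (v, _) in self.entries.items()}
        return {k: v for k, (v, ts) in self.entries.items() if now - ts <= self.expiration}

def scrape_after_interface_removed(expiration):
    cache = ValueCache(expiration)
    cache.update("Ethernet1/in-octets", 10, now=0)
    cache.update("Ethernet2/in-octets", 20, now=0)
    # Ethernet2 is deleted on the device: only Ethernet1 keeps being updated
    cache.update("Ethernet1/in-octets", 15, now=100)
    return cache.snapshot(now=120)

with_expiry = scrape_after_interface_removed(60)
never_expire = scrape_after_interface_removed(-1)
```

With a 60s expiration the deleted interface drops out of the scrape; with -1 its last value is served indefinitely.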
Probably ~2k total metrics per device, ~10 paths, 30s scraping interval.
What I mean here is that everyone's case is different, it's hard to give a scaling reference. There is no way to know how many individual single metrics are behind a path, there could be multiple containers and lists behind a path.
how do I check which encoding is being used and how does it play a role?
If you didn't specify an encoding, it uses JSON by default. The encoding determines the type of the returned values. While JSON and JSON_IETF allow the router to send values together as a JSON object (potentially lower message rate, higher message size), other encodings send each value as a single update (potentially higher message rate, lower message size); multiple updates can still be bundled in a single SubscribeResponse. You should check what works best for your devices and the metrics you are collecting.
Start small, add a few devices and extrapolate from there.
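To picture the encoding difference, the sketch below takes one update that carries a JSON object and expands it into the per-leaf updates that a scalar-style encoding would send instead. It is a simplified illustration with plain dicts, not real gNMI messages, and the helper name leaves is made up.

```python
import json

def leaves(path, value):
    """Yield (path, value) for every leaf under a nested JSON object."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from leaves(f"{path}/{k}", v)
    else:
        yield path, value

# JSON-style: one update, one JSON blob holding several counters
json_update = {
    "path": "/interfaces/interface[name=Ethernet36/1]/state/counters",
    "val": json.dumps({"in-octets": "0", "out-octets": "0", "in-errors": "0"}),
}

# Per-leaf style: the same data as one update per value
per_leaf_updates = [
    {"path": p, "val": v}
    for p, v in leaves(json_update["path"], json.loads(json_update["val"]))
]
```

Same data either way; what changes is how many updates the collector has to process and how large each one is.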
Understood, thank you for your help guys!
I'm using this config
but I'm getting only 3 prometheus metrics out
Found this https://github.com/openconfig/gnmic/issues/343#issuecomment-1883649487 and trying to understand how to get string values into prometheus as labels, i.e this is raw get output from that path
Is there a workaround to get string values as labels into prometheus, something like this (for the example above):
Basically the string values need to go in as labels and the actual prometheus value should be set to 1. In many cases, the metrics without those strings as labels are useless, for example without the interface description I can't do any proper query in prometheus, matching only specific interface descriptions, etc.
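The shape being asked for is the usual "info metric" pattern: string values become labels and the sample value is a constant 1. A small Python sketch of that transformation follows. It is only an illustration of the idea, not something gNMIc does out of the box as far as this thread shows; to_info_sample and the metric name interface_info are made up, and label values are not escaped.

```python
def to_info_sample(name, tags, values):
    """Render one Prometheus text line with string values as labels and value 1."""
    labels = dict(tags)
    for path, v in values.items():
        if isinstance(v, str):
            # use the last path element as the label name, e.g. oper-status -> oper_status
            labels[path.rsplit("/", 1)[-1].replace("-", "_")] = v
    rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{rendered}}} 1"

sample = to_info_sample(
    "interface_info",  # hypothetical metric name
    {"interface_name": "Ethernet36/1"},
    {
        "/interfaces/interface/state/description": "[AVAILABLE]",
        "/interfaces/interface/state/oper-status": "NOT_PRESENT",
        "/interfaces/interface/state/mtu": 0,  # numeric, so not turned into a label
    },
)
```

A series like this can then be joined to the counters in PromQL on the interface label, which is how description-based filtering is typically done.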