I see you went all out with the processors :)
The first obvious change I would make is moving the drop-metrics processor up in the list of processors under the output. If the event messages are going to be dropped, they don't need to travel down the rest of the processor pipeline, unless you are using them to enrich other values (I didn't see that in the starlark processors).
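For illustration, a minimal sketch of that reordering, assuming a prometheus output named prom and using processor names from your config (adjust to the actual names in the gist):
outputs:
  prom:
    type: prometheus
    event-processors:
      # drop first, so dropped events never reach the heavier processors
      - drop-metrics
      - rename-metrics
      - rename-metrics-arista-ngbb
      - rename-labels-interface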
As for the lldp subscription: if it's not going to change much, use an on-change subscription (if the router supports it).
lldp:
  paths:
    - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
    - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
  stream-mode: on-change
How much of a heavy load are we talking about? I see you enabled the api-server metrics and you have a Prometheus server.
api-server:
  address: :7890
  enable-metrics: true
Do you have a target definition for gnmic:7890/metrics? With that we will be able to see how much (and what) is being used.
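If not, something along these lines on the Prometheus side would do it (the job name and host:port are just examples):
scrape_configs:
  - job_name: gnmic
    static_configs:
      # gNMIc api-server with enable-metrics: true
      - targets: ["gnmic:7890"]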
A few more optimisations:
1)
This processor matches ALL of your events and runs the old regex on all of their value names.
rename-metrics:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"
If you know exactly what you are going to replace, set it in the matching section, not in the transform:
rename-metrics:
  event-strings:
    value-names:
      - "interfaces/interface/.*/description"
    transforms:
      - replace:
          apply-on: "name"
          old: "interfaces/interface/.*/description"
          new: "ifAlias"
2)
Same for this processor.
rename-metrics-arista-ngbb:
  event-strings:
    value-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
          new: "LatencyAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/jitter"
          new: "JitterAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memTotal"
          new: "MemTotalAristaXBR"
      - replace:
          apply-on: "name"
          old: ".*meminfo/memAvailable"
          new: "MemAvailableAristaXBR"
      - replace:
          apply-on: "name"
          old: "/queues/queue"
          new: "_queue"
      - trim-prefix:
          apply-on: "name"
          prefix: "/interfaces"
      - trim-prefix:
          apply-on: "name"
          prefix: "/qos/interfaces"
All these transforms are independent of each other, but the transforms in a single event-strings processor are applied to all the event messages in sequence. So I would create a separate processor for each one.
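For example, the first two transforms could become their own processors, each matching only the value names it actually rewrites (the processor names here are just suggestions):
rename-packet-loss-arista:
  event-strings:
    value-names:
      - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/packetLoss"
          new: "PacketLossAristaXBR"
rename-latency-arista:
  event-strings:
    value-names:
      - ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
    transforms:
      - replace:
          apply-on: "name"
          old: ".*connectivityMonitor/status/hostStatus/.*/defaultStats/latency"
          new: "LatencyAristaXBR"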
3)
In this processor, the old tag names are well known:
rename-labels-interface:
  event-strings:
    tag-names:
      - ".*"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"
I would place them in the tag-names field, or even create a processor for each one:
rename-labels-interface:
  event-strings:
    tag-names:
      - "source"
      - "interface_name"
      - ".*interface-id"
    transforms:
      - replace:
          apply-on: "name"
          old: "source"
          new: "alias"
      - replace:
          apply-on: "name"
          old: "interface_name"
          new: "ifName"
      - replace:
          apply-on: "name"
          old: ".*interface-id"
          new: "ifName"
There are a couple more processors like this; I think you get the idea. You can save a lot by skipping a few regex evaluations (multiplied over 30 routers).
Got it, thanks for the input! I went from 17 to 32 targets and updated the config as suggested above. The load seems reasonable now; I'll do more tweaking, but it's got potential!
How does num-workers: 5 affect things in the outputs configuration?
It defines the number of parallel routines reading gNMI notifications from the target's buffer and converting them into Prometheus metrics. It's supposed to help deal with a high rate of notifications. Looking at the dashboards you shared, I think you might benefit from more workers; it would reduce the total number of goroutines you have running.
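For reference, it sits directly under the output definition, something like this (the output name and listen address are placeholders):
outputs:
  prom:
    type: prometheus
    listen: :9804
    # parallel goroutines converting gNMI notifications to Prometheus metrics
    num-workers: 5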
Understood, I bumped it to 10 and I'm seeing some marginal improvements. What's the "recommended" number of workers, or how do I find the optimal number?
There is no recommended number really. It depends on the pattern (rate and size) of the updates you are getting. I would aim at lowering the number of goroutines running and keeping it stable over multiple sample intervals. The optimal number also depends on whether you are optimizing for memory or CPU: if you want to reduce memory usage, add more workers so that notifications are not sitting in memory waiting to be processed; if you want to reduce CPU usage, reduce the number of workers, but you will be using more memory.
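Assuming the standard Go runtime metrics are exposed on the api-server /metrics endpoint, a simple way to watch this while tuning num-workers is:
go_goroutines{job=~"$job_name"}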
Got it! In terms of monitoring targets (ones we can't subscribe to due to auth issues, potential ACL issues, etc.), from what I can see in the api-server /metrics endpoint
api-server:
  address: :7890
  enable-metrics: true
I could only use something like
sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{job=~"$job_name"}[2m])) == 0
but it looks like this metric disappears for a specific source once gnmic can't connect to it anymore. How do you folks monitor it?
This could be used:
rate(grpc_client_handled_total{job=~"$job_name"}[2m]) > x
but it doesn't tell me which target is erroring.
Currently, this is your best bet:
sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{job=~"$job_name"}[2m])) == 0
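As a sketch, that expression can be wrapped in an alerting rule; the job selector and timings here are placeholders (in a rule file you would hardcode the job instead of using the Grafana variable):
groups:
  - name: gnmic
    rules:
      - alert: GnmicTargetSilent
        # fires when a source stops producing subscribe responses while its series still exists
        expr: sum by (source) (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total{job="gnmic"}[2m])) == 0
        for: 5m
        annotations:
          summary: "No subscribe responses received from {{ $labels.source }}"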
Can't that metric default to zero if it's not returned?
No, once the box becomes unreachable via gNMI, all those metrics disappear; they don't become 0. It would only work if the gNMI connection stays up.
There is some sort of temporary workaround here: https://github.com/openconfig/gnmic/issues/419#issuecomment-2288642468
Raised a feature request: #513.
I'm testing this config file: https://gist.github.com/aned/8b68e77791dc3bb9eeda903ce54e1643
After adding ~30 targets, I'm seeing some pretty heavy load on the server. Are there any obvious improvements I can make in the config?
In this section I'm caching lldp. It doesn't change much, so there's no need to do 30s updates. Would it break things if sample-interval is set to something like 1h, or does it need to be done via cache expiration?
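For reference, what I mean is bumping the subscription to something like this (a sketch based on the lldp subscription above, kept in sample mode):
lldp:
  paths:
    - "/lldp/interfaces/interface/neighbors/neighbor/state/system-name"
    - "/lldp/interfaces/interface/neighbors/neighbor/state/port-id"
  stream-mode: sample
  # bumped from 30s; durations use Go-style notation
  sample-interval: 1h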