Open Cougar opened 1 year ago
Maybe https://github.com/prometheus/snmp_exporter/pull/624 is what you're looking for.
This looks different but is still a nice feature, especially the dynamic mode. The static one looks like a simplification of the config file. It is already possible to list all interesting OIDs with their index in the `get` list instead of asking for a full walk.
However, I don't see how it can help to fully avoid SNMP requests, which in our optimized case (a long list of `get` values) still take up to 10 sec per device per collection and ALWAYS return the same data.
"Static labeling", requested in #128, sounds similar to me but was never discussed further.
Another solution would be to use some other data source, like the Node exporter textfile collector, in parallel to save this data and later combine it in PromQL as described here. However, we know that combining metrics is quite a challenge, especially if they come from multiple sources with different timestamps.
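To make the textfile collector idea concrete, here is a minimal sketch of writing the never-changing interface labels as an "info" style metric into a `.prom` file. The metric name `interface_static_info`, the label names and the sample values are assumptions made up for illustration; nothing here comes from snmp_exporter itself. The file is written to a temporary name and then renamed, so that the collector is unlikely to read a half-written file.

```python
import os
import tempfile

def escape_label_value(value):
    """Escape backslash, double quote and newline for the text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

def render_info_metric(name, rows):
    """Render one sample with value 1 per row; each row is a dict of labels."""
    lines = [f"# HELP {name} Static interface labels (value is always 1).",
             f"# TYPE {name} gauge"]
    for labels in rows:
        pairs = ",".join(f'{k}="{escape_label_value(v)}"'
                         for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{pairs}}} 1")
    return "\n".join(lines) + "\n"

def write_prom_file(directory, filename, text):
    """Write to a temp file in the same directory, then rename it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the collector may run as another user
    os.replace(tmp_path, os.path.join(directory, filename))

# Hypothetical sample data: ifIndex -> ifName pairs that never change.
rows = [{"ifIndex": "1", "ifName": "ge-0/0/0"},
        {"ifIndex": "2", "ifName": "ge-0/0/1"}]
text = render_info_metric("interface_static_info", rows)
```

The resulting series could then be joined onto the SNMP metrics in PromQL on a shared label such as `ifIndex`, which is where the timestamp and multi-source difficulties mentioned above come in.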
You're actually pretty fortunate if your interface OIDs don't change, since service provider environments are pretty dynamic and will have virtual interfaces come and go over time, hence the need to walk ifAlias or ifDescr to correlate the indices.
Walking a full MIB on large routers / switches at regular (and short) intervals is never particularly advisable, due to the substantial CPU load it can cause. From my own experience, even relatively small (e.g. 48 port) Juniper virtual chassis can take over 30 seconds once you start walking more than 10 or so OIDs, simply due to the number of ports.
For something completely different (and probably more scalable), you might want to look into something like Junos Telemetry Interface, or Cisco Streaming Telemetry depending on what vendor your equipment is. A growing number of vendors are supporting gRPC-based telemetry, as they run into scalability / performance limitations of SNMP.
Just an idea how to save a lot of network equipment CPU and collect metrics faster / more often.
The current situation is as follows. We'd like to collect data at a 30 sec or shorter interval. One kind of router that we monitor has up to ~4600 interfaces in `ifTable`. It takes around 30-45 seconds just to bulk walk through `ifName` and around the same amount of time to read the `ifDescr` entries.
As we need metrics from only around 200 interfaces, we just list all these interfaces under the `get` section instead of doing a `walk` over all 4600. This is much faster: every 25 OIDs take between 200-700 ms depending on router load. In total it takes up to 10 sec, which is much better but still a lot of time.
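As a rough illustration of the arithmetic, the sketch below splits a list of OIDs into batches of 25 and multiplies the number of requests by the 200-700 ms per-batch timing quoted above. The OID count (200 interfaces with two counters each) and the function names are assumptions for the example only, and it assumes batches are sent one after another.

```python
from math import ceil

def batch_oids(oids, batch_size=25):
    """Split a list of OIDs into consecutive batches of at most batch_size."""
    return [oids[i:i + batch_size] for i in range(0, len(oids), batch_size)]

def scrape_time_range_ms(n_oids, batch_size=25, fast_ms=200, slow_ms=700):
    """Return (best, worst) total time in ms for sequential batched GETs."""
    n_requests = ceil(n_oids / batch_size)
    return n_requests * fast_ms, n_requests * slow_ms

# Assumption: 200 interfaces x 2 columns (ifHCInOctets = .6, ifHCOutOctets = .10).
oids = [f"1.3.6.1.2.1.31.1.1.1.{col}.{idx}"
        for idx in range(1, 201) for col in (6, 10)]
batches = batch_oids(oids)

print(len(batches))                     # 16
print(scrape_time_range_ms(len(oids)))  # (3200, 11200)
```

Every OID that does not have to be requested shrinks the number of batches, which is why removing the never-changing values from the `get` list shortens the scrape.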
Most entries in those tables are always the same and never change (the index is a combination of the chassis/card/slot/port numbers). Still, some entries differ between boxes (user-configured dynamic interfaces like LAGs, loopbacks or tunnel interfaces).
I came up with a solution where we can define those never-changing OID values in the configuration file and skip SNMP collection for them. They can still be used for labels, which is very important for grouping metrics later.
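The idea can be sketched as follows. This is not the code from the POC (which lives in the linked commit); it is a minimal illustration in which `split_static`, `collect` and the `snmp_get` callback are hypothetical names. OIDs with a value defined in the configuration are answered locally, and only the remaining ones are sent to the device.

```python
IF_NAME = "1.3.6.1.2.1.31.1.1.1.1"          # ifName column
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"  # ifHCInOctets column

def split_static(wanted_oids, static_values):
    """Partition OIDs into those answered from config and those needing SNMP."""
    static_hits = {oid: static_values[oid]
                   for oid in wanted_oids if oid in static_values}
    to_fetch = [oid for oid in wanted_oids if oid not in static_values]
    return static_hits, to_fetch

def collect(wanted_oids, static_values, snmp_get):
    """Merge configured static values with values fetched via snmp_get."""
    static_hits, to_fetch = split_static(wanted_oids, static_values)
    results = dict(static_hits)
    if to_fetch:  # skip the network round trip entirely if nothing is left
        results.update(snmp_get(to_fetch))
    return results

# Hypothetical config: the name of interface index 1 never changes.
static_values = {f"{IF_NAME}.1": "ge-0/0/0"}
wanted = [f"{IF_NAME}.1", f"{IF_HC_IN_OCTETS}.1"]

requested = []
def fake_snmp_get(oids):
    """Stand-in for a real SNMP GET; records which OIDs were requested."""
    requested.extend(oids)
    return {oid: 12345 for oid in oids}

result = collect(wanted, static_values, fake_snmp_get)
```

In this example only the counter OID reaches `fake_snmp_get`, while the interface name comes from the configured value and remains available for use as a label.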
`snmp_scrape_duration_seconds` dropped from around 21 seconds to around 11 seconds after we replaced the `get` list of OIDs with static values.
The POC implementation itself is very simple (only 10 LOC) and can be seen here: https://github.com/Cougar/snmp_exporter/commit/ad1344c93afcca8d174e1389ccceb0ef64290ee3
Configuration looks like this (the names are not unique but this is fine):
My POC does not contain tests yet, I'm sure the naming convention is not the best, and I don't yet have any idea how to fit it into the generator.
It is just an idea with a POC for anyone else who finds it usable. I'm going to maintain this feature in my fork for now, but I would be very happy if something like this could eventually end up in the upstream code. All comments, suggestions and other ideas from people in a similar situation are very welcome!