Define constant OID values in config instead of collecting via SNMP

Cougar commented 1 year ago

Just an idea for saving a lot of CPU on network equipment and collecting metrics faster or more often.

The current situation is as follows. We'd like to collect data at an interval of 30 seconds or shorter. One kind of router that we monitor has up to ~4600 interfaces in ifTable.

It takes around 30-45 seconds just to bulk-walk ifName, and about the same amount of time to read the ifDescr entries.

As we need metrics from only around 200 interfaces, we list just those interfaces under the get section instead of walking all 4600.
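
As a rough illustration of that step, the get list can be generated from the ifIndex values of interest. The column OIDs for ifDescr and ifName are the ones used in this issue; the helper name `build_get_list` is hypothetical and exists only for this sketch.

```python
# Column OIDs as used in this issue (ifDescr and ifName).
IF_DESCR = "1.3.6.1.2.1.2.2.1.2"
IF_NAME = "1.3.6.1.2.1.31.1.1.1.1"


def build_get_list(if_indexes, columns=(IF_DESCR, IF_NAME)):
    """Return one fully indexed OID per (column, ifIndex) pair, column by column."""
    return [f"{column}.{index}" for column in columns for index in if_indexes]


if __name__ == "__main__":
    # Sample ifIndex values taken from the get section in this issue.
    for oid in build_get_list([35848192, 35880960]):
        print(f"  - {oid}")
```

The output lines have the same shape as the entries under the get section, so they could be pasted into a module by hand or by a small script.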

This is much faster: every 25 OIDs take between 200 and 700 ms, depending on router load. In total it takes up to 10 seconds, which is much better but still a long time.
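
A back-of-the-envelope check of these figures, as a sketch only: it assumes the requests are sent one after another, that each carries at most 25 OIDs, and that about 400 OIDs are requested (200 interfaces with two name columns each). The OID count and the function name `estimate_get_time_ms` are assumptions, not measurements.

```python
import math


def estimate_get_time_ms(num_oids, oids_per_request=25, min_ms=200, max_ms=700):
    """Return (requests, best_case_ms, worst_case_ms) for sequential GET requests."""
    requests = math.ceil(num_oids / oids_per_request)
    return requests, requests * min_ms, requests * max_ms


if __name__ == "__main__":
    # Assumption: ~200 interfaces with two name columns each (ifName, ifDescr).
    requests, best, worst = estimate_get_time_ms(200 * 2)
    print(f"{requests} requests, {best / 1000:.1f}-{worst / 1000:.1f} s")
```

Under these assumptions that is 16 requests and roughly 3 to 11 seconds, which is in the same range as the "up to 10 seconds" observed.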

Most entries in those tables are always the same and never change (the index is built from the chassis/card/slot/port number). Still, some differ between boxes (user-configured dynamic interfaces such as LAGs, loopbacks or tunnel interfaces).

I came up with a solution where we define those never-changing OID values in the configuration file and skip SNMP collection for them. They can still be used for labels, which is very important for grouping metrics later.

Here is how snmp_scrape_duration_seconds changed, from around 21 seconds to around 11 seconds, after we replaced the get list of OIDs with static values:

[Image: graph of snmp_scrape_duration_seconds (2022-12-05_013454_474234990)]

The POC implementation itself is very simple (only 10 lines of code) and can be seen here: https://github.com/Cougar/snmp_exporter/commit/ad1344c93afcca8d174e1389ccceb0ef64290ee3

The configuration looks like this (the names are not unique, but this is fine):

modulename:
  staticoids:
  # ifDescr
    - name: 1.3.6.1.2.1.2.2.1.2.10300101
      value: MAC Domain - EPON Port 1/1/1/1
    - name: 1.3.6.1.2.1.2.2.1.2.10300110
      value: Downstream - 1G EPON Port 1/1/1/1
    - name: 1.3.6.1.2.1.2.2.1.2.10300111
      value: Downstream - 10G EPON Port 1/1/1/1
    - name: 1.3.6.1.2.1.2.2.1.2.10300120
      value: Upstream - 1G EPON Port 1/1/1/1
    - name: 1.3.6.1.2.1.2.2.1.2.10300121
      value: Upstream - 10G EPON Port 1/1/1/1
… 328 more OIDs
  # ifName
    - name: 1.3.6.1.2.1.31.1.1.1.1.10300101
      value: Ca1/1/1/1
    - name: 1.3.6.1.2.1.31.1.1.1.1.10300110
      value: Ca1/1/1/1-downstream
    - name: 1.3.6.1.2.1.31.1.1.1.1.10300111
      value: Ca1/1/1/1-downstream
    - name: 1.3.6.1.2.1.31.1.1.1.1.10300120
      value: Ca1/1/1/1-upstream
    - name: 1.3.6.1.2.1.31.1.1.1.1.10300121
      value: Ca1/1/1/1-upstream
… 328 more OIDs
  get:
  # ifDescr - names that could be different in different routers
  - 1.3.6.1.2.1.2.2.1.2.35848192
  - 1.3.6.1.2.1.2.2.1.2.35880960
  - 1.3.6.1.2.1.2.2.1.2.35913728
  - 1.3.6.1.2.1.2.2.1.2.35946496
  # ifName - names that could be different in different routers
  - 1.3.6.1.2.1.31.1.1.1.1.35848192
  - 1.3.6.1.2.1.31.1.1.1.1.35880960
  - 1.3.6.1.2.1.31.1.1.1.1.35913728
  - 1.3.6.1.2.1.31.1.1.1.1.35946496
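
To make the intent of this configuration concrete, here is a minimal sketch of the idea in Python. It is not the Go code from the POC: the static values seed the result set as if the device had returned them, and only the remaining OIDs are requested. The `fetch` callable is a hypothetical stand-in for a real SNMP GET, and skipping get OIDs that already have a static value is a choice made for this sketch.

```python
def collect(static_oids, get_oids, fetch):
    """Combine configured static values with values fetched over SNMP.

    static_oids: list of {"name": oid, "value": text} entries, as in the config above.
    get_oids:    OIDs that still have to be requested from the device.
    fetch:       hypothetical callable taking a list of OIDs and returning {oid: value}.
    """
    results = {entry["name"]: entry["value"] for entry in static_oids}
    # Only ask the device for OIDs that are not already covered by static values.
    missing = [oid for oid in get_oids if oid not in results]
    if missing:
        results.update(fetch(missing))
    return results


def label_for(results, column_oid, if_index):
    """Look up a label value (e.g. ifName) for one interface index."""
    return results.get(f"{column_oid}.{if_index}")


if __name__ == "__main__":
    static = [{"name": "1.3.6.1.2.1.31.1.1.1.1.10300101", "value": "Ca1/1/1/1"}]
    dynamic = ["1.3.6.1.2.1.31.1.1.1.1.35848192"]

    def fake_fetch(oids):
        # Stand-in for an SNMP GET; the returned name is made up for this demo.
        return {oid: "example-lag" for oid in oids}

    merged = collect(static, dynamic, fake_fetch)
    print(label_for(merged, "1.3.6.1.2.1.31.1.1.1.1", 10300101))
    print(label_for(merged, "1.3.6.1.2.1.31.1.1.1.1", 35848192))
```

Because static and fetched values end up in the same mapping, label lookups do not need to know where a value came from.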

My POC does not contain tests yet, I'm sure the naming convention is not the best, and I have no idea yet how to integrate it into the generator.

It is just an idea with a POC for anyone else who finds it useful. I'm going to maintain this feature in my fork for now, but I would be very happy if something like this could eventually end up in the upstream code. All comments, suggestions and other ideas for similar situations are very welcome!

SuperQ commented 1 year ago

Maybe https://github.com/prometheus/snmp_exporter/pull/624 is what you're looking for.

Cougar commented 1 year ago

This looks different but is still a nice feature, especially the dynamic mode. The static one looks like a simplification of the config file. It is already possible to list all interesting OIDs with their index in the get list instead of requesting a full walk.

However, I don't see how it can help to fully avoid SNMP requests, which in our optimized case (a long list of get values) still take up to 10 seconds per device for each collection and ALWAYS return the same data.

The "static labeling" requested in #128 sounds similar to me but was never discussed further.

Another solution would be to use some other data source, such as the Node exporter textfile collector, in parallel to store this data and later combine it in PromQL as described here. However, we know that combining metrics is quite a challenge, especially if they come from multiple sources with different timestamps.
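
For comparison, the textfile alternative could look roughly like the sketch below, which renders the static names as info-style samples in the Prometheus text exposition format. The metric name `interface_static_info` and both function names are hypothetical; the output would be written to a `.prom` file in the directory read by the textfile collector.

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_interface_info(interfaces, metric="interface_static_info"):
    """Render one info-style sample (value 1) per interface.

    interfaces: {ifIndex: {"ifName": ..., "ifDescr": ...}}
    metric:     hypothetical metric name chosen for this sketch.
    """
    lines = [f"# TYPE {metric} gauge"]
    for if_index, names in sorted(interfaces.items()):
        labels = {"ifIndex": str(if_index), **names}
        rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
        lines.append(f"{metric}{{{rendered}}} 1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the staticoids example in this issue.
    print(render_interface_info({
        10300101: {"ifName": "Ca1/1/1/1", "ifDescr": "MAC Domain - EPON Port 1/1/1/1"},
    }), end="")
```

Joining such a series onto the SNMP metrics would still have to happen in PromQL, so the concern about mismatched sources and timestamps applies to this approach as well.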

dswarbrick commented 1 year ago

You're actually pretty fortunate if your interface OIDs don't change, since service provider environments are pretty dynamic and will have virtual interfaces come and go over time, hence the need to walk ifAlias or ifDescr to correlate the indices.

Walking a full MIB on large routers / switches at regular (and short) intervals is never particularly advisable, due to the substantial CPU load it can cause. From my own experience, even relatively small (e.g. 48 port) Juniper virtual chassis can take over 30 seconds once you start walking more than 10 or so OIDs, simply due to the number of ports.

For something completely different (and probably more scalable), you might want to look into something like Junos Telemetry Interface, or Cisco Streaming Telemetry depending on what vendor your equipment is. A growing number of vendors are supporting gRPC-based telemetry, as they run into scalability / performance limitations of SNMP.

prometheus / snmp_exporter

Define constant OID values in config instead of collecting via SNMP #825