openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Prometheus graphs traffic drops regularly #169

Open jasonyates-thg opened 1 year ago

jasonyates-thg commented 1 year ago

I'm using gnmic to monitor various Juniper MX devices; I'm subscribing to interface metrics with a sample interval of 10s. My output is prometheus, which is configured to scrape every 20s.
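For reference, the subscription side of my config looks roughly like this (the subscription name and path here are illustrative, not copied verbatim from my config):

subscriptions:
  juniper-interfaces:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 10s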

Once I graph my data, there is a routine drop in traffic across every device and every interface roughly every 60 minutes. At this point I'm unsure if it's something the Juniper device is doing, something gnmic is doing, or where to begin troubleshooting.

Example graphs from two different devices with different hardware (MX10k vs MX204) running different Junos software versions.

[Screenshots: traffic graphs from both devices, 2023-07-04 22:45:55 and 22:46:13]
karimra commented 1 year ago

I would be surprised if this is coming from the routers if it's happening on all devices. It looks like the period is more like 50 minutes.

Could you share the Prometheus output config used for gNMIc? Do you have export-timestamps set to true there?

I would also look at the PromQL expression used in Grafana; I assume it uses the rate() function. Which interval is used for the rate function? It's usually recommended to use $__rate_interval.

jasonyates-thg commented 1 year ago

I'd agree they are every 50 minutes. My output config is largely the default:

# Configure outputs
outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215

My PromQL expression is using irate rather than rate. If I change to rate, the dips in the graph become less severe, but they are still there. The expression is using $__rate_interval.

Example query:

irate(interfaces_interface_state_counters_in_octets{source=~"router1|router2", description!=""}[$__rate_interval])*8
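For comparison, the rate() variant of the same query (only the function changes):

rate(interfaces_interface_state_counters_in_octets{source=~"router1|router2", description!=""}[$__rate_interval])*8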

irate:

[Screenshot: 2023-07-05 10:21:02]

rate:

[Screenshot: 2023-07-05 10:21:11]
hellt commented 1 year ago

Maybe worth checking the raw collected metrics over the window where the dip is seen, to check what gnmic actually received from the NE.
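For example, something along these lines dumps what the device returns for a single poll (address, port and credentials are placeholders):

# one-shot subscription; the printed JSON includes the notification timestamp,
# so both the counter values over the dip window and the timestamp unit can be checked
gnmic -a router1:57400 -u <user> -p <pass> --skip-verify \
  subscribe --path "/interfaces/interface/state/counters/in-octets" --mode once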

karimra commented 1 year ago

Could you try with export-timestamps: true? Without it, Prometheus assigns the scrape request timestamp to the metrics instead of the device timestamp.
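i.e., with the output config you shared earlier:

outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    export-timestamps: true # <-- added
    event-processors:
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215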

The second set of graphs shows that the dips are not in sync across devices while maintaining the ~50 minute dip interval. I would then manually check the values sent by the device for potential lower counter values, as @hellt pointed out above.

jasonyates-thg commented 1 year ago

So, adding export-timestamps: true fixed the dips in the graph (see below). However, it has also had an adverse impact on the majority of devices whereby the interface metrics are no longer being recorded.

[Screenshots: 2023-07-05 18:02:49 and 18:39:17]

I can't find any reason for it. All devices are synced to an NTP time source and have their timezone set to UTC. What's equally interesting is that it only impacted interface metrics; subscriptions to other metrics such as BGP continued to work correctly.

Looking at the metrics exposed by Prometheus, it's reporting the same number of samples scraped, so my only assumption is that Prometheus is dropping them for some reason.
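One way to check that assumption would be Prometheus's own scrape instrumentation, which counts rejected samples (these metric names come from Prometheus itself, not from gnmic):

# samples rejected because their timestamp falls outside the accepted bounds (e.g. too far in the past)
rate(prometheus_target_scrapes_sample_out_of_bounds_total[5m])
# samples rejected because they are older than the latest sample of the same series
rate(prometheus_target_scrapes_sample_out_of_order_total[5m])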

karimra commented 1 year ago

Are the timestamps sent by the devices in nanoseconds? Is the Prometheus server also synced?

jasonyates-thg commented 1 year ago

They are sent by the device in milliseconds - example timestamp 1688578740000
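For scale, a millisecond epoch timestamp for mid-2023 has 13 digits, while the nanosecond timestamps gNMI expects have 19; the two differ by a factor of 1,000,000:

1688578740000            # value reported by the device (13 digits, ms since epoch)
1688578740000000000      # the same instant in ns since epoch (x 1,000,000)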

I confirmed that both the server running gnmic and the server running prometheus are sync'd to the same time source, and the reported time between those servers and the Juniper devices is correct.

jasonyates-thg commented 1 year ago

By way of a visual example, I have 2 routers in this graph: both in the same physical location, both sync'd to the same NTP time source, both showing the same time when running show system status. Both are subscribed to by the same gnmic collector (I run a pair using consul for target registration; consul shows the target lock is acquired by gnmic01 for both devices).

One continued to work whilst I had export-timestamps: true set, while the other did not. Once I removed the setting, the non-working device started working again.
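For reference, the clustering side of that dual-collector setup looks roughly like this (a sketch; the cluster and instance names are illustrative, only the consul address is taken from the earlier output config):

clustering:
  cluster-name: collectors
  instance-name: gnmic01
  locker:
    type: consul
    address: 10.249.0.250:8500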

[Screenshot: 2023-07-05 20:16:04]
karimra commented 1 year ago

gNMIc's Prometheus output converts timestamps from nanoseconds to milliseconds. The gNMI notification timestamp should be sent by the device in ns, while Prometheus expects ms. Can you double-check using the CLI which timestamp unit the device uses?

If the device sends timestamps in ms, there is a processor to convert them to ns:

processors:
  ts-ms-to-ns:
    event-starlark:
      source: |
        def apply(*events):
          for e in events:
            # convert the timestamp from milliseconds to nanoseconds
            e.timestamp = int(e.timestamp * 1000 * 1000)
          return events

Then add the processor name under the prometheus output:

outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - ts-ms-to-ns # <-- new processor
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215