I would be surprised if this is coming from the routers if it's happening on all devices. It looks like the period is more like 50 minutes.

Could you share the Prometheus output config used for gNMIc? Do you have export-timestamps set to true there?

I would also look at the PromQL expression used in Grafana, I assume it uses the rate() function. Which interval is used for the rate function? It's usually recommended to use $__rate_interval.
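For illustration, the query shape I'd expect looks roughly like this (the metric and label names below are just placeholders, not necessarily yours):

    # per-second rate over Grafana's recommended window, times 8 for bits/s
    rate(interface_in_octets{source="router1"}[$__rate_interval]) * 8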
I'd agree they are every 50 minutes. My output config is largely the default
# Configure outputs
outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215
My PromQL expression is using irate rather than rate. If I change to rate, the dips in the graph become less severe, but they are still there. The expression is using $__rate_interval.
Example query
irate(interfaces_interface_state_counters_in_octets{source=~"router1|router2", description!=""}[$__rate_interval])*8
irate: (screenshot omitted)
rate: (screenshot omitted)
Maybe worth checking the raw collected metrics over the window where the dip is seen, to check what gNMIc actually received from the NE.
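A sketch of that check, reusing the metric name from the query above (the host placeholder is hypothetical): graph the raw counter without any rate function across the dip window and look for the value actually decreasing, and/or curl gNMIc's own /metrics endpoint to see what it exposes:

    # raw counter, no rate(), graphed over the window where the dip appears
    interfaces_interface_state_counters_in_octets{source="router1", description!=""}

    # or inspect what the gNMIc Prometheus output is currently exposing
    curl -s http://<gnmic-host>:9804/metrics | grep in_octets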
Could you try with export-timestamps: true? Without it, Prometheus assigns the scrape request timestamp to the metrics instead of the device timestamp.
The second set of graphs shows that the dips are not in sync across devices while keeping the ~50 minute dip interval. I would then manually check the values sent by the device for potentially lower counter values, as @hellt pointed out above.
So, adding export-timestamps: true fixed the dips in the graph (see below). However, it has also had an adverse impact on the majority of devices, whereby the interface metrics are no longer being recorded.

I can't find any reason for it. All devices are synced to an NTP time source and have their timezone set to UTC. What's equally interesting is that it only impacted interface metrics; subscriptions to other metrics such as BGP continued to work correctly.

Looking at the metrics exposed by Prometheus, it's reporting the same number of samples scraped, so my only assumption is that Prometheus is dropping them for some reason?
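One way to confirm whether Prometheus is discarding the samples is to look at its own target-scrape counters; a sketch, assuming the standard Prometheus scrape health metrics (double-check the exact names on your version):

    # samples rejected because their timestamp falls outside the accepted time bounds
    rate(prometheus_target_scrapes_sample_out_of_bounds_total[5m])

    # samples rejected because they go backwards in time within a series
    rate(prometheus_target_scrapes_sample_out_of_order_total[5m])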
Are the timestamps sent by the devices in nanoseconds? Is the Prometheus server also synced?
They are sent by the device in milliseconds, for example 1688578740000.

I confirmed that both the server running gNMIc and the server running Prometheus are synced to the same time source, and the reported time between those servers and the Juniper devices is correct.
By way of a visual example: I have 2 routers in this graph, both in the same physical location, both synced to the same NTP time source, both showing the same time when running show system status. Both are subscribed to by the same gNMIc collector (I run a pair using Consul for target registration; Consul shows the target lock is acquired by gnmic01 for both devices).

One continued to work while I had export-timestamps: true set, while the other did not. Once I removed the configuration, the non-working device started working again.
The gNMIc Prometheus output converts timestamps from nanoseconds to milliseconds. The gNMI notification timestamp should be sent by the device in ns, while Prometheus expects ms. Can you double-check using the CLI which timestamp unit the device uses?
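A quick way to do that check (a sketch; the address, credentials and path below are placeholders, and flags may vary slightly between gNMIc versions):

    # pull a single notification and look at the timestamp field
    gnmic -a <router-address> -u <user> -p <pass> --skip-verify \
      subscribe --path /interfaces/interface/state/counters \
      --mode once --format prototext

A 13-digit value such as 1688578740000 is milliseconds, a 19-digit value is nanoseconds. If a ms value goes through the ns-to-ms conversion, it ends up roughly 28 minutes after the Unix epoch (January 1970), which Prometheus would reject and which would explain why only those series disappear.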
If the device sends timestamps in ms, there is a processor to convert them to ns:
processors:
  ts-ms-to-ns:
    event-starlark:
      source: |
        def apply(*events):
          for e in events:
            # change timestamp to ns
            e.timestamp = int(e.timestamp * 1000 * 1000)
          return events
Then add the processor name under the prometheus output:
outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - ts-ms-to-ns # <-- new processor
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215
I'm using gNMIc to monitor various Juniper MX devices. I'm subscribing to interface metrics with a sample interval of 10s. My output is Prometheus, which is configured to scrape every 20s.
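For context, the relevant parts of the setup look roughly like this (a sketch rather than my literal config; the subscription name, path and job name are illustrative):

    # gNMIc subscription (sketch)
    subscriptions:
      juniper-interfaces:
        paths:
          - /interfaces/interface/state/counters
        mode: stream
        stream-mode: sample
        sample-interval: 10s

    # Prometheus scrape job (sketch)
    scrape_configs:
      - job_name: gnmic
        scrape_interval: 20s
        consul_sd_configs:
          - server: 10.249.0.250:8500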
Once I graph the data, there is a routine drop in traffic across every device and every interface roughly every 60 minutes. At this point I'm unsure whether it's something the Juniper device is doing or something gNMIc is doing, or where to begin troubleshooting.
Example graphs from 2 different devices, different hardware (MX10k vs MX204) running different Junos software versions.