openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Broken Connection Event/Alert #419

Open vkumarsharma opened 4 months ago

vkumarsharma commented 4 months ago

Hi,

My issue is the ability to raise an alert when a target goes down.

If I turn off a target router, I see an error logged in the log file and gnmic trying to re-establish the connection. Is there a way to have an event/trigger generated for this? Or should I just write my own script to scrape the log file for this event and raise an alert from the script?

Thank you

karimra commented 4 months ago

Can you elaborate a bit on what kind of event/alert you want to raise? From gNMIc to where? With which protocol? ...

vkumarsharma commented 4 months ago

Thanks, Karim, for responding. Here is my use case:

1) I have an observability frontend on all my network devices.
2) I use gnmic to subscribe to these network devices (routers/switches) to monitor their state and report back to my frontend using Kafka. For instance, if an interface goes down, I get an event through the subscription and put it in a Kafka topic, which different software components can consume to report the state to the frontend user quickly.
3) However, if the router itself goes down, I cannot detect that directly from any gnmic event or trigger. The only way I can see right now is through the log files, where gnmic reports a broken connection.
4) So I have to run a script that reads the log file continuously, looks for the relevant error, and then puts a message in a Kafka queue to communicate that a router has gone down. Similarly, when I see a log entry for the connection being re-established, I send a message to communicate that the router is back online.

So my question is whether the solution in point 4 is the best I can do with gnmic, or whether there is a more streamlined approach (through some trigger/event in gnmic itself).

Thanks again

mwdomino commented 1 week ago

To solve a similar issue we use a single subscription (device version) as a "heartbeat", which is sent every minute. If our application detects that it has not received a message on that topic for more than 3 minutes (3 intervals), we consider the device to have gone offline and we increment a Prometheus counter in our application. We then clear this Prometheus counter if we see the device begin sending messages again.
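For anyone wanting to try the heartbeat idea, here is a minimal sketch of the timeout logic in Python. It is not our production code and not part of gnmic. The class and method names (`HeartbeatMonitor`, `record`, `check`) are made up for illustration. It assumes the consuming application calls `record` for every heartbeat message it reads and calls `check` on a timer. The Kafka consumer and the Prometheus metric are left out, and a target is only tracked once it has sent at least one heartbeat.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per target and reports up/down transitions.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    interval_s: float = 60.0   # heartbeat subscription interval (one minute)
    missed_allowed: int = 3    # silent intervals tolerated before "down"
    last_seen: dict = field(default_factory=dict)
    down: set = field(default_factory=set)

    def record(self, target, now=None):
        """Call for every heartbeat message. Returns "up" on recovery, else None."""
        now = time.monotonic() if now is None else now
        self.last_seen[target] = now
        if target in self.down:
            self.down.discard(target)
            return "up"
        return None

    def check(self, now=None):
        """Call periodically. Returns the targets newly considered down."""
        now = time.monotonic() if now is None else now
        timeout = self.interval_s * self.missed_allowed
        newly_down = [
            target for target, seen in self.last_seen.items()
            if target not in self.down and now - seen > timeout
        ]
        self.down.update(newly_down)
        return newly_down
```

Each target returned by `check` could be published as a "router down" message, and each `"up"` returned by `record` as a "router back online" message, which would cover the Kafka use case described above without reading the log file.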