kytos-ng / of_core

Kytos Main OpenFlow Network Application (NApp)
MIT License

feat: identify and expose when connections are being closed or crashing constantly #101

Open viniarck opened 1 year ago

viniarck commented 1 year ago

Problem:

Network operators who deploy Kytos-ng in production and use of_core need to be able to identify (and hook into external healthcheck mechanisms) when OpenFlow connections aren't stabilizing, either because of packet/handshake problems or generalized crashes. Our Python runtime shouldn't struggle to handle connections as long as the number of connections is reasonable; if it does, of_core should expose that this is happening (maybe through an endpoint) so that it can be used externally to spin up and switch over to a different kytosd instance. This can help with recoverable errors.
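To make the idea concrete, here is a minimal sketch of how connection closes could be counted over a sliding time window and turned into a health status that an endpoint might return. The class name `ConnectionFlapTracker`, its parameters, and the threshold values are all hypothetical; nothing here is an existing of_core API, and where `record_close` would be called from is left open.

```python
import time
from collections import deque


class ConnectionFlapTracker:
    """Hypothetical helper: counts connection closes in a sliding window."""

    def __init__(self, max_closes=10, window_seconds=60.0, clock=time.monotonic):
        # max_closes and window_seconds are illustrative defaults, not tuned values.
        self.max_closes = max_closes
        self.window_seconds = window_seconds
        self._clock = clock
        self._closes = deque()

    def record_close(self):
        """Call whenever an OpenFlow connection is closed or crashes."""
        self._closes.append(self._clock())

    def _prune(self):
        # Drop timestamps that fell out of the sliding window.
        cutoff = self._clock() - self.window_seconds
        while self._closes and self._closes[0] < cutoff:
            self._closes.popleft()

    def status(self):
        """Return a dict that a healthcheck endpoint could serialize as JSON."""
        self._prune()
        count = len(self._closes)
        return {
            "closes_in_window": count,
            "window_seconds": self.window_seconds,
            "healthy": count <= self.max_closes,
        }
```

An external healthcheck could poll this status and trigger a switchover once `healthy` turns false; whether a rate threshold is the right signal is one of the points still open for discussion.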

Beyond the code-related implementation, network operators should also have alerts for how many errors or tracebacks have happened over time. We can have this readily available in ES with Kibana; alerts are a premium ES feature, but the data is there, so a script could also poll or query it:

[Screenshots: 20230215_150853, 20230215_150842]
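As a rough sketch of the polling script mentioned above, the snippet below builds a query for the Elasticsearch `_count` API and compares the result against a threshold. The field names `levelname` and `@timestamp`, the index name passed in, and the threshold are assumptions about how the logs are indexed, not details confirmed in this issue.

```python
import json
import urllib.request


def build_error_count_query(minutes, level_field="levelname", time_field="@timestamp"):
    """Build a _count query body for ERROR logs in the last `minutes` minutes.

    level_field and time_field are assumed names; adjust to the real mapping.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {level_field: "ERROR"}},
                    {"range": {time_field: {"gte": f"now-{minutes}m"}}},
                ]
            }
        }
    }


def fetch_error_count(base_url, index, minutes, timeout=10):
    """POST the query to <base_url>/<index>/_count and return the count."""
    body = json.dumps(build_error_count_query(minutes)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["count"]


def should_alert(error_count, threshold):
    """Alert once the error count reaches the (operator-chosen) threshold."""
    return error_count >= threshold
```

A cron job could call `fetch_error_count` every few minutes and notify operators when `should_alert` returns true, which covers the basic case without the premium alerting feature.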

cc'ing @italovalcy for his info

This issue still needs further discussion, but overall that's the problem we need to solve.

italovalcy commented 1 year ago

I agree, @viniarck. This feature can be part of a watchdog NApp or something similar, which consolidates all validations (not only of_core) and translates them into an operational status (which could indicate success, failure, or partial failure, including failure in non-critical components, and so on).
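One possible shape for that consolidation, as a hedged sketch: each validation reports whether it passed and whether its component is critical, and the watchdog reduces them to a single status. The names `OperationalStatus` and `consolidate` are hypothetical, and the rule that any critical failure means overall failure is an assumption made for illustration.

```python
from enum import Enum


class OperationalStatus(Enum):
    SUCCESS = "success"
    PARTIAL_FAILURE = "partial_failure"
    FAILURE = "failure"


def consolidate(checks):
    """Reduce per-component checks to one operational status.

    `checks` maps a component name to a tuple (ok, critical).
    Assumed rule: a failed critical component means FAILURE; failures
    limited to non-critical components mean PARTIAL_FAILURE.
    """
    failed = [name for name, (ok, _critical) in checks.items() if not ok]
    if not failed:
        return OperationalStatus.SUCCESS
    if any(checks[name][1] for name in failed):
        return OperationalStatus.FAILURE
    return OperationalStatus.PARTIAL_FAILURE
```

For example, `consolidate({"of_core": (True, True), "ui": (False, False)})` yields `OperationalStatus.PARTIAL_FAILURE` under the assumed rule, since only a non-critical component failed.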