nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Create and put into place a way to monitor & alert on the NERC Switches #349

Open joachimweyl opened 8 months ago

joachimweyl commented 8 months ago

Motivation

When one of the NERC switches went down, we discovered that it had already been down for multiple days. We realized we need a more robust way to monitor the switches.

Completion Criteria

Monitoring and alerting tested and working for the switches on the NERC network.

Description

Completion dates

Desired - 2024-01-25

Required - TBD

joachimweyl commented 8 months ago

@jtriley is this something that Nick and/or Christian have discussed with you?

hpdempsey commented 7 months ago

This telemetry is also important for our researchers. They don't need access to the switches, but we do need access to at least the basic telemetry for research (e.g. pps on VLANs).

joachimweyl commented 7 months ago

We can only directly track the switches that are used 100% by NERC; we do not get access to switches shared with other projects. @jtriley, for the switches that are 100% NERC, do we have an automated alerting setup?

jtriley commented 7 months ago

No, that will need to be implemented. Access to the devices is open from the infra admin VLAN, though, for ICMP ping tests.

schwesig commented 4 months ago

@jtriley As discussed in the meeting just now: if pinging the switches is the only option, please add a list of pingable switches, plus any other ideas on how to "reach/monitor" them. Thanks.

schwesig commented 4 months ago

ideas:

schwesig commented 4 months ago

https://docs.openshift.com/container-platform/4.13/observability/network_observability/network-observability-operator-release-notes.html

schwesig commented 4 months ago

jtriley commented 3 months ago

@schwesig Apologies for the delay. I had to get some ACLs in place, but ICMP should now be open to the switches from the observability cluster hosts. These are the current IPs to monitor via ping test:

10.30.4.[10-14]
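For reference, a minimal sketch of turning the bracket notation above into individual addresses that a ping check could iterate over. It assumes the notation means the inclusive range of the last octet; `expand_range` and `SWITCH_IPS` are hypothetical names for illustration, not part of any existing NERC tooling.

```python
import ipaddress
import re


def expand_range(spec: str) -> list[str]:
    """Expand 'a.b.c.[lo-hi]' into individual IPv4 address strings.

    Assumption: the bracket covers an inclusive range of the last octet.
    """
    match = re.fullmatch(r"(\d+\.\d+\.\d+)\.\[(\d+)-(\d+)\]", spec)
    if match is None:
        raise ValueError(f"unsupported range notation: {spec!r}")
    prefix = match.group(1)
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"empty range in {spec!r}")
    # ipaddress.IPv4Address rejects any octet outside 0-255.
    return [str(ipaddress.IPv4Address(f"{prefix}.{n}")) for n in range(low, high + 1)]


# Hypothetical constant holding the switch addresses listed in this comment.
SWITCH_IPS = expand_range("10.30.4.[10-14]")
```

Under that assumption the range expands to five addresses, 10.30.4.10 through 10.30.4.14.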

schwesig commented 1 month ago

Still working on figuring out whether a simple ping can do the job, or whether we need SNMP, flow monitoring, or the Network Observability Operator: https://docs.openshift.com/container-platform/4.13/observability/network_observability/network-observability-operator-release-notes.html

So far I am focussing on the ping as an entry-level solution.
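A rough sketch of what the ping-based entry solution might look like, not a decided design. It assumes a Linux host with the iputils `ping` command (`-c` for count, `-W` for the reply timeout in seconds), and it only flags a switch after several consecutive failed rounds so that a single dropped packet does not page anyone. The function names and the threshold of 3 are hypothetical.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo request gets a reply.

    Assumption: Linux iputils `ping` flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def update_failures(
    failures: dict[str, int],
    hosts: Iterable[str],
    ping: Callable[[str], bool] = ping_once,
) -> dict[str, int]:
    """Run one ping round and return the new consecutive-failure counts."""
    # A reply resets the counter; a miss increments it.
    return {h: 0 if ping(h) else failures.get(h, 0) + 1 for h in hosts}


def hosts_to_alert(failures: dict[str, int], threshold: int = 3) -> list[str]:
    """Return hosts that have missed at least `threshold` rounds in a row."""
    return sorted(h for h, n in failures.items() if n >= threshold)
```

A scheduler (cron job, Kubernetes CronJob, or a simple loop) would call `update_failures` once per interval, keep the returned counts between rounds, and pass the result of `hosts_to_alert` to whatever alerting channel gets chosen. The `ping` parameter is injectable so the logic can be tested without network access.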