aabaris opened 1 year ago
It would also be helpful to have alerts for automated operator upgrades.
Follow-up on the discussion of metrics vs. events.
Ideally all the parameters I listed in the issue should be treated as metrics.
In addition, memory and disk errors are important to track in event/system logs as well, because certain types of problems do get logged, but don't always show up in counters.
@aabaris I would like to look more into the list of alerts you would like to see. Are you already monitoring any of these? Can we access this kind of data from either OpenShift metrics or logs? I would like to get an idea because I'm not yet familiar with how to report on some of these.
@computate there is more than one way to get at each of these metrics. A lot of hardware events should be readable both in-band and out-of-band. I'll provide some examples, but please keep in mind there are other ways of doing this as well.
Server power supply failure
```
$ ipmitool sdr type "Power Supply"
PS Redundancy | 77h | ok |  7.1 | Fully Redundant
Status        | 85h | ok | 10.1 | Presence detected
Status        | 86h | ok | 10.2 | Presence detected
```
Memory errors: either vendor-specific tools or the EDAC driver. Some relevant discussion here: https://access.redhat.com/discussions/3545531 and here: https://www.admin-magazine.com/HPC/Articles/Memory-Errors
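If it helps, here is a rough sketch of reading the EDAC counters from sysfs and printing them in textfile-collector format (this assumes the EDAC driver is loaded; the exact sysfs paths can vary by kernel and hardware, and node_exporter may already expose similar counters):

```python
#!/usr/bin/env python3
"""Sketch: read EDAC memory error counters from sysfs.

Assumes the EDAC driver is loaded and exposes counters under
/sys/devices/system/edac/mc/; paths may vary by kernel and hardware.
"""
from pathlib import Path

def edac_counts():
    counts = {}
    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
        for name in ("ce_count", "ue_count"):  # correctable / uncorrectable
            f = mc / name
            if f.is_file():
                counts[(mc.name, name)] = int(f.read_text().strip())
    return counts

if __name__ == "__main__":
    for (controller, counter), value in edac_counts().items():
        # node_exporter textfile format: metric{labels} value
        print(f'edac_{counter}{{controller="{controller}"}} {value}')
```

Output in that format could be dropped into a node_exporter textfile collector directory, if we go that route.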
Local disk failure and predictive failure: either smartctl or RAID-controller-specific utilities.
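A rough sketch of the smartctl approach (assumes smartmontools is installed and the script runs as root; the device list is just a placeholder):

```python
#!/usr/bin/env python3
"""Sketch: check SMART overall health for a list of disks via smartctl.

Assumes smartmontools is installed and the script runs with enough
privileges; the device list is a placeholder.
"""
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device list

def smart_healthy(device: str) -> bool:
    # `smartctl -H` prints an overall health assessment; wording differs
    # between ATA ("PASSED") and SCSI ("OK") devices.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return "PASSED" in result.stdout or "OK" in result.stdout

if __name__ == "__main__":
    for dev in DEVICES:
        state = 0 if smart_healthy(dev) else 1
        print(f'smart_health_failed{{device="{dev}"}} {state}')
```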
Network interface errors: hopefully this is the easiest one. Prometheus node_exporter collects these by default; they are also available in the /proc/net structures and shown in ifconfig output.
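For an ad-hoc check, something like this reads the same counters straight from /proc/net/dev (node_exporter's netdev collector should already expose them, so this is only for illustration):

```python
#!/usr/bin/env python3
"""Sketch: read per-interface error counters from /proc/net/dev."""

def interface_errors():
    errors = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:      # skip the two header lines
            name, counters = line.split(":", 1)
            fields = counters.split()
            # field 2 is receive errs, field 10 is transmit errs
            errors[name.strip()] = (int(fields[2]), int(fields[10]))
    return errors

if __name__ == "__main__":
    for iface, (rx, tx) in interface_errors().items():
        print(f"{iface}: rx_errs={rx} tx_errs={tx}")
```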
Local disk filesystems filling up: perhaps just running df on the host, but keeping in mind that df sometimes hangs when there is a host issue.
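As an alternative to shelling out to df, here is a sketch using statvfs for a fixed list of local mount points (the mount list is a placeholder, and statvfs can still block on a hung network mount, so I'd keep it to local filesystems):

```python
#!/usr/bin/env python3
"""Sketch: report filesystem usage for a fixed list of local mount points."""
import os

MOUNTS = ["/", "/var", "/boot"]  # placeholder mount points

def usage_percent(path: str) -> float:
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return 100.0 * (total - avail) / total if total else 0.0

if __name__ == "__main__":
    for mount in MOUNTS:
        pct = usage_percent(mount)
        flag = "  <-- over 80%" if pct > 80 else ""
        print(f"{mount}: {pct:.1f}% used{flag}")
```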
Internal cooling fan failure: ipmitool sdr or vendor-specific tools.
On a more general level, maybe the ability to collect the data reported by ipmitool would be a good start. This could be done at the bare-metal host OS level using the ipmitool utility and the ipmi kernel driver. It could also be done out-of-band by using ipmitool to authenticate to the remote service processor (BMC, iDRAC, iLO, etc.), but that would require additional credentials and network access to the out-of-band management network, which should come along with appropriate security discussions and considerations.
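As a rough illustration of the in-band option, here is a sketch that shells out to `ipmitool sdr` and flags any sensor that is not "ok" (assumes the ipmi kernel modules are loaded and ipmitool is installed; the exact columns and status strings can vary by vendor):

```python
#!/usr/bin/env python3
"""Sketch: run `ipmitool sdr` and flag sensors that are not "ok".

Assumes in-band IPMI access (ipmi kernel modules loaded, ipmitool
installed); output columns can vary by hardware.
"""
import subprocess

def sensor_states():
    out = subprocess.run(
        ["ipmitool", "sdr"], capture_output=True, text=True, check=True
    ).stdout
    states = {}
    for line in out.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 3:
            name, reading, status = parts[0], parts[1], parts[2]
            states[name] = (reading, status)
    return states

if __name__ == "__main__":
    for name, (reading, status) in sensor_states().items():
        if status not in ("ok", "ns"):  # "ns" usually means no reading/not present
            print(f"SENSOR NOT OK: {name} = {reading} ({status})")
```

The same script could point at the BMC out-of-band by adding `-I lanplus -H <bmc> -U <user>` arguments, with the credential and network caveats above.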
Also, this may or may not need an additional issue to be filed, but last week an event caused some of the prod cluster nodes to reboot. Nodes wrk-8, wrk-9, wrk-11, and wrk-14 did not come back, and we received no alerts about them being down. We would very much like to know if any of the cluster nodes are down. This is also an interesting example, because the FX chassis hosting these nodes has a flapping internal fan failure.
Lastly, regarding the metrics vs. logs approach, I think these are all metrics whose status should be queryable at any time or at an interval of our choosing. In addition, many of these events generate log entries, which we would find desirable to identify. My feeling is that metrics would be the priority and additional log views a wish.
Please let me know if you have any questions, if I can provide more specifics, or if there is anything you'd like to discuss.
Thank you very much!
This PR is for network error metrics: https://github.com/OCP-on-NERC/nerc-ocp-config/pull/214
These PRs are for IPMI metrics:
@aabaris @larsks @jtriley Do you know of alerts we can set up for the new ipmi metrics?
@computate There is a lot of good information and it's overwhelming to me, but let's start with one:
`ipmi_fan_speed_state`: we want an alert on any value that is not 0.
In the metrics browser, the column we are interested in shows up as "Value #A", and I can only sort on it rather than filter. I will investigate and try to make a PR that gives this metric a more sensible name, but if you have a pointer on where to start I would appreciate it.
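For context, this is roughly the condition I have in mind; here is a sketch of querying it ad hoc against a Prometheus/Thanos query endpoint (the URL, token handling, and label names are placeholders, not the actual NERC setup):

```python
#!/usr/bin/env python3
"""Sketch: ad-hoc query for fan sensors in a non-zero state.

PROM_URL, PROM_TOKEN, and the label names are assumptions for
illustration; a real endpoint will need its own auth.
"""
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "https://prometheus.example.com")
TOKEN = os.environ.get("PROM_TOKEN", "")

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "ipmi_fan_speed_state != 0"},
    headers={"Authorization": f"Bearer {TOKEN}"} if TOKEN else {},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    value = result["value"][1]
    print(f'{labels.get("instance", "?")} {labels.get("name", labels)} state={value}')
```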
@computate to help keep this organized I have updated the description to include check boxes for the pieces that were requested. Please use this as a way to track what pieces you have completed of the original request. The PR @computate provided checks off cooling and power supply issues.
Remaining items I listed are:
* Memory errors
* Local disk failure and predictive failure
* Local disk filesystems filling up
I propose that I attempt adding monitoring and alerting for these, using the changes (PRs) @computate and @larsks made when adding the IPMI exporter and monitoring as examples.
OK @aabaris, so is it true that these alerts I defined are not sufficient for the remaining items you listed?
* `CustomNetworkInterfaceErrors` for memory errors
* `CustomCephStorageFillingUpPredicted` for local disk failure and predictive failure
* `CustomCephStorageFillingUp` for local disk filesystems filling up
It sounds like you are willing to be assigned to this task from here, is that correct?
Not to be nitpicky, but I think the list drifted slightly off course over time (understandably, with so many items on it).
Items are:
Yes, they can be assigned to me.
> It would also be helpful to have alerts for automated operator upgrades.
I agree and would like to add this to the main checklist of items. @joachimweyl, since you've made updates to this issue before, could you please advise on the proper process for adding to the top-level checklist? Should I just edit directly, or are there other steps I need to take so it's tracked as a change?
Yes, editing directly is fine. GitHub tracks all changes.
@aabaris some resources for writing custom exporters:
* The textfile collector in node_exporter lets you export metrics from text files.
* There are a number of articles that have examples of using the Python module to write exporters (e.g. 1, 2, etc.).
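For illustration, here is a minimal exporter sketch using the prometheus_client module (the metric name, label, and port are made up; a real exporter would collect actual hardware values from something like the scripts above):

```python
#!/usr/bin/env python3
"""Sketch: a minimal custom exporter using prometheus_client.

The metric name, label, and port are placeholders for illustration.
"""
import random
import time

from prometheus_client import Gauge, start_http_server

EXAMPLE_METRIC = Gauge(
    "example_hardware_fault_state",
    "Example gauge: 0 means healthy, non-zero means a fault was detected",
    ["component"],
)

if __name__ == "__main__":
    start_http_server(9200)  # exposes a scrape target on :9200/metrics
    while True:
        # Replace this with a real collection step (ipmitool, smartctl, ...).
        EXAMPLE_METRIC.labels(component="demo").set(random.choice([0, 0, 0, 1]))
        time.sleep(30)
```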
Update:
Established a place to host the deployment: https://github.com/nerc-project/nrpe-exporter
Planning on deploying on nerc-shift-0 cluster first before approaching the infra cluster.
Pursuing deployment of a pod with an environment capable of executing the already-available NRPE check scripts. Evaluating existing projects that do this, but also considering the textfile collector option (suggested by Lars).
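One possible shape for the textfile-collector option is a small wrapper that runs an existing NRPE/Nagios check and writes its exit code to a .prom file (the check command, metric name, and directory below are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: wrap an existing Nagios/NRPE check and write its exit code in
node_exporter textfile-collector format.

CHECK_CMD and TEXTFILE_DIR are placeholders; the directory must match the
--collector.textfile.directory the node_exporter is started with.
"""
import os
import subprocess
import tempfile

CHECK_NAME = "check_example"  # placeholder
CHECK_CMD = ["/usr/lib64/nagios/plugins/check_load", "-w", "5,4,3", "-c", "10,8,6"]
TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # placeholder

def main():
    result = subprocess.run(CHECK_CMD, capture_output=True, text=True, check=False)
    # Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    body = (
        f"# HELP nrpe_check_state Exit code of {CHECK_NAME}\n"
        f"# TYPE nrpe_check_state gauge\n"
        f'nrpe_check_state{{check="{CHECK_NAME}"}} {result.returncode}\n'
    )
    # Write atomically so node_exporter never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, os.path.join(TEXTFILE_DIR, f"{CHECK_NAME}.prom"))

if __name__ == "__main__":
    main()
```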
@aabaris what progress have we made? Are there any blockers?
@joachimweyl still working on it in my dev environment. Not a lot of progress, because I've been working on the higher-priority task of adding GPU nodes to OpenStack.
@aabaris letting you know this topic is back on the Wed meeting agenda.
@aabaris is work on this continuing while we get a new Metrics cluster up and running or is this awaiting that work to move forward?
I've taken a detour from this work in light of what I've learned about observability. I feel that we are not making the most of the metrics that are already available.
I would prefer if we could take the time to refine the plan and approach before turning this into a formal, project-manageable list of bullet points, but I'll share what I am pursuing:
1) visibility into when components are no longer reporting, and surfacing that as events (whether because observability is failing or because a component goes offline). I am finding gaps in the observability data that get smoothed over with time. I'm also investigating the "up" metric, which I suspect is not propagating the same way in observability as it does in the regular OpenShift monitoring (I need to dig in further to speak to this accurately).
2) making a view of alerts, separate from notifications: https://github.com/OCP-on-NERC/nerc-ocp-config/pull/259
3) comparing the data in the default Prometheus internal to each individual OpenShift cluster vs. what gets pushed out to observability (small steps: https://github.com/OCP-on-NERC/nerc-ocp-config/commit/bcfa3969a881e3ca6e4c19a710202b3f703d5de3); see the sketch after this list.
4) I would like to remove some of the existing alerts, or perhaps separate them into different tiers: alerts specific to workloads running on the cluster vs. alerts relevant to the overall health of the system. (For example, a process in some user pod using more than 80% CPU is actually good in my world; it means it's not blocked.)
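For item 3, here is a sketch of one way to diff the metric names visible in two endpoints, e.g. a cluster's in-cluster Prometheus vs. the central observability query endpoint (both URLs and the auth handling are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: diff the metric names visible in two Prometheus-compatible
query endpoints. URLs and tokens come from placeholder env vars."""
import os
import requests

def metric_names(base_url: str, token: str = "") -> set:
    resp = requests.get(
        f"{base_url}/api/v1/label/__name__/values",
        headers={"Authorization": f"Bearer {token}"} if token else {},
        timeout=30,
    )
    resp.raise_for_status()
    return set(resp.json()["data"])

if __name__ == "__main__":
    local = metric_names(os.environ["LOCAL_PROM_URL"], os.environ.get("LOCAL_TOKEN", ""))
    central = metric_names(os.environ["OBS_PROM_URL"], os.environ.get("OBS_TOKEN", ""))
    print(f"{len(local)} metric names locally, {len(central)} centrally")
    for name in sorted(local - central):
        print(f"missing centrally: {name}")
```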
The metrics listed at the top of this issue will be worth pursuing, but I feel they won't be helpful to us until we are in a better position to take advantage of what should, at least in theory, already be working.
In order to monitor the cluster for hardware faults, we would like to be able to monitor and alert on the following events:
Could you please let us know if this information is already being collected or how we could pursue the visibility of these events?
It would be most helpful for us to have some guidance on how to implement additional monitoring and metrics collection and integrate it into the monitoring infrastructure you have built. We are always adding checks to monitor a production system based on issues we experience. We would like to be able to collect data both at the operating system level and from the service processors (iLO, BMC, iDRAC, depending on the hardware vendor).
Thank you!