aabaris opened 1 year ago
It would also be helpful to have alerts for automated operator upgrades.
Follow-up on the discussion of metrics vs. events.
Ideally all the parameters I listed in the issue should be treated as metrics.
In addition, memory and disk errors are important to track in event/system logs as well, because certain types of problems do get logged, but don't always show up in counters.
@aabaris I would like to look more into the list of alerts you would like to see. Are you already monitoring any of these? Can we access this kind of data from either OpenShift metrics or logs? I would like to get an idea because I'm not yet familiar with how to report on some of these.
@computate there is more than one way to get at each of these metrics. A lot of hardware events should be readable both in-band and out-of-band. I'll provide some examples, but please keep in mind there are other ways of doing this as well.
Server power supply failure
```
$ ipmitool sdr type "Power Supply"
PS Redundancy | 77h | ok |  7.1 | Fully Redundant
Status        | 85h | ok | 10.1 | Presence detected
Status        | 86h | ok | 10.2 | Presence detected
```
Memory errors: either vendor-specific tools or the EDAC driver. Some relevant discussion here: https://access.redhat.com/discussions/3545531 and here: https://www.admin-magazine.com/HPC/Articles/Memory-Errors
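If it helps, here is a rough sketch of reading the EDAC counters from sysfs and printing them in textfile-collector format (this assumes the EDAC driver is loaded; the exact sysfs paths can vary by kernel and hardware, and node_exporter may already expose similar counters):

```python
#!/usr/bin/env python3
"""Sketch: read EDAC memory error counters from sysfs.

Assumes the EDAC driver is loaded and exposes counters under
/sys/devices/system/edac/mc/; paths may vary by kernel and hardware.
"""
from pathlib import Path

def edac_counts():
    counts = {}
    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
        for name in ("ce_count", "ue_count"):  # correctable / uncorrectable
            f = mc / name
            if f.is_file():
                counts[(mc.name, name)] = int(f.read_text().strip())
    return counts

if __name__ == "__main__":
    for (controller, counter), value in edac_counts().items():
        # node_exporter textfile format: metric{labels} value
        print(f'edac_{counter}{{controller="{controller}"}} {value}')
```

Output in that format could be dropped into a node_exporter textfile collector directory, if we go that route.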
Local disk failure and predictive failure: either smartctl or RAID-controller-specific utilities.
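A rough sketch of the smartctl approach (assumes smartmontools is installed and the script runs as root; the device list is just a placeholder):

```python
#!/usr/bin/env python3
"""Sketch: check SMART overall health for a list of disks via smartctl.

Assumes smartmontools is installed and the script runs with enough
privileges; the device list is a placeholder.
"""
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # placeholder device list

def smart_healthy(device: str) -> bool:
    # `smartctl -H` prints an overall health assessment; wording differs
    # between ATA ("PASSED") and SCSI ("OK") devices.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return "PASSED" in result.stdout or "OK" in result.stdout

if __name__ == "__main__":
    for dev in DEVICES:
        state = 0 if smart_healthy(dev) else 1
        print(f'smart_health_failed{{device="{dev}"}} {state}')
```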
Network interface errors: hopefully this is the easiest one. Prometheus node_exporter collects these by default; they are also available in the /proc/net structures and shown in ifconfig output.
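For an ad-hoc check, something like this reads the same counters straight from /proc/net/dev (node_exporter's netdev collector should already expose them, so this is only for illustration):

```python
#!/usr/bin/env python3
"""Sketch: read per-interface error counters from /proc/net/dev."""

def interface_errors():
    errors = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:      # skip the two header lines
            name, counters = line.split(":", 1)
            fields = counters.split()
            # field 2 is receive errs, field 10 is transmit errs
            errors[name.strip()] = (int(fields[2]), int(fields[10]))
    return errors

if __name__ == "__main__":
    for iface, (rx, tx) in interface_errors().items():
        print(f"{iface}: rx_errs={rx} tx_errs={tx}")
```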
Local disk filesystems filling up: perhaps just running df on the host, but keeping in mind that df sometimes hangs when there is a host issue.
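As an alternative to shelling out to df, here is a sketch using statvfs for a fixed list of local mount points (the mount list is a placeholder, and statvfs can still block on a hung network mount, so I'd keep it to local filesystems):

```python
#!/usr/bin/env python3
"""Sketch: report filesystem usage for a fixed list of local mount points."""
import os

MOUNTS = ["/", "/var", "/boot"]  # placeholder mount points

def usage_percent(path: str) -> float:
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return 100.0 * (total - avail) / total if total else 0.0

if __name__ == "__main__":
    for mount in MOUNTS:
        pct = usage_percent(mount)
        flag = "  <-- over 80%" if pct > 80 else ""
        print(f"{mount}: {pct:.1f}% used{flag}")
```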
Internal cooling fan failure: ipmitool sdr or vendor-specific tools.
On a more general level, maybe the ability to collect the data reported by ipmitool would be a good start. This could be done at the bare-metal host OS level using the ipmitool utility and the ipmi kernel driver. It could also be done out-of-band by using ipmitool to authenticate to the remote service processor (BMC, iDRAC, iLO, etc.), but that would require additional credentials and network access to the out-of-band management network, which should come along with appropriate security discussions and considerations.
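As a rough illustration of the in-band option, here is a sketch that shells out to `ipmitool sdr` and flags any sensor that is not "ok" (assumes the ipmi kernel modules are loaded and ipmitool is installed; the exact columns and status strings can vary by vendor):

```python
#!/usr/bin/env python3
"""Sketch: run `ipmitool sdr` and flag sensors that are not "ok".

Assumes in-band IPMI access (ipmi kernel modules loaded, ipmitool
installed); output columns can vary by hardware.
"""
import subprocess

def sensor_states():
    out = subprocess.run(
        ["ipmitool", "sdr"], capture_output=True, text=True, check=True
    ).stdout
    states = {}
    for line in out.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 3:
            name, reading, status = parts[0], parts[1], parts[2]
            states[name] = (reading, status)
    return states

if __name__ == "__main__":
    for name, (reading, status) in sensor_states().items():
        if status not in ("ok", "ns"):  # "ns" usually means no reading/not present
            print(f"SENSOR NOT OK: {name} = {reading} ({status})")
```

The same script could point at the BMC out-of-band by adding `-I lanplus -H <bmc> -U <user>` arguments, with the credential and network caveats above.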
Also, this may or may not need an additional issue to be filed, but last week an event caused some of the prod cluster nodes to reboot. Nodes wrk-8, wrk-9, wrk-11, and wrk-14 did not come back, and we received no alerts about them being down. We would very much like to know if any of the cluster nodes are down. This is also an interesting example, because the FX chassis hosting these nodes has a flapping internal fan failure.
Lastly, regarding the metrics vs. logs approach, I think these are all metrics whose status should be queryable at any time or at an interval of our choosing. In addition, many of these events generate log entries, which we would find desirable to identify. My feeling is that metrics would be the priority and additional log views a wish.
Please let me know if you have any questions, if I can provide more specifics, or if there is anything you'd like to discuss.
Thank you very much!
This PR is for network error metrics: https://github.com/OCP-on-NERC/nerc-ocp-config/pull/214
These PRs are for IPMI metrics:
@aabaris @larsks @jtriley Do you know of alerts we can set up for the new ipmi metrics?
@computate There is a lot of good information and it's overwhelming to me, but let's start with one:
`ipmi_fan_speed_state`: we want an alert on any value that is not 0.
In the metrics browser, the column we are interested in shows up as "Value #A", and I can only sort on it rather than filter. I will investigate and try to make a PR that gives this metric a more sensible name, but if you have a pointer on where to start I would appreciate it.
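For context, this is roughly the condition I have in mind; here is a sketch of querying it ad hoc against a Prometheus/Thanos query endpoint (the URL, token handling, and label names are placeholders, not the actual NERC setup):

```python
#!/usr/bin/env python3
"""Sketch: ad-hoc query for fan sensors in a non-zero state.

PROM_URL, PROM_TOKEN, and the label names are assumptions for
illustration; a real endpoint will need its own auth.
"""
import os
import requests

PROM_URL = os.environ.get("PROM_URL", "https://prometheus.example.com")
TOKEN = os.environ.get("PROM_TOKEN", "")

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "ipmi_fan_speed_state != 0"},
    headers={"Authorization": f"Bearer {TOKEN}"} if TOKEN else {},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    value = result["value"][1]
    print(f'{labels.get("instance", "?")} {labels.get("name", labels)} state={value}')
```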
@computate to help keep this organized I have updated the description to include check boxes for the pieces that were requested. Please use this as a way to track what pieces you have completed of the original request. The PR @computate provided checks off cooling and power supply issues.
Remaining items I listed are:
* Memory errors
* Local disk failure and predictive failure
* Local disk filesystems filling up
I propose that I attempt adding monitoring and alerting for these, using the changes (PRs) @computate and @larsks made when adding the IPMI exporter and monitoring as examples.
OK @aabaris, so is it true that these alerts I defined are not sufficient for the remaining items you listed?
* `CustomNetworkInterfaceErrors` for memory errors
* `CustomCephStorageFillingUpPredicted` for local disk failure and predictive failure
* `CustomCephStorageFillingUp` for local disk filesystems filling up
It sounds like you are willing to be assigned to this task from here, is that correct?
Not to be nitpicky, but I think the list drifted slightly off course over time (understandably, with so many items on it).
Items are:
Yes, they can be assigned to me.
> It would also be helpful to have alerts for automated operator upgrades.
I agree and would like to add this to the main checklist of items. @joachimweyl, since you've made updates to this issue before, could you please advise on the proper process for adding to the top-level checklist? Should I just edit directly, or are there other steps I need to take so it's tracked as a change?
Yes, editing directly is fine. GitHub tracks all changes.
@aabaris some resources for writing custom exporters:
* The textfile collector in node_exporter lets you export metrics from text files.
* There are a number of articles that have examples of using the Python module to write exporters (e.g. 1, 2, etc.).
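For illustration, here is a minimal exporter sketch using the prometheus_client module (the metric name, label, and port are made up; a real exporter would collect actual hardware values from something like the scripts above):

```python
#!/usr/bin/env python3
"""Sketch: a minimal custom exporter using prometheus_client.

The metric name, label, and port are placeholders for illustration.
"""
import random
import time

from prometheus_client import Gauge, start_http_server

EXAMPLE_METRIC = Gauge(
    "example_hardware_fault_state",
    "Example gauge: 0 means healthy, non-zero means a fault was detected",
    ["component"],
)

if __name__ == "__main__":
    start_http_server(9200)  # exposes a scrape target on :9200/metrics
    while True:
        # Replace this with a real collection step (ipmitool, smartctl, ...).
        EXAMPLE_METRIC.labels(component="demo").set(random.choice([0, 0, 0, 1]))
        time.sleep(30)
```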
Update:
Established a place to host the deployment: https://github.com/nerc-project/nrpe-exporter
Planning on deploying on nerc-shift-0 cluster first before approaching the infra cluster.
Pursuing deployment of a pod with an environment capable of executing the already-available NRPE check scripts. Evaluating existing projects that do this, but also considering the textfile collector option (suggested by Lars).
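One possible shape for the textfile-collector option is a small wrapper that runs an existing NRPE/Nagios check and writes its exit code to a .prom file (the check command, metric name, and directory below are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: wrap an existing Nagios/NRPE check and write its exit code in
node_exporter textfile-collector format.

CHECK_CMD and TEXTFILE_DIR are placeholders; the directory must match the
--collector.textfile.directory the node_exporter is started with.
"""
import os
import subprocess
import tempfile

CHECK_NAME = "check_example"  # placeholder
CHECK_CMD = ["/usr/lib64/nagios/plugins/check_load", "-w", "5,4,3", "-c", "10,8,6"]
TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # placeholder

def main():
    result = subprocess.run(CHECK_CMD, capture_output=True, text=True, check=False)
    # Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    body = (
        f"# HELP nrpe_check_state Exit code of {CHECK_NAME}\n"
        f"# TYPE nrpe_check_state gauge\n"
        f'nrpe_check_state{{check="{CHECK_NAME}"}} {result.returncode}\n'
    )
    # Write atomically so node_exporter never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, os.path.join(TEXTFILE_DIR, f"{CHECK_NAME}.prom"))

if __name__ == "__main__":
    main()
```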
@aabaris what progress have we made? Are there any blockers?
@joachimweyl still working on it in my dev environment. Not a lot of progress, because I've been working on the higher-priority task of adding GPU nodes to OpenStack.
@aabaris letting you know this topic is back on the Wed meeting agenda.
@aabaris is work on this continuing while we get a new Metrics cluster up and running or is this awaiting that work to move forward?
I've taken a detour from this work in light of what I've learned about observability. I feel that we are not making the most of the metrics that are already available.
I would prefer if we could take the time to refine the plan and approach before turning this into a formal, project-manageable list of bullet points, but I'll share what I am pursuing:
1) visibility into when components are no longer reporting, and surfacing that as events (whether because observability is failing or because a component goes offline). I am finding gaps in the observability data that get smoothed over with time. I'm also investigating the "up" metric, which I suspect is not propagating the same way in observability as it does in the regular OpenShift monitoring (I need to dig in further to speak to this accurately).
2) making a view of alerts, separate from notifications: https://github.com/OCP-on-NERC/nerc-ocp-config/pull/259
3) comparing the data in the default Prometheus internal to each individual OpenShift cluster vs. what gets pushed out to observability (small steps: https://github.com/OCP-on-NERC/nerc-ocp-config/commit/bcfa3969a881e3ca6e4c19a710202b3f703d5de3); see the sketch after this list.
4) I would like to remove some of the existing alerts, or perhaps separate them into different tiers: alerts specific to workloads running on the cluster vs. alerts relevant to the overall health of the system. (For example, a process in some user pod using more than 80% CPU is actually good in my world; it means it's not blocked.)
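For item 3, here is a sketch of one way to diff the metric names visible in two endpoints, e.g. a cluster's in-cluster Prometheus vs. the central observability query endpoint (both URLs and the auth handling are placeholders):

```python
#!/usr/bin/env python3
"""Sketch: diff the metric names visible in two Prometheus-compatible
query endpoints. URLs and tokens come from placeholder env vars."""
import os
import requests

def metric_names(base_url: str, token: str = "") -> set:
    resp = requests.get(
        f"{base_url}/api/v1/label/__name__/values",
        headers={"Authorization": f"Bearer {token}"} if token else {},
        timeout=30,
    )
    resp.raise_for_status()
    return set(resp.json()["data"])

if __name__ == "__main__":
    local = metric_names(os.environ["LOCAL_PROM_URL"], os.environ.get("LOCAL_TOKEN", ""))
    central = metric_names(os.environ["OBS_PROM_URL"], os.environ.get("OBS_TOKEN", ""))
    print(f"{len(local)} metric names locally, {len(central)} centrally")
    for name in sorted(local - central):
        print(f"missing centrally: {name}")
```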
The metrics listed at the top of this issue will be worth pursuing, but I feel they won't be helpful to us until we are in a better position to take advantage of what should, at least in theory, already be working.
In order to monitor the cluster for hardware faults, we would like to be able to monitor and alert on the following events:
Could you please let us know if this information is already being collected or how we could pursue the visibility of these events?
It would be most helpful for us to have some guidance on how to implement additional monitoring and metrics collection and integrate it into the monitoring infrastructure you have built. We are always adding checks to monitor a production system based on issues we experience. We would like to be able to collect data both at the operating system level and from the service processors (iLO, BMC, iDRAC, depending on the hardware vendor).
Thank you!