operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

Suitability of issue tickets as labels for log anomaly detection on MOC/OCP #936

Closed: drbwa closed this issue 2 years ago

drbwa commented 3 years ago

@hemajv @4n4nd I have a number of questions regarding the issue tickets.

Consider an anomaly detector that learns to detect incidents/outages/failures based on an analysis of a stream of log messages being produced by a system. This anomaly detector learns in an unsupervised manner. However, we need some labels (sets of log messages that are indicative of specific issues) to tune and test the anomaly detector.
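As a rough illustration of that tuning step, here is a minimal sketch (all names invented, not tied to any particular detector) of how a handful of labeled incident windows could be used to pick an alerting threshold for an otherwise unsupervised anomaly scorer:

```python
# Hypothetical sketch: tune the alerting threshold of an unsupervised
# log-anomaly scorer using a few labeled incident windows.

def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest F1 on labeled windows.

    scores -- anomaly score per log window (higher = more anomalous)
    labels -- 1 if the window overlaps a known incident, else 0
    """
    def f1(th):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y == 1)
        if tp == 0:
            return 0.0
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)
    return max(candidates, key=f1)

scores = [0.1, 0.6, 0.9, 0.85, 0.15]
labels = [0, 0, 1, 1, 0]
print(best_threshold(scores, labels, [0.3, 0.5, 0.8]))  # 0.8
```

This is exactly why the labels matter: without a set of log windows known to correspond to real incidents, there is no principled way to choose the threshold.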

The overarching questions I would like to try and clarify are:

  1. Are the tickets currently generated indicative of a variety of issues that can occur in MOC/OCP?
  2. Can the issues for which we generate tickets be expected to be visible in log messages?
  3. Does the representation of the issue tickets facilitate automated processing?

Let me try to confirm my understanding.

Are alerts configured based on data available in Prometheus? Are alerts based on metrics only (i.e. not on logs)?

Which components are being monitored and have alerts defined for them?

As far as I understand, the alerts currently configured will generate issue tickets for un/availability events. Is this correct? Do we currently alert on any other issues?

Do we have, or is there, some kind of fault model or a set of alert templates for monitoring an OCP deployment that we could use to expand the set of issues that tickets are created for?

What metadata do tickets make available? For example, I suppose that the outage start time is represented by the ticket creation time. How are we going to track the end time of an outage?

Which metadata fields are going to be generated automatically and which other fields do we expect to be added manually, possibly as free text? In other words, is there a 'schema' for the fields in an issue ticket?

Are all tickets going to be generated automatically or will we also have manually created tickets (e.g., based on user complaints)? What is the template for manually created tickets?

/cc @davidohana @eranra @ronenschafferibm

hemajv commented 3 years ago

Thank you for opening the issue @drbwa! :smiley: These are some great questions and I'm not sure if we have the answers for all of them yet, but here are my thoughts.

  1. Are the tickets currently generated indicative of a variety of issues that can occur in MOC/OCP?

The tickets currently generated are based on the availability alerts we have so far defined here: https://github.com/operate-first/apps/blob/master/odh/base/monitoring/overrides/prometheus-operator/overlays/alerts/prometheus-rules.yaml Right now, we have basic availability alerts which detect when any of the applications deployed on MOC go down, but we are planning to add more alerts as well.
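To make the "basic availability alert" idea concrete, here is an illustrative sketch (not the actual rule logic) of how an `up == 0 for 5m`-style alert condition could be evaluated over metric samples:

```python
# Illustrative only: an availability alert of the "up == 0 for 5m" kind
# fires once a target has been down for the full 'for' duration.
# Sample timestamps are in seconds.

def alert_fires(samples, down_for=300):
    """samples: list of (timestamp, up_value) pairs sorted by timestamp."""
    down_since = None
    for ts, up in samples:
        if up == 0:
            if down_since is None:
                down_since = ts  # start of the current downtime streak
            if ts - down_since >= down_for:
                return True
        else:
            down_since = None  # target recovered; reset the streak
    return False

# Down continuously from t=0: fires at t=300 (>= 300s of downtime).
print(alert_fires([(0, 0), (60, 0), (300, 0), (360, 0)]))  # True
# Brief recovery at t=60 resets the streak, so the alert never fires.
print(alert_fires([(0, 0), (60, 1), (120, 0)]))            # False
```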

  2. Can the issues for which we generate tickets be expected to be visible in log messages?

The issues are being created by the GitHub receiver, which does generate some logs such as:

*(Screenshot: sample log output from the GitHub receiver, 2021-02-01)*

  3. Does the representation of the issue tickets facilitate automated processing?

I think the following details may help answer this question:

Yes, currently alerts are configured only based on the metrics we have available in Prometheus.

As I pointed out, you can find our alerting rules defined here. The components we have alerts defined for so far are JupyterHub, ArgoCD, Grafana, Prometheus, and Observatorium.

Yes, currently we have defined only basic availability alerts, but we aim to define other alerts as well.

We currently do not have any such template, but other teams at Red Hat, such as the OSD and app-SRE teams, have some well-defined alerts for their monitoring that we are looking into and would like to incorporate.

Here is an example of what an issue looks like: https://github.com/operate-first/SRE/issues/49. There are some labels associated with the alert. For the timestamps, we will probably have to look at the logs from the GitHub receiver since they seem to have timestamps of when an alert was resolved.
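As a hypothetical illustration of mining those receiver logs for outage durations (the log format below is invented for illustration; the real github-receiver output would need to be checked):

```python
# Hypothetical sketch: derive an outage duration from receiver-style log
# lines. Both the log format and the alert name here are invented.
import re
from datetime import datetime

LOG = """\
2021-02-01T09:15:00Z creating issue for alert JupyterHubDown
2021-02-01T09:42:00Z closing issue for resolved alert JupyterHubDown
"""

def outage_seconds(log, alert):
    """Return seconds between the 'creating' and 'closing' lines for an alert."""
    times = {}
    for line in log.splitlines():
        m = re.match(r"(\S+) (creating|closing) issue for .*\b" + alert, line)
        if m:
            ts = datetime.fromisoformat(m.group(1).replace("Z", "+00:00"))
            times[m.group(2)] = ts
    return (times["closing"] - times["creating"]).total_seconds()

print(outage_seconds(LOG, "JupyterHubDown"))  # 1620.0 (27 minutes)
```

If the receiver logs reliably record both events, the outage end time does not need to live on the ticket itself, though putting it there would make the tickets self-contained.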

Currently, it seems that the GitHub receiver issue templates are static and are defined here: https://github.com/m-lab/alertmanager-github-receiver/blob/master/alerts/template.go#L54. If we want to provide additional fields, I believe we can send the changes upstream for review. I had a similar discussion with the upstream folks here.

We aim to have most tickets generated automatically, but as you pointed out things like user complaints will need to be created manually. I think we are planning to follow templates similar to what we have defined here for our support repo, but we haven't defined it yet.

I hope this helps and please do let us know if you have any other questions! @4n4nd @HumairAK feel free to add anything else I may have missed.

drbwa commented 3 years ago

Thank you for your answers @hemajv. Let me share some thoughts.

Regardless of what mechanism you choose to use (e.g., GH issues, Jira, ServiceNow), there are three things that I think will serve you well a bit further down the road.

These tickets will become a repository of useful information. Ideally, as they accumulate over time, they will allow us to build an understanding of what is going on in the system, identify where the hotspots are, and gather statistics on different SRE activities (e.g., MTTA, MTTD, MTTR).
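For example, once tickets carry machine-readable timestamps, statistics like MTTA and MTTR fall out of a simple aggregation (the field names below are assumptions, not an existing schema):

```python
# Illustrative only: aggregate SRE statistics over tickets with
# machine-readable timestamps (epoch seconds; field names invented).
from statistics import mean

tickets = [
    {"opened": 100, "acknowledged": 160, "resolved": 700},
    {"opened": 500, "acknowledged": 530, "resolved": 900},
]

# Mean time to acknowledge / mean time to repair.
mtta = mean(t["acknowledged"] - t["opened"] for t in tickets)
mttr = mean(t["resolved"] - t["opened"] for t in tickets)
print(mtta, mttr)  # 45.0 500.0
```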

The first point is much easier said than done, but it is something to strive for (and I would be happy to help figure out how to get there). You want to have a repository of tickets for important issues, indicative of problems that SREs managing a given component, or OCP in general, really care about (or at least be able to identify those tickets).

Second, whatever you can do to enable processing these tickets in an automated manner to extract useful information down the road will be very useful.

Related to this, tickets should make loads of useful metadata available (in my view). For example, source of alert (automated or human), cause for triggering alert, severity and impact of issue, source of issue, various durations (time to detect, time to acknowledge, time to mitigate, time to repair), lifecycle events (was the ticket reopened several times, was it opened and closed again automatically or by human touch), categories of likely root causes, and so on.
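To make that concrete, one possible shape for such a ticket record could look like the following; every field name here is a suggestion, not an agreed schema:

```python
# Suggested (not agreed) shape for an incident ticket's metadata.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentTicket:
    source: str                      # "automated" or "human"
    trigger: str                     # alert name or complaint summary
    severity: str
    component: str
    opened_at: float                 # epoch seconds
    acknowledged_at: Optional[float] = None
    resolved_at: Optional[float] = None
    reopen_count: int = 0            # lifecycle: how often it bounced
    closed_by: Optional[str] = None  # "auto" or a username
    root_cause_category: Optional[str] = None

    def time_to_repair(self):
        """Seconds from open to resolution, or None if still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at

t = IncidentTicket("automated", "JupyterHubDown", "critical", "jupyterhub",
                   opened_at=0.0, resolved_at=1620.0)
print(t.time_to_repair())  # 1620.0
```

The exact fields matter less than the principle: anything a future analysis might need (durations, lifecycle events, root-cause categories) should be structured data rather than free text.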

I know that the above is in part more philosophical than actionable, but I would be happy to help figure out how to get closer to a set of issue/incident tickets that can be mined for useful information.

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/operate-first/SRE/issues/50#issuecomment-991387853):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.