Have a metric that introspects why pods failed in the cluster

fridex commented 3 years ago

Is your feature request related to a problem? Please describe.

As Thoth operator, I would like to know why solver failed in the cluster - (e.g. if they failed due to OOM)

As Thoth operator, I would like to know why advisers failed in the cluster - (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about exit code returned by the corresponding container in a workflow.

We can sync how these components return the exit code and the semantics behind these exit codes.

Acceptance criteria

[ ] Explore the existing metrics, workflow metrics and document them for the pod failure based on the above case
[ ] Design the metrics to be included in metrics-exporter , to expose metrics for failure or error case
[ ] Design and implement dashboard pannel based on these metrics.

pacospace commented 3 years ago

Is your feature request related to a problem? Please describe.

As Thoth operator, I would like to know why solver failed in the cluster - (e.g. if they failed due to OOM)

As Thoth operator, I would like to know why advisers failed in the cluster - (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about exit code returned by the corresponding container in a workflow.

We can sync how these components return the exit code and the semantics behind these exit codes.

We have this metric already, it gives back the percentage of justifications with ERROR on the failed advisers, including when fails due to OOM or CPU exceeded, is it enough? Or do we need to look at the single exit codes? wdyt @fridex ?

pacospace commented 3 years ago

Screenshot from 2021-06-29 18-57-16

fridex commented 3 years ago

We have this metric already, it gives back the percentage of justifications with ERROR on the failed advisers, including when fails due to OOM or CPU exceeded, is it enough? Or do we need to look at the single exit codes? wdyt @fridex ?

Are these computed by reported based on documents stored on ceph?

pacospace commented 3 years ago

We have this metric already, it gives back the percentage of justifications with ERROR on the failed advisers, including when fails due to OOM or CPU exceeded, is it enough? Or do we need to look at the single exit codes? wdyt @fridex ?

Are these computed by reported based on documents stored on ceph?

Yes, analyze every morning for the day before by thoth reporter, we can make this analysis more often during the day to collect more data points. wdyt?

fridex commented 3 years ago

Yes, analyze every morning for the day before by thoth reporter, we can make this analysis more often during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already, it gives back the percentage of justifications with ERROR on the failed advisers, including when fails due to OOM or CPU exceeded, is it enough? Or do we need to look at the single exit codes? wdyt @fridex ?

So back to this one. An example to reason $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (https://github.com/thoth-station/thoth-application/issues/1766) - database queries expect platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):

          The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist           
                                                                LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1                                                                  
                                                                                                     ^                                                                                         

                                                                                [SQL: SELECT EXISTS (SELECT *                                                                                  
                                                                                       FROM depends_on                                                                                         
                                                                    WHERE depends_on.platform = %(platform_1)s) AS anon_1]                                                                     

                                                                 (Background on this error at: http://sqlalche.me/e/13/f405)

With metrics reported by the reporter, we will know about this issue one day later, not in real-time - that will not give us insights about the system - how the system works right now and what actions should be done to recover from the error state.

If the situation with an inconsistent system occurs accidentally again someday in the future, we should be alerted "recommender system is giving too many errors in adviser pods with these exit codes, system operator should have a look at it". That way, we will keep the system up and will make sure that if there is any misbehavior, the system operator should have a look at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing, having info about failed workflows (e.g. platform fails to bring a pod up) is another thing to consider in this case.

pacospace commented 3 years ago

Yes, analyze every morning for the day before by thoth reporter, we can make this analysis more often during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already, it gives back the percentage of justifications with ERROR on the failed advisers, including when fails due to OOM or CPU exceeded, is it enough? Or do we need to look at the single exit codes? wdyt @fridex ?

So back to this one. An example to reason $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):
          The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist           
                                                                LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1                                                                  
                                                                                                     ^                                                                                         

                                                                                [SQL: SELECT EXISTS (SELECT *                                                                                  
                                                                                       FROM depends_on                                                                                         
                                                                    WHERE depends_on.platform = %(platform_1)s) AS anon_1]                                                                     

                                                                 (Background on this error at: http://sqlalche.me/e/13/f405)  
With metrics reported by the reporter, we will know about this issue one day later, not in real-time - that will not give us insights about the system - how the system works right now and what actions should be done to recover from the error state.

If the situation with an inconsistent system occurs accidentally again someday in the future, we should be alerted "recommender system is giving too many errors in adviser pods with these exit codes, system operator should have a look at it". That way, we will keep the system up and will make sure that if there is any misbehavior, the system operator should have a look at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing, having info about failed workflows (e.g. platform fails to bring a pod up) is another thing to consider in this case.

I see your point, in that case what justification is reported by adviser? So we need to find a way to read exit code of the pods to be reported immediately (we only have the percentage of adviser failures every moment and then asynchronously we analyze the reason from the documents on Ceph)

Screenshot from 2021-06-30 10-17-58

here errors are decreasnig but workflows failures are increasing (ocp4-stage), while succeeded one are not changing much

We have another metrics on number of requests vs number of reports created on Ceph at the moment (also evaluated async once per day from Ceph analysis), in that case if they do not match for long time, something wrong is happening in the system (e.g. Kafka off (another metrics is available for that), database is off)

fridex commented 3 years ago

in that case what justification is reported by adviser?

There is no justification created as the pod errored. adviser reports the followin error information:

    "report": {
      "ERROR": "An error occurred, see logs for more info"
    }

We have another metrics on number of requests vs number of reports created on Ceph at the moment (also evaluated async once per day from Ceph analysis), in that case if they do not match for long time, something wrong is happening in the system (e.g. Kafka off (another metrics is available for that), database is off)

Yes, this metric discussed before on calls is not applicable to this case - in this case, the system produces documents, but does not satisfy user requests. The metric you brought introspects if system produces any documents (and should alert as well if not).

goern commented 3 years ago

/priority important-soon /remove-triage needs-information /triage accept

sesheta commented 3 years ago

@goern: The label(s) triage/accept cannot be applied, because the repository doesn't have them.

In response to [this](https://github.com/thoth-station/metrics-exporter/issues/725#issuecomment-871408669): >/priority important-soon >/remove-triage needs-information >/triage accept Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

goern commented 3 years ago

/triage accepted

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

fridex commented 3 years ago

/remove-lifecycle rotten

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/metrics-exporter/issues/725#issuecomment-907838463): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

goern commented 2 years ago

/priority important-longterm

codificat commented 2 years ago

/sig observability

VannTen commented 2 years ago

Potentially relevant metrics

From kube-state-metrics:

kube_pod_container_status_last_terminated_reason sample: kube_pod_container_status_last_terminated_reason{cluster="emea/balrog", container="acm-agent", endpoint="https-main", job="kube-state-metrics", namespace="open-cluster-management-agent-addon", pod="klusterlet-addon-workmgr-65c7c49798-z7jc2", prometheus="openshift-monitoring/k8s", reason="Error", service="kube-state-metrics"}

From argo workflow controller:

argo_worklow_error_count sample: argo_workflows_error_count{cause="CronWorkflowSubmissionError", cluster="emea/balrog", endpoint="metrics", field="workflow-controller-metrics-thoth-backend-prod.apps.balrog.aws.operate-first.cloud", instance="10.128.2.40:8080", job="workflow-controller-metrics", namespace="thoth-backend-prod", pod="workflow-controller-58dccdddb6-49cv8", prometheus="openshift-user-workload-monitoring/user-workload", service="workflow-controller-metrics"}

+ all the metrics documented at https://argoproj.github.io/argo-workflows/metrics/#default-controller-metrics, probably

Beyond that, custom workflow metrics (metrics defined in Workflow spec, from what I gather) looks relevant.

VannTen commented 2 years ago

relevant : https://github.com/kubernetes/kube-state-metrics/issues/1481 (the issue is only closed because it's old, not because it's refused).

VannTen commented 2 years ago

Some opinions. If we only need exit codes, I don't think the application is the right level for implementing:

The application can't access the exit codes, by definition.
It's universal for any kind of containers (process, in fact). IMO, this point to implementing the linked feature request in kube-state-metrics. -> And in fact I had not seen it, but there is a PR linked https://github.com/kubernetes/kube-state-metrics/pull/1752

Since exit codes (most of them) we can use them to map to any reason we like. However, if the number of possible reason is unbounded (or just > 126) we'll probably want to use another mechanism.

VannTen commented 2 years ago

(I'll unssagnim myself, I don't think we have a clear enough view of what we want to do yet with this) (and it was a little quiproquo on the sig call

VannTen commented 2 years ago

I think we should use the kube-state-metrics feature once the previously linked PR is merged.

Unless someone has a different opinion, I propose we keep this frozen until the PR is merged and subsequent release of kube-state-metrics.

VannTen commented 2 years ago

The kube-state-metrics got merged.

I'll keep an eye on this when they release a new version. /assign

VannTen commented 2 years ago

kube-state-metrics do releases something like every 2/4 months, from their history. Last one was 16 days ago, so it might take some time before a new one.

Do we have an idea of what the timeline is for:

kube-state-metrics new releases -> get in Openshift -> get on the clusters we use ? I don't have much visibility on this.

Also, if we decide to go that route(=using kube-state-metrics) (do we ?), we should update the issue acceptance critera.

Suggestion: Acceptance critera:

[ ] upgrade to (yet-to-be-released) version of kube-state-metrics on our environments.
[ ] Implement dashboard for historical data analysis using
[ ] Implements alerts for abnormal behavior detection.

Description:

Use kube_pod_container_status_last_terminated_exitcode from kube-state-metrics in conjuction with labels from argo-workflows as the main metric source for dashboard and alerts.

goern commented 2 years ago

sounds good to me. which of the parts if on op1st and which on us?

VannTen commented 2 years ago

They have the producer items (upgrade kube-state-metrics) we have the consumer (create the dasboard + alerts) ones (assuming those are handled as applications components in thoth-station)

goern commented 2 years ago

@VannTen did you open in issue to update kube-state-metrics?

VannTen commented 2 years ago

There isn't a release of kube-state-metrics with the merged PR yet, so I was thinking we should wait for it before opening an issue.

goern commented 2 years ago

ACK /remove-lifecycle frozen

VannTen commented 1 year ago

It looks like we should monitor https://github.com/openshift/cluster-monitoring-operator and/or https://github.com/openshift/kube-state-metrics .

I'll check the git history later to see if the exit_code PR is there, and in which release branch.

thoth-station / metrics-exporter

Have a metric that introspects why pods failed in the cluster #725