Open fridex opened 3 years ago
Is your feature request related to a problem? Please describe.
As a Thoth operator, I would like to know why solvers failed in the cluster (e.g. whether they failed due to OOM).
As a Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).
Describe the solution you'd like
Have a metric that exposes information about exit code returned by the corresponding container in a workflow.
We can sync how these components return the exit code and the semantics behind these exit codes.
We have this metric already: it gives the percentage of justifications with ERROR for failed advisers, including failures due to OOM or exceeded CPU. Is that enough, or do we need to look at the individual exit codes? wdyt @fridex ?
Are these computed by the reporter based on documents stored on Ceph?
Yes, thoth reporter analyzes them every morning for the previous day. We can run this analysis more often during the day to collect more data points. wdyt?
Daily sounds reasonable. 👍🏻
So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (https://github.com/thoth-station/thoth-application/issues/1766) - database queries expect a platform column, but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):
The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist
LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1
^
[SQL: SELECT EXISTS (SELECT *
FROM depends_on
WHERE depends_on.platform = %(platform_1)s) AS anon_1]
(Background on this error at: http://sqlalche.me/e/13/f405)
With metrics reported by the reporter, we will learn about this issue one day later, not in real time. That gives us no insight into how the system works right now or what actions should be taken to recover from the error state.
If the system accidentally ends up in an inconsistent state again in the future, we should be alerted: "recommender system is giving too many errors in adviser pods with these exit codes, system operator should have a look at it". That way we keep the system up and make sure that, on any misbehavior, the system operator looks at it immediately based on the alert (before users start to complain).
Inspecting exit codes is one thing, having info about failed workflows (e.g. platform fails to bring a pod up) is another thing to consider in this case.
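To make the alerting idea concrete, here is a minimal sketch in Python. It assumes we already have the exit codes of recent adviser containers available as a list of integers; the function name `adviser_alert` and the 50% threshold are hypothetical and not part of any Thoth component.

```python
from collections import Counter
from typing import Iterable, Optional


def adviser_alert(exit_codes: Iterable[int], max_failure_ratio: float = 0.5) -> Optional[str]:
    """Return an alert message if too many adviser runs failed, else None.

    Both the name and the default threshold are illustrative assumptions.
    """
    codes = list(exit_codes)
    if not codes:
        return None  # nothing ran, so nothing to alert on

    # Count non-zero exit codes, grouped by code.
    failures = Counter(code for code in codes if code != 0)
    ratio = sum(failures.values()) / len(codes)
    if ratio <= max_failure_ratio:
        return None

    breakdown = ", ".join(
        f"exit code {code}: {count}" for code, count in sorted(failures.items())
    )
    return f"adviser failure ratio {ratio:.0%} exceeds {max_failure_ratio:.0%} ({breakdown})"
```

In practice this logic would more likely live in an alerting rule than in application code; the sketch only shows the shape of the check (failure ratio plus a per-exit-code breakdown for the operator).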
I see your point. In that case, what justification is reported by adviser? So we need to find a way to read the exit codes of the pods so they are reported immediately (right now we only have the percentage of adviser failures at any given moment, and we analyze the reasons asynchronously from the documents on Ceph).
Here errors are decreasing but workflow failures are increasing (ocp4-stage), while succeeded ones are not changing much.
We have another metric at the moment: the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis). If they do not match for a long time, something is wrong in the system (e.g. Kafka is off (another metric is available for that), or the database is off).
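As a rough illustration of the "do not match for a long time" check, the sketch below assumes we have a series of (requests, reports) pairs, one per evaluation window, oldest first. The function name `persistent_mismatch` and the window count are hypothetical.

```python
from typing import Sequence, Tuple


def persistent_mismatch(samples: Sequence[Tuple[int, int]], min_consecutive: int = 3) -> bool:
    """Return True if requests exceeded reports in each of the last
    `min_consecutive` samples.

    Each sample is a (requests, reports) pair; the layout is an assumption
    made for this sketch.
    """
    if len(samples) < min_consecutive:
        return False  # not enough history to call the mismatch persistent
    recent = samples[-min_consecutive:]
    return all(requests > reports for requests, reports in recent)
```

Requiring several consecutive mismatches avoids flagging requests that are simply still in flight when a window is evaluated.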
in that case what justification is reported by adviser?
There is no justification created as the pod errored. adviser reports the following error information:
"report": {
"ERROR": "An error occurred, see logs for more info"
}
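For reference, a small sketch of how a consumer could detect this case in a stored document. It assumes the document is JSON with a top-level "report" object, as in the fragment above; the function name `adviser_error` is hypothetical, and the real document layout may contain more than shown here.

```python
import json
from typing import Optional


def adviser_error(document: str) -> Optional[str]:
    """Return the generic ERROR message from an adviser document, or None.

    Assumes a top-level "report" object, as in the fragment shown in this
    thread; anything else about the document layout is not relied upon.
    """
    data = json.loads(document)
    report = data.get("report") or {}
    return report.get("ERROR")
```

Note that the message is generic ("see logs for more info"), so the document alone does not tell us why the pod failed, which is the argument for looking at exit codes.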
We have another metric at the moment: the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis). If they do not match for a long time, something is wrong in the system (e.g. Kafka is off (another metric is available for that), or the database is off).
Yes, this metric, discussed before on calls, is not applicable to this case - here the system produces documents but does not satisfy user requests. The metric you brought up checks whether the system produces any documents (and should alert as well if not).
/priority important-soon /remove-triage needs-information /triage accept
@goern: The label(s) triage/accept cannot be applied, because the repository doesn't have them.
/triage accepted
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@sesheta: Closing this issue.
/priority important-longterm
/sig observability
Potentially relevant metrics
From kube-state-metrics:
kube_pod_container_status_last_terminated_reason
sample: kube_pod_container_status_last_terminated_reason{cluster="emea/balrog", container="acm-agent", endpoint="https-main", job="kube-state-metrics", namespace="open-cluster-management-agent-addon", pod="klusterlet-addon-workmgr-65c7c49798-z7jc2", prometheus="openshift-monitoring/k8s", reason="Error", service="kube-state-metrics"}
From argo workflow controller:
argo_workflows_error_count
sample: argo_workflows_error_count{cause="CronWorkflowSubmissionError", cluster="emea/balrog", endpoint="metrics", field="workflow-controller-metrics-thoth-backend-prod.apps.balrog.aws.operate-first.cloud", instance="10.128.2.40:8080", job="workflow-controller-metrics", namespace="thoth-backend-prod", pod="workflow-controller-58dccdddb6-49cv8", prometheus="openshift-user-workload-monitoring/user-workload", service="workflow-controller-metrics"}
Plus, probably, all the metrics documented at https://argoproj.github.io/argo-workflows/metrics/#default-controller-metrics
Beyond that, custom workflow metrics (metrics defined in the Workflow spec, from what I gather) look relevant.
Relevant: https://github.com/kubernetes/kube-state-metrics/issues/1481 (the issue is only closed because it's old, not because it was refused).
Some opinions. If we only need exit codes, I don't think the application is the right level to implement this.
Since most exit codes are free for us to assign, we can map them to any reason we like. However, if the number of possible reasons is unbounded (or just > 126), we'll probably want to use another mechanism.
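A minimal sketch of such a mapping in Python. The enum `AdviserExit` and the specific code values are hypothetical and would have to be agreed on between components. The handling of codes above 125 follows the usual shell convention (126 and 127 for commands that cannot be run or found, 128 + N for termination by signal N, e.g. 137 for SIGKILL, which is what an OOM-killed container typically shows).

```python
from enum import IntEnum


class AdviserExit(IntEnum):
    """Hypothetical application-level exit codes; the values are illustrative."""

    SUCCESS = 0
    UNHANDLED_ERROR = 1
    INVALID_USER_INPUT = 3
    DATABASE_INCONSISTENT = 4


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short reason string."""
    if code > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        return f"killed by signal {code - 128}"
    if code >= 126:
        return "reserved by shell convention"
    try:
        return AdviserExit(code).name.lower()
    except ValueError:
        return "unknown"
```

This keeps the application-defined range small and bounded, which is the limitation mentioned above: once the set of reasons outgrows the available codes, another mechanism is needed.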
(I'll unassign myself: I don't think we have a clear enough view yet of what we want to do with this, and there was a bit of a misunderstanding on the SIG call.)
I think we should use the kube-state-metrics feature once the previously linked PR is merged.
Unless someone has a different opinion, I propose we keep this frozen until the PR is merged and a subsequent release of kube-state-metrics is out.
The kube-state-metrics PR got merged.
I'll keep an eye on this when they release a new version. /assign
Judging from their history, kube-state-metrics releases roughly every 2 to 4 months. The last release was 16 days ago, so it might take some time before a new one.
Do we have an idea of what the timeline is for:
new kube-state-metrics release -> lands in OpenShift -> lands on the clusters we use? I don't have much visibility on this.
Also, if we decide to go that route (= using kube-state-metrics) (do we?), we should update the issue acceptance criteria.
Suggestion: Acceptance criteria:
Description:
Use kube_pod_container_status_last_terminated_exitcode from kube-state-metrics in conjunction with labels from argo-workflows as the main metric source for dashboards and alerts.
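To illustrate what consuming this metric could look like, here is a sketch that counts exit codes per container from Prometheus text exposition output. It assumes that the sample value of kube_pod_container_status_last_terminated_exitcode is the exit code itself and that a "container" label is present; the function name and the simplified regex-based parsing (no escaped quotes in label values) are assumptions made for this sketch.

```python
import re
from collections import Counter

METRIC = "kube_pod_container_status_last_terminated_exitcode"

# Simplified patterns for 'name{labels} value' lines; escaped quotes in
# label values are not handled.
SAMPLE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def exit_codes_by_container(exposition: str) -> Counter:
    """Count (container, exit code) pairs found in exposition-format text."""
    counts: Counter = Counter()
    for line in exposition.splitlines():
        match = SAMPLE.match(line.strip())
        if not match or match.group("name") != METRIC:
            continue  # comments, blank lines, and other metrics are skipped
        labels = dict(LABEL.findall(match.group("labels")))
        # Assumption: the sample value carries the exit code.
        code = int(float(match.group("value")))
        counts[(labels.get("container", ""), code)] += 1
    return counts
```

For dashboards and alerts the same aggregation would normally be expressed as a query against Prometheus rather than in Python; the sketch is only meant to show which labels and values we would rely on.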
Sounds good to me. Which of the parts is on op1st and which on us?
They have the producer items (upgrade kube-state-metrics), we have the consumer ones (create the dashboard + alerts), assuming those are handled as application components in thoth-station.
@VannTen did you open an issue to update kube-state-metrics?
There isn't a release of kube-state-metrics with the merged PR yet, so I was thinking we should wait for it before opening an issue.
ACK /remove-lifecycle frozen
It looks like we should monitor https://github.com/openshift/cluster-monitoring-operator and/or https://github.com/openshift/kube-state-metrics .
I'll check the git history later to see if the exit_code PR is there, and in which release branch.