integr8ly / application-monitoring-operator

Operator for installing the Application Monitoring Stack on OpenShift (Prometheus, AlertManager, Grafana)
Apache License 2.0

Application monitoring operator fails to create Prometheus & Grafana instances #24

Closed jamesnetherton closed 5 years ago

jamesnetherton commented 5 years ago

Strange problem encountered when running through the installation instructions against a local Minishift VM.

The application-monitoring-operator deployment is created ok, but the logs contain messages like 'error creating resource: no matches for kind Prometheus'.

{"level":"info","ts":1549972257.0360773,"logger":"controller_applicationmonitoring","caller":"applicationmonitoring/applicationmonitoring_controller.go:154","msg":"Phase: Create Prometheus CRs"}
{"level":"info","ts":1549972257.0986924,"logger":"controller_applicationmonitoring","caller":"applicationmonitoring/applicationmonitoring_controller.go:158","msg":"Error in CreatePrometheusCRs, resourceName=prometheus : err=error creating resource: no matches for kind \"Prometheus\" in version \"monitoring.coreos.com/v1\""}
{"level":"error","ts":1549972257.0987484,"logger":"kubebuilder.controller","caller":"controller/controller.go:209","msg":"Reconciler error","Controller":"applicationmonitoring-controller","Request":"application-monitoring/example-applicationmonitoring","error":"error creating resource: no matches for kind \"Prometheus\" in version \"monitoring.coreos.com/v1\"","errorVerbose":"no matches for kind \"Prometheus\" in version \"monitoring.coreos.com/v1\"\nerror creating resource\ngithub.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring.

I verified that the relevant CRDs are installed, but the operator seems stuck on this error and does not proceed to create the prometheus-operator or grafana-operator deployments.

Scaling application-monitoring-operator to 0 and back to 1 seems to fix the issue.

minishift: v1.28.0+48e89ed
OpenShift: v3.11.0+d0c29df-98

david-martin commented 5 years ago

I wonder if this is a timing issue where the application-monitoring-operator has backed off in its reconciliation loop because the CRDs weren't there, i.e. it was failing every loop, so it backs off a little longer each time.

Then eventually the CRDs were created (by the prometheus-operator), but the application-monitoring-operator was still in a backoff. In this case, the reconciliation would eventually succeed and the Prometheus CR would be created once the backoff time elapsed (could be many minutes?).

Either way, this seems a likely scenario to me, particularly on install in a fresh cluster where the images may need to be pulled, therefore causing a delay in startup.

@pb82 Would appreciate your thoughts on this

pb82 commented 5 years ago

@david-martin Yes I think this could be what's happening here. We first deploy the prometheus operator (see here) then continue with the Prometheus CR (see here).

There is probably not enough time between those steps to pull the prometheus operator image, deploy and start it. So the monitoring operator will run into the situation where it tries to deploy the CR but the CRD has not yet been created, causing longer and longer backoffs.

The solution might be to add a phase where we wait for the prometheus operator to finish deployment before we continue.

robshelly commented 5 years ago

Seeing this issue too. It occurs consistently when installing on a new cluster.

david-martin commented 5 years ago

@robshelly are you happy with changes in the linked PR for your use case?

abkieling commented 5 years ago

I don't think this issue is fixed. The installation only worked on the second try.

pb82 commented 5 years ago

@alexkieling Did you see the same error (where the custom resource definition was not available) or was it a different error? If it worked on the second try, it seems likely to be the same one. If it happens again, could you copy the logs and paste them here?

abkieling commented 5 years ago

I see the following error in the application-monitoring-operator logs:

{"level":"info","ts":1560856957.3430817,"logger":"controller_applicationmonitoring","caller":"applicationmonitoring/applicationmonitoring_controller.go:157","msg":"Error in InstallPrometheusOperator, resourceName=prometheus-operator-service-account : err=error creating resource: serviceaccounts \"prometheus-operator\" is forbidden: cannot set blockOwnerDeletion in this case because cannot find RESTMapping for APIVersion applicationmonitoring.integreatly.org/v1alpha1 Kind ApplicationMonitoring: no matches for kind \"ApplicationMonitoring\" in version \"applicationmonitoring.integreatly.org/v1alpha1\""}