integr8ly / application-monitoring-operator

Operator for installing the Application Monitoring Stack on OpenShift (Prometheus, AlertManager, Grafana)
Apache License 2.0

wait for operator before deploying crs #35

Closed pb82 closed 5 years ago

pb82 commented 5 years ago

Wait for the prometheus operator to be ready (and finish deployment of its CRDs) before continuing to install Prometheus and Alertmanager.

Fixes https://github.com/integr8ly/application-monitoring-operator/issues/24

Verification steps:

  1. Use a fresh cluster where the images have not been pulled yet
  2. Use a namespace called application-monitoring or change the namespace here: https://github.com/integr8ly/application-monitoring-operator/blob/master/deploy/roles/prometheus-operator-clusterrole_binding.yaml#L12
  3. Run the operator locally (or deploy it) against this namespace and create the ApplicationMonitoring CR: oc create -f deploy/examples/ApplicationMonitoring.yaml
  4. None of the Operators should fail and the stack should be deployed
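
For anyone skimming, here is a rough sketch of the kind of wait the description refers to: blocking until the Prometheus operator's CRDs report the Established condition before the Prometheus/Alertmanager resources are created. This is not the PR's actual code; the helper name, CRD name, and polling intervals are illustrative, and it assumes a controller-runtime client whose scheme includes the apiextensions v1beta1 types.

import (
	"context"
	"time"

	apiextensionsv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCRDEstablished (illustrative name) polls until the named CRD exists and
// reports the Established condition, i.e. the prometheus operator has finished
// registering it and instances of that CRD can safely be created.
func waitForCRDEstablished(c client.Client, crdName string) error {
	return wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
		crd := &apiextensionsv1beta1.CustomResourceDefinition{}
		if err := c.Get(context.TODO(), types.NamespacedName{Name: crdName}, crd); err != nil {
			// CRD not created yet (or transient error): keep polling rather than fail.
			return false, nil
		}
		for _, cond := range crd.Status.Conditions {
			if cond.Type == apiextensionsv1beta1.Established && cond.Status == apiextensionsv1beta1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}

// Example usage before creating the Prometheus/Alertmanager CRs:
//   if err := waitForCRDEstablished(r.client, "prometheuses.monitoring.coreos.com"); err != nil { ... }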
david-martin commented 5 years ago

Using minishift locally, I'm hitting some errors in the application-monitoring-operator logs. https://gist.github.com/david-martin/287e65f5b1eede02f29732cb272ea29a ~I'll look into this more, but it seems like a permission issue~ Looks like an issue with the host:

Error getting host from route: host value empty
github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring.(*ReconcileApplicationMonitoring).GetHostFromRoute
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring/applicationmonitoring_controller.go:293
github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring.(*ReconcileApplicationMonitoring).CreatePrometheusCRs
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring/applicationmonitoring_controller.go:166
github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring.(*ReconcileApplicationMonitoring).Reconcile
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/pkg/controller/applicationmonitoring/applicationmonitoring_controller.go:126
github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:207
github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
runtime.goexit
    /home/travis/.gimme/versions/go1.10.8.linux.amd64/src/runtime/asm_amd64.s:2361

stacktrace:
github.com/integr8ly/application-monitoring-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:209
github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until
    /home/travis/gopath/src/github.com/integr8ly/application-monitoring-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88

What's unusual is that the route exists, and it has a host:

oc get route prometheus-route --template '{{.spec.host}}'
prometheus-route-application-monitoring.192.168.64.2.nip.io

From looking at the code at the 0.0.8 tag, this doesn't make sense: there would be a different error if the route or host didn't exist, or if it was looking in the wrong namespace. https://github.com/integr8ly/application-monitoring-operator/blob/0.0.8/pkg/controller/applicationmonitoring/applicationmonitoring_controller.go#L293

To rule out permissions, I can get the host from the route as the service account:

oc get route prometheus-route --as='system:serviceaccount:application-monitoring:application-monitoring-operator' --template '{{.spec.host}}'
prometheus-route-application-monitoring.192.168.64.2.nip.io
pb82 commented 5 years ago

Hmm, we rely on OpenShift to assign a host to the route. I wonder if this could be another timing issue where we try to get the host value from the route before OpenShift had a chance to assign it.

david-martin commented 5 years ago

try to get the host value from the route before OpenShift had a chance to assign it

It doesn't seem like that, as the host is definitely set, and the error happens repeatedly thereafter. I restarted the pod to rule out some in-memory state or caching, but to no avail. I need to do some more thinking on this one as it doesn't make sense.

grdryn commented 5 years ago

I see that route host issue on master (I haven't deployed this PR).

grdryn commented 5 years ago

I might try to check if more logging here would give more info: https://github.com/integr8ly/application-monitoring-operator/blob/master/pkg/controller/applicationmonitoring/applicationmonitoring_controller.go#L288..L290

Alternatively, maybe using github.com/openshift/api/route/v1 rather than the current more loosely-typed implementation might help?
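
To illustrate, a sketch of what the more strongly-typed approach could look like (not actual code from this repo; the function name is made up, and it assumes the github.com/openshift/api/route/v1 types have been added to the controller-runtime client's scheme):

import (
	"context"
	"fmt"

	routev1 "github.com/openshift/api/route/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getHostFromRoute (illustrative) reads the Route back from the API server using the
// typed route/v1 struct and returns the router-assigned host.
func getHostFromRoute(c client.Client, namespace, name string) (string, error) {
	route := &routev1.Route{}
	if err := c.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: name}, route); err != nil {
		return "", err
	}
	if route.Spec.Host == "" {
		return "", fmt.Errorf("host value empty")
	}
	return route.Spec.Host, nil
}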

grdryn commented 5 years ago
{"level":"info","ts":1557919806.6345694,"logger":"controller_applicationmonitoring","caller":"applicationmonitoring/applicationmonitoring_controller.go:291","msg":"Spec=map[port:map[targetPort:web] tls:map[termination:Reencrypt] to:map[kind:Service name:prometheus-service] wildcardPolicy:None]"}

It has everything but the host. :thinking:

Edit: I believe the problem is that the Route is never actually read back from OpenShift; the object logged above is just the one instantiated from the template, which we pass to r.client.Create but never read back once OpenShift has added the host field.

Did y'all previously add the host field value directly in the template for this to work?

I think I'll have a PR shortly to fix this route host issue.
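
In other words, the route has to be re-read from the API server after r.client.Create so the router-assigned host becomes visible. A rough sketch of that general pattern (not necessarily what the upcoming PR does; the helper name, typed route/v1 usage, and polling intervals are placeholders):

import (
	"context"
	"time"

	routev1 "github.com/openshift/api/route/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createRouteAndWaitForHost (illustrative) creates the Route built from the template
// and then polls the API server until OpenShift has populated spec.host. The local
// template object on its own never contains the host, which is what the log above shows.
func createRouteAndWaitForHost(c client.Client, route *routev1.Route) (string, error) {
	if err := c.Create(context.TODO(), route); err != nil && !errors.IsAlreadyExists(err) {
		return "", err
	}
	key := types.NamespacedName{Namespace: route.Namespace, Name: route.Name}
	fetched := &routev1.Route{}
	err := wait.Poll(2*time.Second, 1*time.Minute, func() (bool, error) {
		if err := c.Get(context.TODO(), key, fetched); err != nil {
			// Not readable yet (or transient error): keep polling.
			return false, nil
		}
		return fetched.Spec.Host != "", nil
	})
	return fetched.Spec.Host, err
}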

grdryn commented 5 years ago

I think I'll have a PR shortly to fix this route host issue.

See #38

pb82 commented 5 years ago

@grdryn We don't set the host value in the template, we let OpenShift do that. It's still a mystery to me why this works sometimes and sometimes doesn't. But I'll give your PR a try.

pb82 commented 5 years ago

@david-martin could you retry this please? @grdryn 's changes are applied now and this should fix the problem reading back the host value.

pb82 commented 5 years ago

@david-martin When/how did the image registry that Grafana and Prometheus are pulled from change? The pull secret is something we recently had to deal with for Fuse and it's handled in the installer. We likely need to do the same here. But I agree that we should have general instructions for how to set this up independent of the installer.

grdryn commented 5 years ago

I ran the operator locally and used v0.8.0 of the operator-sdk to avoid this issue (operator-framework/operator-sdk#1329), so you may want to build with that before pushing a new release of the image to quay.

Do you guys have any process or guidelines around moving to a new version of the operator-sdk?

grdryn commented 5 years ago

I also hit that issue with pulling from registry.redhat.io, but I just replaced the image references with the upstream ones from quay.io/docker.io.

david-martin commented 5 years ago

When / how did the image registry change from where Grafana and Prometheus are pulled

@pb82

pb82 commented 5 years ago

@david-martin Merging this and dealing with the README update for the pull secret and the SDK update in another PR.