grafana / grafana-operator

An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
https://grafana.github.io/grafana-operator/
Apache License 2.0

[Bug] Can't create Dashboards #404

Closed alrf closed 3 years ago

alrf commented 3 years ago

Describe the bug
I can't create dashboards after updating to Operator v3.10.0.

The errors are:

{"level":"error","ts":1620139043.2435157,"logger":"controller_grafanadashboard","msg":"failed to get or create namespace folder Dev for dashboard ","error":"Get \"http://admin:***@grafana-service:3000/api/folders\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/hstefans/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).reconcileDashboards\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:253\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).Reconcile\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:136\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func1\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:86\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func2\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:92"}
{"level":"error","ts":1620139043.2436557,"logger":"controller_grafanadashboard","msg":"error updating dashboard","error":"Get \"http://admin:***@grafana-service:3000/api/folders\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/hstefans/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).manageError\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:366\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).reconcileDashboards\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:254\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).Reconcile\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:136\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func1\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:86\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func2\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:92"}
alrf commented 3 years ago

I've just additionally checked: everything works fine with v3.9.0.

pb82 commented 3 years ago

@alrf Are you using the preferService option in the Grafana CR? There was a small change to it in 3.10.0: the operator now relies on the service name instead of the IP address.

alrf commented 3 years ago

@pb82 yes, I have this setting in the Grafana CR:

  client:
    preferService: True
pb82 commented 3 years ago

@alrf As a workaround, can you set preferService to false and let the Operator create a Route/Ingress? Does that work? I'll try to reproduce the issue you're having. Can you curl the hostname of the Grafana service from the operator pod? If not, it could be a networking issue on your cluster.

alrf commented 3 years ago

@pb82 I've tried preferService: False with Operator v3.10.0 and v3.10.1, the same result.

{"level":"error","ts":1621512787.8417025,"logger":"controller_grafanadashboard","msg":"error updating dashboard","error":"error creating folder, expected status 200 but got 403","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/briangallagher/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).manageError\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:366\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).reconcileDashboards\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:254\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).Reconcile\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:136\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func1\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:86\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func2\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:92"}
{"level":"error","ts":1621512787.855196,"logger":"controller_grafanadashboard","msg":"failed to get or create namespace folder Dev for dashboard ","error":"error creating folder, expected status 200 but got 403","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/Users/briangallagher/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).reconcileDashboards\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:253\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.(*ReconcileGrafanaDashboard).Reconcile\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:136\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func1\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:86\ngithub.com/integr8ly/grafana-operator/v3/pkg/controller/grafanadashboard.add.func2\n\tgrafana-operator/pkg/controller/grafanadashboard/dashboard_controller.go:92"}

Part of config:

  service:
    ports:
      - name: grafana-proxy
        port: 9091
        protocol: TCP
        targetPort: grafana-proxy
    annotations:
      service.alpha.openshift.io/serving-cert-secret-name: grafana-k8s-tls
  ingress:
    enabled: True
    targetPort: grafana-proxy
    termination: reencrypt
    hostname: grafana.replaced-with-my-domain.com
  client:
    preferService: False

I can't connect via curl from the operator pod to port 3000:

bash-4.4$ curl http://grafana-service.grafana-operator.svc.cluster.local:3000
^C

but can connect to port 9091:

bash-4.4$ curl https://grafana-service.grafana-operator.svc.cluster.local:9091 -sk
<!DOCTYPE html>
<html lang="en" charset="utf-8">
<head>
  <title>Log In</title>

Could it be because of the grafana-proxy usage? (The operator is deployed on OpenShift.)

P.S.: I also tried curl with preferService: True; the same result.

alrf commented 3 years ago

Hi @pb82 I've checked and compared v3.9.0 and v3.10.0 tags and found:

grafana-operator]$ git diff v3.9.0 v3.10.0 -- pkg/controller/grafana/grafana_controller.go
diff --git a/pkg/controller/grafana/grafana_controller.go b/pkg/controller/grafana/grafana_controller.go
index 42f58307..c81c030c 100644
--- a/pkg/controller/grafana/grafana_controller.go
+++ b/pkg/controller/grafana/grafana_controller.go
@@ -249,10 +249,7 @@ func (r *ReconcileGrafana) getGrafanaAdminUrl(cr *grafanav1alpha1.Grafana, state
        var servicePort = int32(model.GetGrafanaPort(cr))

        // Otherwise rely on the service
-       if state.GrafanaService != nil && state.GrafanaService.Spec.ClusterIP != "" && state.GrafanaService.Spec.ClusterIP != "None" {
-               return fmt.Sprintf("http://%v:%d", state.GrafanaService.Spec.ClusterIP,
-                       servicePort), nil
-       } else if state.GrafanaService != nil {
+       if state.GrafanaService != nil {
                return fmt.Sprintf("http://%v:%d", state.GrafanaService.Name,
                        servicePort), nil
        }

So, before it worked through the IP address (ClusterIP), and now it goes through GrafanaService.Name. But to reach a service in Kubernetes you need to use http://<service_name>.<namespace>.svc.<zone>:<port> (e.g. http://grafana-service.grafana-operator.svc.cluster.local:3000), not just the service name. I guess that could be the issue in this case.

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services
https://github.com/kubernetes/dns/blob/master/docs/specification.md

pb82 commented 3 years ago

@alrf Port 3000 should always be accessible without authentication, even if you use the OAuth proxy. Usually the OAuth proxy is set up by adding another port to the service and then changing the route to point to the proxy port. We have an example of that setup: https://github.com/integr8ly/grafana-operator/blob/master/deploy/examples/oauth/Grafana.yaml

Does your service.ports setup look similar? When using this example, the service should have the following ports:

spec:
  ports:
    - name: grafana
      protocol: TCP
      port: 3000
      targetPort: grafana-http
    - name: grafana-proxy
      protocol: TCP
      port: 9091
      targetPort: grafana-proxy
alrf commented 3 years ago

Does your service.ports setup look similar? When using this example, the service should have the following ports

yes, like it is in examples (https://github.com/integr8ly/grafana-operator/blob/master/deploy/examples/oauth/Grafana.yaml)

alrf commented 3 years ago

@pb82 any updates?

pb82 commented 3 years ago

@alrf there is a change incoming where we fix the service DNS name: #438

Once that's landed, I'll reach out and we can give it another try.

alrf commented 3 years ago

@pb82 in v3.9.0 I can reach grafana-service from operator pod by name:

bash-4.4$ curl -I http://grafana-service:3000
HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: text/html; charset=UTF-8
Expires: -1
Pragma: no-cache
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-Xss-Protection: 1; mode=block
Date: Wed, 23 Jun 2021 15:37:59 GMT

So maybe the DNS name is unrelated. Another option for v3.10 could be a try/catch-style fallback: try requesting grafana-service by name and, if that fails, fall back to the service IP address.

pb82 commented 3 years ago

@alrf can you please try 3.10.2? The issue should be fixed there.

alrf commented 3 years ago

@pb82 I see only v3.10.1 available (OpenShift Operator)

NissesSenap commented 3 years ago

https://operatorhub.io/operator/grafana-operator has release v3.10.2 now. You should be able to find it in OLM as well.

alrf commented 3 years ago

Screenshot: http://i.imgur.com/Kuh8DL9.png

I've tried on a few OpenShift clusters; everywhere v3.10.1 is the latest available version.

alrf commented 3 years ago

@pb82 OK, now v3.10.2 is available. I've upgraded operator, but the issue stayed the same:

Get "http://admin:***@grafana-service:3000/api/folders": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

There are a bunch of these errors in the cluster console, and I can't reach grafana-service from the operator pod by name:

$ oc describe pod/grafana-operator-d585f98c6-85t94 -n grafana-operator | grep '3.10'
              containerImage: quay.io/integreatly/grafana-operator:v3.10.2
    Image:         quay.io/integreatly/grafana-operator:v3.10.2
      OPERATOR_CONDITION_NAME:  grafana-operator.v3.10.2
  Normal  Pulling         48s   kubelet            Pulling image "quay.io/integreatly/grafana-operator:v3.10.2"
  Normal  Pulled          39s   kubelet            Successfully pulled image "quay.io/integreatly/grafana-operator:v3.10.2" in 9.935066938s

$ oc exec -it pod/grafana-operator-d585f98c6-85t94 -n grafana-operator -- bash
bash-4.4$ curl -I http://grafana-service:3000
^C
bash-4.4$

As I said before, the DNS name is unrelated.

I can't reach the service even via its IP address:

$ oc exec -it pod/grafana-operator-d585f98c6-85t94 -n grafana-operator -- bash
bash-4.4$ curl -I http://172.30.247.230:3000
^C
bash-4.4$ curl -I http://172.30.247.230:9091
HTTP/1.0 400 Bad Request

bash-4.4$ 

Again, everything works with v3.9.0, no changes on cluster side.

alrf commented 3 years ago

@pb82 I found something interesting: the "default" Grafana deployment example (shipped by default with the Operator) works as expected - I can reach port 3000 from the operator pod in this case.

Here is my config for Grafana (with it, I can't reach port 3000):

apiVersion: v1
data:
  session_secret: <MY_SECRET_HERE>
kind: Secret
metadata:
  name: grafana-k8s-proxy
  namespace: grafana-operator
type: Opaque
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grafana-proxy
  namespace: grafana-operator
rules:
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
---
apiVersion: authorization.openshift.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-proxy
roleRef:
  name: grafana-proxy
subjects:
  - kind: ServiceAccount
    name: grafana-serviceaccount
    namespace: grafana-operator
userNames:
  - system:serviceaccount:grafana-operator:grafana-serviceaccount
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
  name: ocp-injected-certs
  namespace: grafana-operator
---
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana-oauth
  namespace: grafana-operator
spec:
  baseImage: <MY_IMAGE_REGISTRY>/grafana:7.3.7
  config:
    log:
      mode: "console"
      level: "warn"
    auth:
      disable_login_form: False
      disable_signout_menu: True
    auth.basic:
      enabled: True
    auth.anonymous:
      enabled: True
      org_role: Admin
  deployment:
    securityContext:
      fsGroup: 472
    hostNetwork: true
    nodeSelector:
      node-role.kubernetes.io/infra: ""
    tolerations:
      - key: "infra"
        operator: "Exists"
        effect: "NoExecute"
  containers:
    - name: grafana-proxy
      args:
        - '-provider=openshift'
        - '-pass-basic-auth=false'
        - '-https-address=:9091'
        - '-http-address='
        - '-email-domain=*'
        - '-upstream=http://localhost:3000'
        - '-openshift-sar={"resource": "namespaces", "verb": "get"}'
        - '-openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}'
        - '-tls-cert=/etc/tls/private/tls.crt'
        - '-tls-key=/etc/tls/private/tls.key'
        - '-client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token'
        - '-cookie-secret-file=/etc/proxy/secrets/session_secret'
        - '-openshift-service-account=grafana-serviceaccount'
        - '-openshift-ca=/etc/pki/tls/cert.pem'
        - '-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
        - '-openshift-ca=/etc/grafana-configmaps/ocp-injected-certs/ca-bundle.crt'
        - '-skip-auth-regex=^/metrics'
      image: 'quay.io/openshift/origin-oauth-proxy:4.6'
      ports:
        - containerPort: 9091
          name: grafana-proxy
      resources: {}
      volumeMounts:
        - mountPath: /etc/tls/private
          name: secret-grafana-k8s-tls
          readOnly: false
        - mountPath: /etc/proxy/secrets
          name: secret-grafana-k8s-proxy
          readOnly: false
  dataStorage:
    class: gp3
    accessModes:
      - ReadWriteOnce
    size: 10Gi
  secrets:
    - grafana-k8s-tls
    - grafana-k8s-proxy
  configMaps:
    - ocp-injected-certs
  service:
    ports:
      - name: grafana-proxy
        port: 9091
        protocol: TCP
        targetPort: grafana-proxy
    annotations:
      service.alpha.openshift.io/serving-cert-secret-name: grafana-k8s-tls
  ingress:
    enabled: True
    targetPort: grafana-proxy
    termination: reencrypt
    hostname: <MY_GRAFANA_URL_HERE>
  client:
    preferService: True
  serviceAccount:
    annotations:
      serviceaccounts.openshift.io/oauth-redirectreference.primary: '{"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"grafana-route"}}'
  dashboardLabelSelector:
    - matchExpressions:
        - { key: "app", operator: In, values: ['grafana'] }

P.S.: I've tried both preferService: True and preferService: False.

alrf commented 3 years ago

Finally, I found the issue: it was the Security Group for the OpenShift worker nodes. Port 3000 has to be open when hostNetwork: true is used. I don't know why it worked before with Operator v3.9.0 (probably because of the changes introduced for preferService in v3.10.x). Thank you for your help.