grafana / grafana-operator

An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
https://grafana.github.io/grafana-operator/
Apache License 2.0

[doc] Grafana deployment with a Persistent Volume #1439

Closed: harrythecode closed this issue 5 months ago

harrythecode commented 7 months ago

What's the problem?

While trying the Persistent Volume example for grafana-operator, documented here: https://grafana.github.io/grafana-operator/docs/examples/persistent_volume/readme/, I encountered an error:

GF_PATHS_DATA='/var/lib/grafana' is not writable. You may have issues with file permissions, more information here: http://docs.grafana.org/installation/docker/#migrate-to-v51-or-later
mkdir: can't create directory '/var/lib/grafana/plugins': Permission denied

This seems to happen because the deployment config in the example, which is meant to override the original config, brings back the permission issue that was fixed in this issue: https://github.com/grafana/grafana-operator/issues/300. That fix relied on the following setting:

  deployment:
    securityContext:
      fsGroup: 472

I managed to resolve it by setting securityContext.fsGroup: 472 again, as shown below:

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  deployment:
    spec:
      template:
        spec:
+         securityContext:
+           fsGroup: 472
          containers:
            - name: grafana
              image: grafana/grafana:9.4.3

This issue is related to https://github.com/grafana/grafana-operator/issues/1418#issuecomment-1953609940

What I want

Could we update the documentation example to include securityContext.fsGroup?

weisdd commented 7 months ago

Seems like you're confusing two different scenarios here.

Persistent volumes

Persistent Volumes come empty and writable. If you deploy the example you referred to as is, a new volume is provisioned through a PVC, the volume gets mounted to /var/lib/grafana, and it all works just fine.

The snippet (with fsGroup) you shared contradicts this, because it suggests you're not using persistent volumes.

Ephemeral volumes

When you deploy a basic example (quoted below), any changes in Grafana are not persistent (they're gone once the respective pod is gone). To make sure Grafana has enough permissions to store its data, an emptyDir volume is automatically mounted to /var/lib/grafana. Again, everything works just fine.

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: "console"
    auth:
      disable_login_form: "false"
    security:
      admin_user: root
      admin_password: secret
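
For reference, the ephemeral setup described above would correspond to roughly the following fragment of the generated pod spec. This is only a sketch, not output captured from a cluster: the volume name grafana-data is taken from the manifests quoted elsewhere in this thread, and the exact fields the operator generates are an assumption.

```yaml
# Hypothetical fragment of the pod spec generated for the basic example.
# Field values are assumptions, not captured output.
spec:
  containers:
    - name: grafana
      volumeMounts:
        - name: grafana-data          # volume name as used in this thread
          mountPath: /var/lib/grafana # Grafana's data path (GF_PATHS_DATA)
  volumes:
    - name: grafana-data
      emptyDir: {}                    # ephemeral: removed together with the pod
```

An emptyDir volume starts out empty for every new pod, so there are no pre-existing files whose ownership could conflict with the user Grafana runs as.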

What I think is happening

uid 472 is used in the default grafana image.

But grafana-operator automatically adds runAsNonRoot: true to the pod securityContext, which changes the default uid to something else. The exact id is likely to be different in your case.

Assuming you're using persistent volumes (unlike in the snippet you shared), my guess is this: you deployed Grafana with a modified securityContext (runAsNonRoot: false or something else, set directly or through mutating webhooks), Grafana created some files, and then you redeployed it with different settings. Grafana now fails to write to the pre-existing files because they were created with different permissions.

When you specify a custom fsGroup, Kubernetes changes ownership and permissions for files upon pod start (docs).
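
As a sketch of how that looks in a Grafana CR, the fragment below sets fsGroup at the pod level, using 472 (the value discussed in this thread). The fsGroupChangePolicy line is an addition that nobody in the thread mentions: it is a standard Kubernetes pod securityContext field, shown here only to illustrate that the ownership change can be limited.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
spec:
  deployment:
    spec:
      template:
        spec:
          securityContext:
            # Supplemental group for the pod's containers; supported volume
            # types are made group-owned by this GID when they are mounted.
            fsGroup: 472
            # Assumption (not from the thread): only change ownership when the
            # volume root does not already match. The default is "Always".
            fsGroupChangePolicy: "OnRootMismatch"
```

Whether the ownership change is actually applied depends on the volume type and, for CSI volumes, on the driver's fsGroupPolicy, so the same manifest may behave differently across storage backends.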

I don't think we need to update the example. If you redeploy grafana with a brand new volume using an up-to-date version of the operator, it should all just work.

caguiclajmg commented 6 months ago

I'm also seeing what @harrythecode is reporting. This time I made sure that I had no PVs/PVCs prior to deploying the Grafana object (and yes, I see a PV getting created, so I'm certain the persistentVolumeClaim option in the manifest is taking effect), so a volume with stale files and wrong permissions is unlikely to be the cause.

Currently running v5.6.3 of the operator.

weisdd commented 6 months ago

@caguiclajmg @harrythecode It'd be helpful if you could share more information around your environment:

Also, if it's something that can be reproduced in a local (kind, microk8s, ...) or cloud provider environment, then full instructions would be helpful.
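
A reproduction could start from a manifest along these lines. It is a sketch assembled from the fields quoted in this thread, with the fsGroup workaround deliberately left out; whether it matches the documentation example exactly is an assumption.

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  # PVC that the operator creates for this Grafana instance
  persistentVolumeClaim:
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
  deployment:
    spec:
      template:
        spec:
          # No securityContext/fsGroup here on purpose, so the reported
          # "not writable" error can surface if the environment triggers it.
          volumes:
            - name: grafana-data
              persistentVolumeClaim:
                claimName: grafana-pvc   # name as reported in this thread
```

Applying this to a fresh namespace with no pre-existing PVs or PVCs, then collecting the resulting Deployment, Pod and PVC manifests, would show what the final pod securityContext looks like in a given cluster.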

bavarian-ng commented 6 months ago

Hi,

I ran into the same issue just today when I tried to use a persistent volume claim for Grafana in a Kubernetes cluster (Amazon EKS). With the options from the example YAML, I got the same "missing file permissions" error. (I also started from scratch, with no existing PVCs and so on.) I played around a bit and also found this GitHub issue, which helped me in debugging.

When I add spec.deployment.spec.template.spec.securityContext.fsGroup: 10001, it seems to work. Maybe this helps in digging into the root cause of this issue. :)

I removed some environment-specific parts from the YAML I use, but with the following YAML it seems to run at the moment:

kind: Grafana
metadata:
  name: grafana
  namespace: monitoring
  labels:
    dashboards: "grafana"
spec:
  persistentVolumeClaim:
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
  config:
    log:
      mode: "console"
    [...]
  deployment:
    spec:
      template:
        spec:
          serviceAccountName: secrets-csi-sa-monitoring
          securityContext:
            fsGroup: 10001
          containers:
            - name: grafana
              securityContext:
                allowPrivilegeEscalation: true
                readOnlyRootFilesystem: false
              readinessProbe:
                failureThreshold: 3
              [...]
              image: grafana/grafana:10.4.0
               [...]
          volumes:
            - name: grafana-data
              persistentVolumeClaim:
                claimName: grafana-pvc
          [...]

weisdd commented 6 months ago

@bavarian-ng Thanks for reporting this. Unfortunately, sharing only the Grafana CR is not enough, as the final pod spec can be influenced by various webhooks. Please take a look at my comment above, which describes the resources that would give us a better understanding of what you're experiencing in your cluster.

bavarian-ng commented 6 months ago

@weisdd, sorry, here is the requested info:


- Full Deployment manifest (resulting deployment created by the operator):

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '1'
  creationTimestamp: '2024-03-11T14:41:22Z'
  generation: 1
  name: grafana-deployment
  namespace: monitoring
  ownerReferences:


- Full Pod manifest (resulting manifest deployed by the operator):

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: '2024-03-11T14:41:22Z'
  generateName: grafana-deployment-7bcdb6d464-
  labels:
    app: grafana
    pod-template-hash: 7bcdb6d464
  name: grafana-deployment-7bcdb6d464-pdgdt
  namespace: monitoring
  ownerReferences:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: 'yes'
    pv.kubernetes.io/bound-by-controller: 'yes'
    volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
    volume.kubernetes.io/selected-node: <NODE NAME REDACTED>
    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
  creationTimestamp: '2024-03-11T14:41:22Z'
  finalizers:
    - kubernetes.io/pvc-protection
  name: grafana-pvc
  namespace: monitoring
  ownerReferences:
    - apiVersion: grafana.integreatly.org/v1beta1
      kind: Grafana
      name: grafana
      uid: 67785905-eb4c-4922-a68a-fe4821aebfb0
  resourceVersion: '90847469'
  uid: 0a718f39-5eb2-4a23-9d29-b630be8539f6
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp2
  volumeMode: Filesystem
  volumeName: pvc-0a718f39-5eb2-4a23-9d29-b630be8539f6
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound

github-actions[bot] commented 5 months ago

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

algo7 commented 4 months ago

any update on this one?

AbelThorne commented 3 weeks ago

Was any answer provided after the relevant information was posted? I'm also facing the same issue on a brand-new Grafana deployment on GKE.