tektoncd / results

Long term storage of execution results.

Postgres STS fails to start - mkdir: cannot create directory ‘/bitnami/postgresql/data’ #522

Open kbristow opened 1 year ago

kbristow commented 1 year ago

Expected Behavior

Postgres created as part of the release manifests would start up successfully.

Actual Behavior

The Postgres pod does not become healthy. It fails and enters CrashLoopBackOff with the following error in the logs:

mkdir: cannot create directory ‘/bitnami/postgresql/data’: Permission denied

This appears similar to this issue: https://github.com/bitnami/charts/issues/1210

Investigating the recommendation here, I added an init container, shown below, which resolves the issue:

initContainers:
  - name: init-chmod-data
    image: docker.io/bitnami/bitnami-shell:11-debian-11-r130
    imagePullPolicy: "IfNotPresent"
    resources:
      limits: {}
      requests: {}
    command:
      - /bin/sh
      - -ec
      - |
        chown 1001:1001 /bitnami/postgresql
        mkdir -p /bitnami/postgresql/data
        chmod 700 /bitnami/postgresql/data
        find /bitnami/postgresql -mindepth 1 -maxdepth 1 -not -name "conf" -not -name ".snapshot" -not -name "lost+found" | \
          xargs -r chown -R 1001:1001
    securityContext:
      runAsUser: 0
    volumeMounts:
      - name: postgredb
        mountPath: /bitnami/postgresql

I took the above from the output of running the below (with slight modifications):

helm template my-release oci://registry-1.docker.io/bitnamicharts/postgresql --set volumePermissions.enabled=true

Note that I am using EKS with EBS volumes and the following storage class, in case that is useful:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer

Steps to Reproduce the Problem

  1. Install Tekton Results (v0.7.0) on EKS 1.23 with an EBS volume.
  2. Observe the error on the Postgres pod.

Additional Info

Server Version: v1.23.17-eks-c12679a
v0.38.4
xinnjie commented 1 year ago

Did you ever create a PVC to interact with Postgres manually? That could leave different file permissions behind, since the reclaim policy of your storage class is Retain.

If the data is only for testing, try the following (a kubectl sketch follows the list):

  1. Delete Postgres deployment (or entire Results deployment).
  2. Delete pv that Postgres used.
  3. Redeploy Postgres (or Results).
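A minimal sketch of those cleanup steps with kubectl. The namespace tekton-pipelines, StatefulSet name tekton-results-postgres, and PVC name postgredb-tekton-results-postgres-0 below are assumptions; substitute whatever names your install actually uses:

# resource names are assumptions; adjust to your install
kubectl -n tekton-pipelines delete statefulset tekton-results-postgres
kubectl -n tekton-pipelines delete pvc postgredb-tekton-results-postgres-0
# reclaimPolicy is Retain, so the released PV has to be removed explicitly
kubectl get pv                 # find the PV that was bound to the deleted PVC
kubectl delete pv <pv-name>
# then re-apply the Results release manifests you installed originally
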
kbristow commented 1 year ago

I have tried the above, but the same issue occurs. I am happy to use the fix described in my issue; I mainly wanted to raise this as a potential problem that other Tekton Results users may run into.

xinnjie commented 1 year ago

I'm a little curious what environment difference makes you hit this problem. Could you (a rough kubectl sketch of these steps follows the list):

  1. create a pod that mounts the PVC Postgres used,
  2. attach to the pod, and
  3. run ls -l /bitnami/postgresql to check who owns that directory?
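A rough sketch of those steps, again assuming the tekton-pipelines namespace and the PVC name postgredb-tekton-results-postgres-0 (both are guesses; adjust to your install):

# throwaway pod that mounts the Postgres PVC so you can inspect ownership
kubectl -n tekton-pipelines apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspect
spec:
  containers:
    - name: inspect
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /bitnami/postgresql
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: postgredb-tekton-results-postgres-0  # assumed PVC name
EOF
kubectl -n tekton-pipelines exec -it pvc-inspect -- ls -l /bitnami/postgresql
kubectl -n tekton-pipelines delete pod pvc-inspect      # clean up afterwards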

It would be weird if the directory belonged to root; the deployment configuration never uses root privileges. In that case it could be the EBS default file permissions.

Another approach would be modifying the StorageClass mountOptions:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - uid=1001
  - gid=1001
kbristow commented 1 year ago

Looks like the directory is indeed owned by root, and that is the root cause of the issue:

$ ls -l /bitnami/postgresql
total 16
drwx------ 2 root root 16384 Jul 13 13:43 lost+found

I guess that is how EBS volumes are permissioned by default. Is that something you want to cater for in your default release manifest? While I probably won't be using the Postgres created via the Results release manifest, for users who want to try Results it may be worthwhile putting something in place to handle this permission issue.

Either way, I am happy to close the issue from my side if there is nothing further you want me to test.

xinnjie commented 1 year ago

The default Results release does implicitly require the right file permissions on the volume.

While I probably won't be using the Postgres created via the Results release manifest, for users who want to try Results it may be worthwhile putting something in place to handle this permission issue.

Yes, agreed. Especially for users who want to try it out in an environment provided by a cloud provider, the default file permission strategy varies depending on which storage they use.

It would be appreciated if you could make a PR for it: document this potential permission issue and handle the permissions (of course you could let me do it if you prefer).

We could close this issue after merging that PR.

kbristow commented 1 year ago

I am not going to be around until next Wednesday, so I am happy for you to do the PR. To add: once you mentioned that Postgres runs as user 1001, I realised I could just set spec.template.spec.securityContext.fsGroup: 1001 on the Postgres StatefulSet, which also resolves my issue. That seems like a better solution and probably doesn't need any documentation changes either. Thoughts?
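For reference, a minimal sketch of that change as a patch, assuming the StatefulSet is called tekton-results-postgres in the tekton-pipelines namespace (both names are assumptions; adjust to your install):

# give the pod an fsGroup so the mounted volume becomes writable by gid 1001
kubectl -n tekton-pipelines patch statefulset tekton-results-postgres \
  --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":1001}}}}}'

With fsGroup set, the kubelet changes the group ownership of the volume contents on mount, so the non-root Postgres user (uid 1001) can create /bitnami/postgresql/data without needing an init container that runs as root.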

xinnjie commented 1 year ago

Ok let me do it.

tekton-robot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

gerrnot commented 8 months ago

The same happens for the logs PV:

{"level":"error","ts":1711353330.4722874,"caller":"zap/options.go:212","msg":"finished streaming call with code Unknown","grpc.auth_disabled":false,"grpc.start_time":"2024-03-25T07:55:29Z","system":"grpc","span.kind":"server","grpc.service":"tekton.results.v1alpha2.Logs","grpc.method":"UpdateLog","peer.address":"10.248.106.207:48550","grpc.user":"system:serviceaccount:tekton-pipelines:tekton-results-watcher","grpc.issuer":"https://kubernetes.default.svc.cluster.local","error":"failed to create directory /logs/yournamespace/4c90c662-6e12-3c8a-b6ef-4e8f3eb8b23f, mkdir /logs/yournamespace: permission denied","grpc.code":"Unknown","grpc.time_duration_in_ms":951,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:61\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1673\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1787\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/grpc@v1.60.1/server.go:1016"}
zbialik commented 1 month ago

I am not going to be around until next Wednesday, so I am happy for you to do the PR. To add: once you mentioned that Postgres runs as user 1001, I realised I could just set spec.template.spec.securityContext.fsGroup: 1001 on the Postgres StatefulSet, which also resolves my issue. That seems like a better solution and probably doesn't need any documentation changes either. Thoughts?

Can we get this implemented, please?