nextcloud / helm

A community maintained helm chart for deploying Nextcloud on Kubernetes.
GNU Affero General Public License v3.0

Stuck at "Initializing Nextcloud..." when attached to NFS PVC #10

Open somerandow opened 4 years ago

somerandow commented 4 years ago

Doing my best to carry helm/charts#22920 over to the new repo, as I am experiencing this issue as well. I have refined the details a bit, as the issue appears to be specific to NFS-backed storage.

Describe the bug

When bringing up the nextcloud pod via the helm chart, the logs show the pod as being stuck at:

2020-08-31T19:00:42.054297154Z Configuring Redis as session handler
2020-08-31T19:00:42.098305129Z Initializing nextcloud 19.0.1.1 ...

Even backing the liveness/readiness probes out to over 5 minutes does not help. If I instead switch the PVC to my storageClass for Rancher Longhorn (iSCSI), for example, the Nextcloud install initializes in seconds.

Version of Helm and Kubernetes:

helm: v3.3.0
kubernetes: v1.18.6

Which chart:

nextcloud/helm

What happened:

Nextcloud never finishes initializing when its persistence volume is backed by NFS (see the logs above).

What you expected to happen:

- Nextcloud finishes initialization
- Nextcloud files appear with correct permissions on the NFS volume

How to reproduce it (as minimally and precisely as possible):

Set up an NFS provisioner:

helm install nfs stable/nfs-client-provisioner \
  --set nfs.server=x.x.x.x --set nfs.path=<path>

OR Configure an NFS PV and PVC manually

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-data
  labels:
    app: cloud
    type: data
spec:
  capacity:
    storage: 100Ti
  nfs:
    path: <path>
    server: <server>
  mountOptions:
    - async
    - nfsvers=4.2
    - noatime
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-manual
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nextcloud-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: nfs-manual
  volumeMode: Filesystem
  selector:
    matchLabels:
      app: cloud
      type: data

Install nextcloud: helm install nextcloud nextcloud/nextcloud -f values.yaml --namespace=nextcloud

values.yaml:

image:
  repository: nextcloud
  tag: 19
readinessProbe:
  initialDelaySeconds: 560
livenessProbe:
  initialDelaySeconds: 560
resources:
  requests:
    cpu: 200m
    memory: 500Mi
  limits:
    cpu: 2
    memory: 1Gi
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: acme
    kubernetes.io/ingress.class: nginx
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  hosts:
    - "cloud.myhost.com"
  tls:
    - hosts:
        - "cloud.myhost.com"
      secretName: prod-cert
  path: /
nextcloud:
  username: admin
  password: admin1
  # datadir: /mnt/data
  host: "cloud.myhost.com"
internalDatabase:
  enabled: true
externalDatabase:
  enabled: false
persistence:
  enabled: true
  # accessMode: ReadWriteMany
  # storageClass: nfs-client if creating via provisioner
  existingClaim: nextcloud-data # comment out if creating new PVC via provisioner
somerandow commented 4 years ago

I will add as well that my example PV above includes:

  mountOptions:
    - async
    - nfsvers=4.2
    - noatime

These do not appear to affect (or improve) NFS performance at all in this case. Compared with my other deployments that use NFS, this seems odd.

thunerbl commented 4 years ago

Hello there,

I've got the same issue: an NFS PVC works well with Nextcloud v17, but, like @WojoInc, with Nextcloud v19 I'm stuck at "Initializing Nextcloud...".

Even though the installation seems to fail and the pod loops on restarts, my NFS volume does get written with Nextcloud v19 data. I'm now trying to get more verbosity about it.

Have a nice time :)

Scizoo88 commented 4 years ago

Hi,

I faced the same problem. I logged in to the physical node and watched the docker logs. There I saw that Nextcloud tried to connect via HTTP to the defined host. I have HAProxy (OPNsense) in front of Kubernetes and redirect all HTTP to HTTPS, and this was the issue. For the Nextcloud init process, I temporarily added an HTTP rule for it, and the process completed without problems.

Maybe you have a similar setup?

BR Scizoo

thunerbl commented 4 years ago

Hello @Scizoo88,

Thanks for sharing your experience. I don't think I have that setup, because my Nextcloud 19 pod (without an NFS PVC for now) is accessible via both HTTP and HTTPS.

In my case, the only difference between a working and a non-working setup is that I've enabled data persistence (when I choose Nextcloud v19). Persistence worked great on Nextcloud 17 with the same Kubernetes network setup, though.

Have a nice day,

thunerbl commented 4 years ago

Okay, I've managed to connect with an externalDB; Nextcloud 19 seems to install and function pretty well with the PVC enabled. Maybe this error is SQLite-related.

chrisingenhaag commented 4 years ago

Hi guys, I already checked this. We're using a fixed fsGroup for the apache and the nginx containers. Because Nextcloud copies files around via rsync on startup, it relies on valid permissions on the volumes.

But in my case the user id and groups on my NFS client mount are different. My logs show permission denied errors.

I see two possible solutions.

For the moment I would tend to go for the sidecar possibility, so that you can handle volume permissions yourselves.

Best

somerandow commented 4 years ago

I seem to run into permission errors even when the NFS mount is owned by www-data. I have tried manually editing the securityContext to set fsGroupChangePolicy, and this didn't seem to resolve the issue either. I'll dive in a bit more and test whether a sidecar or init container could set the permissions correctly.
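For reference, the knobs being discussed here are standard pod-level securityContext fields. A hedged sketch of the kind of configuration being tried (assumptions: `fsGroupChangePolicy` is only available behind a feature gate before Kubernetes v1.20, and kubelet-driven fsGroup ownership changes generally do not take effect on NFS volumes, since the NFS server applies its own ID mapping):

```yaml
# Sketch only: standard Kubernetes pod securityContext fields. Where these
# land in the chart's values.yaml depends on the chart version.
securityContext:
  fsGroup: 33                            # www-data in the official nextcloud image
  fsGroupChangePolicy: "OnRootMismatch"  # beta from Kubernetes v1.20
```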

somerandow commented 4 years ago

I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and the way rsync checks for copied files. cp was slightly faster, but the real fix was enabling the async option on the NFS export, at least for the initial install (I had only been adding this option to the PersistentVolume). This took the time to initialize Nextcloud down from >15 minutes to just under 10 seconds.

I plan to test next whether the permissions are still an issue.

J3m5 commented 4 years ago

I'm experiencing the same problem, I tried to change the securityContext params but that didn't solve the problem...

davad commented 4 years ago

I think I'm having the same issue:

  1. the container is being periodically restarted
  2. the only output to the log is "Initializing nextcloud 19.0.3.1 ..."
  3. the PVC is automatically created from my NFS storage class

I'll try adding the async option to the host and PV, then report back.

Edit: having trouble adding async to my NFS server because of the storage class provider I'm using.

unixfox commented 4 years ago

@WojoInc Could you explain how you changed the NFS export options?

sOblivionsCall commented 4 years ago

also looking for guidance here; I'm seeing a permission issue that I'm not sure has an easy fix, as I'm also using an nfs-provisioner

kubectl logs nextcloud-7969756654-7j9xh --tail 50 -f
Initializing nextcloud 19.0.4.2 ...
Upgrading nextcloud from 17.0.0.9 ...
Initializing finished
Console has to be executed with the user that owns the file config/config.php
Current user: www-data
Owner of config.php: root
Try adding 'sudo -u root ' to the beginning of the command (without the single quotes)
If running with 'docker exec' try adding the option '-u root' to the docker command (without the single quotes)

I would go change the default permissions on the NFS export, but all other pods using NFS would then run into issues. Previously you discussed options to change the storage owner via a sidecar or fsGroupChangePolicy. Can you please expand on how this is accomplished?
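One concrete shape of the sidecar/init-container idea asked about here is an init container that chowns the volume before Nextcloud's rsync runs. A hedged sketch (the volume name `nextcloud-data` is a placeholder; take the real name from your rendered pod spec):

```yaml
initContainers:
  - name: fix-volume-perms
    image: busybox:1.36
    # 33:33 is www-data in the official nextcloud image
    command: ["sh", "-c", "chown -R 33:33 /var/www/html"]
    volumeMounts:
      - name: nextcloud-data   # placeholder; match your pod spec
        mountPath: /var/www/html
```

Note that this only works if the NFS export lets root chown files, i.e. it is exported with no_root_squash.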

sundowndev commented 4 years ago

I have the same issue, and the container does not contain any log file. Any workaround for this?

EDIT: the issue appears to come from the livenessProbe delay being too low; the initialization does not have time to finish. Disabling both livenessProbe and readinessProbe worked for me (Nextcloud 19-apache):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
Janl1 commented 3 years ago

> I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.
>
> I plan to test the permissions if the permissions are still an issue now.

@WojoInc Are you using the nextcloud helm chart with replication set to e.g. 3?

mikeyGlitz commented 3 years ago

I'm using the following configuration on the helm chart using terraform to set up the release:

resource "kubernetes_namespace" "ns_files" {
  metadata {
    name = "files"
  }
}

resource "helm_release" "rel_files_cloud" {
  repository = "https://nextcloud.github.io/helm/"
  name="cloudfiles"
  chart = "nextcloud"
  namespace="files"

  values = [
      <<YAML
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: traefik
            cert-manager.io/cluster-issuer: cluster-issuer
            traefik.ingress.kubernetes.io/redirect-entry-point: https
            traefik.frontend.passHostHeader: "true"
          tls:
            - hosts:
              - files.haus.net
              secretName: nextcloud-app-tls
      YAML
   ]

  set {
    name = "nextcloud.host"
    value = "files.haus.net"
  }

  set {
      name = "nextcloud.username"
      value = "vault:secret/data/nextcloud/app/credentials#app_user"
  }
  set {
      name = "nextcloud.password"
      value = "vault:secret/data/nextcloud/app/credentials#app_password"
  }
  set {
      name = "mariadb.enabled"
      value = "true"
  }
  set {
      name = "mariadb.db.password"
      value = "vault:secret/data/nextcloud/db/credentials#db_password"
  }
  set {
      name = "mariadb.db.user"
      value = "vault:secret/data/nextcloud/db/credentials#db_user"
  }
  set {
      name = "mariadb.master.persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
  set {
      name = "persistence.enabled"
      value = "true"
  }
  set {
      name = "persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "persistence.size"
      value = "2.5Ti"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
}

I end up with the following log for nextcloud:

time="2020-12-15T23:02:34Z" level=info msg="received new Vault token" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="initial Vault token arrived" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="spawning process: [/entrypoint.sh apache2-foreground]" app=vault-env
Initializing nextcloud 19.0.3.1 ...

I checked the nfs-client-provisioner and noticed that the folders have the following permissions:

/mnt/external/files-cloudfiles-nextcloud-nextcloud-pvc-646eb797-7470-4dd3-94cc-590b9ca5a074# ll
total 36
drwxrwxrwx  9 root     root 4096 Dec 15 22:47 ./
drwxr-xr-x 13 root     root 4096 Dec 15 23:07 ../
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 config/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 custom_apps/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 data/
drwxrwxrwx  8 www-data root 4096 Dec 15 23:02 html/
drwxrwxrwx  4 root     root 4096 Dec 15 22:47 root/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 themes/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 tmp/

My /etc/exports has the following configuration

/mnt/external 192.168.0.120/32(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 172.16.0.0/29(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 10.42.0.0/16(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)
immanuelfodor commented 3 years ago

I'm not using the Helm chart; I've just manually created a Deployment for NC with an nfs-client-provisioner volume, but I experience the same issue. In my case I moved a previous NC install to k8s, so my log output consists of the initializing line, then an upgrading line, and then it is stuck forever. Execing into the pod and running top, it seems an rsync command runs forever.

immanuelfodor commented 3 years ago

What's most disturbing is that the S and D statuses mean sleep and uninterruptible sleep, so it seems all the syncs are not doing anything. Also tried setting fsGroup to 33 but nothing changes, and the existing files are at the right permission from the previous non-k8s install I think.

root@nextcloud-55c6cb7cbd-d9cmv:/var/www/html# ps aux --width 200              
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND     
root           1  0.0  0.0   2388  1444 ?        Ss   18:45   0:00 /bin/sh /entrypoint.sh /usr/bin/supervisord -c /supervisord.conf                           
root          32  0.0  0.1 114460 12568 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          33  0.0  0.1 126596  8372 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          34  0.2  0.0 114620  3796 ?        D    18:45   0:01 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          63  0.0  0.0   4000  3076 pts/0    Ss   18:53   0:00 bash        
root          72  0.0  0.0   7640  2664 pts/0    R+   18:54   0:00 ps aux --width 200
maxirus commented 3 years ago

I am having the same issue with v20.0.4.

immanuelfodor commented 3 years ago

I was using NFSv4.2 in my previous try. When I pinned the version to NFSv3, it got further, but then also got stuck, on an occ PHP command. The symptoms are the same: deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with the NFS and FUSE features enabled, backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I saw a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably afterwards; only restarting the whole host stabilized it again. This is both absurd and funny at once: starting NC in k8s brings down the whole hypervisor, lol 😃

unixfox commented 3 years ago

> I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

Can you also replicate huge load average numbers when running Nextcloud with NFS in a k8s cluster for at least one week? I also have to restart the node, because the load average reaches 100 for some unknown reason due to Nextcloud.

immanuelfodor commented 3 years ago

It never started up far enough to reach the web interface; it just got stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

unixfox commented 3 years ago

> I has never started up to get to the web interface, just stuck in either rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

CPU usage is not the only thing to check; look at the load average in htop while it's doing its thing.

immanuelfodor commented 3 years ago

It really didn't do anything; all threads were sleeping in top (S and D flags).

dentropy commented 3 years ago

I have had the same error and was able to resolve it by fixing /etc/exports. I was also using the nfs-provisioner.

My previous /etc/exports file was

/mnt/nfsdir -async,no_subtree_check *(rw,insecure,sync,no_subtree_check,no_root_squash)

I changed it to the rancher /etc/exports example and I was able to deploy nextcloud successfully.

/mnt/nfsdir    *(rw,sync,no_subtree_check,no_root_squash)
jonkerj commented 3 years ago

I've been having this issue as well. I think it's caused by a few things:

When I look at my nfsd stats (grafana/prometheus/node-exporter), there is a lot (+/- 50% of the IOPS) of GetAttr (caused by lstat syscalls) going on during the rsync. When using block-based volumes, these are served from local cache, which is magnitudes quicker.

Sure, async,noatime will improve things, and maybe even throw in NFS3, but in the end you're rsyncing a truckload of files onto an NFS share, and that's not very efficient.

I'd suggest to enable the startupProbe, and tweak the periodSeconds and failureThreshold. This is probably better than tweaking/disabling the readiness/liveness probes.
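In chart values, that suggestion looks roughly like this (a sketch; the numbers are illustrative, and the pod gets up to periodSeconds × failureThreshold to finish initializing before the liveness probe can kill it):

```yaml
startupProbe:
  enabled: true
  periodSeconds: 20
  failureThreshold: 90   # 20s x 90 = up to 30 minutes to initialize
# liveness/readiness can keep their defaults; they only start
# counting once the startup probe has succeeded
```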

danielvandenberg95 commented 3 years ago

Same issue with a Kadalu backend. I set the initial delay to a day; let's see what happens...

Edit: it took two hours to initialize.

dcardellino commented 3 years ago

Is there any solution for this issue?

I tried every suggestion with no success :(

lknite commented 2 years ago

Don't believe anyone when they tell you NFS or CIFS works with file locking. Inevitably you will experience data corruption. I recommend a solution such as Longhorn or similar in a Kubernetes environment. It will use local storage on each worker node and iSCSI behind the scenes as needed to create your PVCs.

We all start out using NFS in the Linux world, but it just doesn't support full file locking. iSCSI takes some time to learn; you might be better off using something like Longhorn and letting it do the iSCSI for you. Seriously, abandon NFS; don't waste any more of your life trying to get it to work.

I can't even begin to tell you how fast and flawlessly everything works with iSCSI, and how nice it is to have the slowness and inevitable bizarre failures of NFS behind me. Make the change. Do it, do it now. (Or just buy a network storage device that uses iSCSI.)

https://forums.plex.tv/t/roadmap-to-allow-network-share-for-configuration-data/761162

** Update: I wanted to note that I've since learned that with the right NFS-specific hardware, NFS can perform as quickly as iSCSI. Also, VMware adds some sort of protection to its NFS shares, so those actually do support full file locking. And Longhorn isn't perfect either: it uses NFS for its RWX shares (sigh), though RWO with Longhorn works. I think I'm going to switch to Rook/Ceph.

jonkerj commented 2 years ago

Locking is not the issue here, it's the fact that lstat is not served by a local FS or cache.

I think both NFS and block based solutions have their place, even in a Kubernetes context, and both come with their unique advantages and problems. In this (specific) case I totally agree with you: a block based solution will not have this problem.
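For those staying on NFS, client-side attribute caching can blunt the GETATTR/lstat load somewhat. A hedged sketch of PV mountOptions (`actimeo` is a standard Linux NFS client mount option; higher values trade attribute freshness for fewer round-trips):

```yaml
mountOptions:
  - noatime
  - nfsvers=4.2
  - actimeo=60   # cache file attributes for 60s, cutting lstat-driven GETATTR traffic
```

This does not change the server-side sync/async behavior discussed elsewhere in this thread.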

devent commented 2 years ago

It's a permission issue I think. The pod fails with:

rsync: [receiver] chown "/var/www/html/resources/config/.mimetypealiases.dist.json.bYpaGG" failed: Operation not permitted (1)
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypemapping.dist.json.ChHk9F" failed: Operation not permitted (1)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]

All files are synced, but because rsync can't do the chown it returns a non-zero exit code.

Y0ngg4n commented 2 years ago

Initialization took 30 minutes for me :(

pcgeek86 commented 2 years ago

Same problem here. I tried to deploy Nextcloud as a container running on AWS Fargate. I attached an Amazon Elastic Filesystem (EFS) mount to the Fargate task, and mounted it to /var/www/html/. According to CloudWatch Logs, the only output from the container is:

Initializing nextcloud 21.0.8.3 ...


I was hoping to connect Nextcloud to an S3-compatible storage provider, so I could share out some images through a user-friendly web front-end. Never used Nextcloud before, so kinda disappointed that it doesn't "just work." Granted, it's free, so I'm thankful for that. :)

Note: Since I'm using Fargate, I'm obviously not using Kubernetes / Helm to deploy.

💥 Edit: pfffffft ... of course, literally right as I posted this, initialization completed. It took almost 7 minutes! Yikes, lol. But at least it is running now!


Y0ngg4n commented 2 years ago

New highscore....No initialization after 4 days....seems to not work on nfs

legolego621 commented 2 years ago

Hello everybody! I had this problem and resolved it by enabling async mode on the NFS server.

# cat /etc/exports
/mnt/nfs-storage   192.168.2.0/23(rw,async,no_subtree_check,no_root_squash)

# exportfs -a

As a result, the initialization took one and a half minutes, against 15-20 minutes with sync mode enabled.

ATTENTION! As far as I understand, the difference between the NFS server's sync and async modes is that with sync the server must commit each write to stable storage before replying, while with async it replies immediately and writes the data out later.

Because Nextcloud generates a very large number of small files, this is a problem in sync mode.

As a result, I gained performance even after initialization; my download speed was 40% faster.

Y0ngg4n commented 2 years ago

@legolego621 that's a good workaround to know about, but it is not advised to use the async option on ZFS, for example.

legolego621 commented 2 years ago

> @legolego621 thats a good to know workaround but it is not advised to use async option on zfs for example.

Yes; this is because data corruption can theoretically occur in transit. This must be taken into account.

Y0ngg4n commented 2 years ago

@legolego621 exactly

5cat commented 2 years ago

I think the issue here is that NFS gets bottlenecked by the large number of files rsync is copying. I tried to debug it, since the log output is kinda useless; the startup is stuck at this line in entrypoint.sh, and I have confirmed it with ps aux inside the container:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www-data      1  0.0  0.0   2420  1552 ?        Ss   04:46   0:00 /bin/sh /entrypoint.sh apache2-foreground
www-data     20  0.5  0.0  76012  9884 ?        S    04:46   0:00 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr
www-data     21  1.4  0.0 102172  3996 ?        S    04:46   0:01 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr
www-data     22  3.2  0.0 148916  5076 ?        D    04:46   0:02 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr

So I can confirm @immanuelfodor's findings here.

I have two storage classes from my volume provisioner: one for large, slow HDD storage, which I initially tried to use, and another for small, fast SSD storage.

The issue was fixed for me when I switched to mounting the volume on the SSDs over NFS. It is a known issue that rsync can be slow for a lot of small files over NFS, since it needs to do a lot of synced iops. I was going to spend time trying to optimize that rsync line in entrypoint.sh, but luckily the SSDs saved me the time.

kashapovd commented 2 years ago

Hi guys! I recently faced the same problem. I tried to deploy Nextcloud with replicas > 1, but it seems that all the initializing Nextcloud replicas together create a big load on the storage. The solution was to deploy Nextcloud with only one replica and, once it had successfully started, increase replicas to the desired amount. You can also disable the probes on the first start (while it is initializing):

--set livenessProbe.enabled=false --set readinessProbe.enabled=false
piecko commented 1 year ago

Hi, AFAIK the only permanent solution would be to use an init container for updating/migrating the database and everything else. Probes do not apply to initContainers, and k8s will not kill them automatically. This should avoid the problem in all cases, without the need for temporary workarounds.
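A minimal sketch of that idea, assuming only the file-sync step (the rsync quoted earlier in this thread) is moved into an init container; the real entrypoint also does version checks and upgrade handling, so this is not a complete replacement:

```yaml
initContainers:
  - name: pre-sync
    image: nextcloud:19-apache   # must match the main container's image
    # the same rsync the entrypoint runs; with the files already in place,
    # the main container's own sync then finishes quickly
    command: ["rsync", "-rlD", "--delete", "--exclude-from=/upgrade.exclude",
              "/usr/src/nextcloud/", "/var/www/html/"]
    volumeMounts:
      - name: nextcloud-data   # placeholder; match your rendered pod spec
        mountPath: /var/www/html
```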

jessebot commented 1 year ago

Perhaps it would be helpful if someone created a PR explaining this in a section of the README? I or someone else with permissions would be happy to review it! :) There are a few things that come up around NFS, so it would be nice to improve our documentation on that to help others.

pchang388 commented 1 year ago

I stumbled upon this thread and got things working (so far; it's early still) with multiple replicas (3), and fixed the permission issue for the most part with some small workarounds.

My backend storage is a Synology NAS (my first NAS, purchased very recently) running an NFS server, and one of the issues I see in here is similar to what I'm seeing: permissions. Synology only has very cookie-cutter options for NFS; from what I've learned so far, it doesn't allow plain/simple mappings of UID/GID or any real customization. I hear it's possible through other means (Kerberos), but I didn't get that far yet.

So in order to run multiple replicas with this specific helm chart, from what I understand you need:

  1. Redis Cluster
  2. RWX backend storage for at least the data dir
  3. External DB (Postgres for me)
  4. LB proxy in front of all the replicas (k8 service should do this in default round robin I believe)

The permission issues arise (at least in my case) due to:

  1. the www-data user running as UID/GID 33 and also doing an rsync with the chown flag, as mentioned a few times in this thread
    • So essentially it is copying over some init files for it to run, and chowning them
  2. Synology only allowing basic options like "map to admin/guest uid and gid" and "no mapping" (essentially, don't squash the UID/GID into admin or guest)
    • This is the big hindrance for this specific deployment method (k8s) of Nextcloud. Docker and docker-compose let you specify a user (e.g. docker run --user, or the user: key in docker-compose.yaml). Kubernetes doesn't really have the same option, I believe; most similar issues are solved in k8s by initContainers, lifecycle hooks, or securityContexts, or by setting a UID/GID via env (if the image supports it), or some combination of these.
  3. Since the UID/GID of www-data cannot be changed in this style of deployment, I have to let it pass through as 33 into Synology, or enable squashing. Squashing is an issue because of the chown: it's not going to work, and you will see permission denied.
    • So how do I get 33 to map over properly?

As mentioned, this works for me due to Synology NAS NFS being a key part of the problem so YMMV. It also works partially because rsync does not work on NFS well so initializing phase can take a long time.

NOTE: the original fix (further down) was not consistent enough for me during restarts/updates, so I am going to try the alternative route mentioned in that section until the docker image/entrypoint allows us to specify the UID/GID directly via env vars or similar means. I will cross it out but leave it up in case anyone wants to take it further than I did.**

The alternative I am going to try is similar to this: https://github.com/nextcloud/docker/issues/359#issuecomment-1154170151 but slightly modified, since GID 100 already exists in the nextcloud image. I just have to assign www-data GID 100 as its primary group instead of changing the group itself, example: **

FROM nextcloud:23.0.5-fpm-alpine
# note: usermod comes from the shadow package, which the alpine base image does not ship
RUN apk add --no-cache shadow && \
    usermod -g users www-data && \
    usermod --uid 1028 www-data

Section below that kind of worked, but it was not consistent enough and causes issues; leaving it up for anyone who wants to give it a read anyway.

What I ended up doing was:

  1. (Optional) In Synology, create a user to own the folders in the NFS share we will create
  2. In Synology, create a shared folder and enable NFS share in it
    • I have two NFS shares enabled: one that has squashing and another that does not. In this case, this one should not have any squashing. In the mappings you should leave it as "no mapping". I also enabled async, and the other checkmark options for my specific environment. Also ensure you give permissions via IP/hostname in the settings.
  3. Give the user permissions to r/w to the NFS share
  4. ssh into the Synology NAS and become root
  5. Create a subdirectory in the NFS share, example (nextcloud dir): /volume1/k3s-nosquash-share/nextcloud
  6. Change the owner of the folder to the user you created; cat /etc/passwd, then: chown -R 1028:100 nextcloud (100 is the users group, which every user you create is in by default). Replace the UID with that of the user you created.
  7. I'm using the nfs-subdir provisioner and configured a specific storage class just for nextcloud, example:

     nfs:
       server: something.something
       path: /volume1/k3s-nosquash-share/nextcloud
       mountOptions:
         - nfsvers=4.1 # version synology runs on (at least mine does)
       volumeName: nfs-subdir-nosquash-nextcloud
       # Reclaim policy for the main nfs volume
       reclaimPolicy: Retain
  8. Reference that storage class in the values.yaml file
  9. Add a postStartHook. I don't believe initContainers would help with the permission issues in my specific scenario with Synology. You could try to modify the UID/GID with an initContainer somehow, by mounting /etc/passwd and /etc/group of the nextcloud pod, but I opted for this approach for simplicity, to not mess with the chart too much:

     # Allow configuration of lifecycle hooks
     # ref: https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
     lifecycle:
       # not guaranteed to run before the entrypoint starts/finishes (it's async), but let's see? It should, due to the rsync job in entrypoint.sh
       postStartCommand: ["/bin/bash","-c","sed -i 's/33:33/1028:100/g' /etc/passwd && sed -i 's/:33:/:100:/g' /etc/group"]

    • Basically a simple replacement of the UID with the one that maps to the Synology user I created
    • This enabled me to run multi-replica, so far
  10. Turn on the startupProbe and give it 2+ minutes of delay, depending on your env; this gives rsync time to finish without the liveness/readiness probes starting until that time has elapsed:

      startupProbe:
        enabled: true
        initialDelaySeconds: 120  # change this
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 30
        successThreshold: 1

I typed a lot already, so I'm just going to add my unfiltered notes, which add more detail. I tried A LOT of things; the other viable option, which is kind of annoying but I think simple, is to build your own image on top that modifies the UID/GID, example: https://github.com/nextcloud/docker/issues/359#issuecomment-1154170151

NOTES:

  • What about postStartHook?
    • the entrypoint runs an rsync, so it could take some time; postStartHook starts async when the container is done initializing but not yet running
    • so it could run before the entrypoint does anything??
    • And instead of changing the user's group to a new groupid, we can change it to the existing 100 (users) group
          testt@6fa28ea33de4:/var/www/html$ id
          uid=1003(testt) gid=100(users) groups=100(users)
    • Not sure this will work, you need to start a new shell for uid/gid changes to reflect?
      postStartCommand: ["/bin/sh","-c","usermod -g users www-data && usermod --uid 1028 www-data"]
      • Yep, it failed due to not reflecting (a user usually needs to log out and back in for it to take effect, i.e. a new shell is needed)
      • What about changing /etc/passwd and /etc/group?
        postStartCommand: ["/bin/bash","-c","sed -i 's/33:33/1028:100/g' /etc/passwd && sed -i 's/:33:/:100:/g' /etc/group"]
    • It worked and is instant. The only issue is that a few initial folders got created during the rsync before the change went through
    • I ssh into NAS and did a chown -R for all subdirs
      ash-4.4# chown -R 1028:100 nextcloud/
      ash-4.4# cd nextcloud/nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650/
      ash-4.4# ls -lrt
      total 0
      drwxrwxrwx 1 k3s-nextcloud users 76 Feb  6 03:45 nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650
      ash-4.4# cd nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650/
      ash-4.4# ls -lrt
      total 0
      drwxrwxrwx 1 k3s-nextcloud users   0 Feb  6 03:45 custom_apps
      drwxrwxrwx 1 k3s-nextcloud users   0 Feb  6 03:45 tmp
      drwxrwxrwx 1 k3s-nextcloud users  14 Feb  6 03:45 root
      drwxrwxrwx 1 k3s-nextcloud users  26 Feb  6 03:48 themes
      drwxrwxrwx 1 k3s-nextcloud users 468 Feb  6 03:48 html
      drwxrwxrwx 1 k3s-nextcloud users 346 Feb  6 03:48 config
      drwxrwxrwx 1 k3s-nextcloud users 152 Feb  6 03:48 data
    • I restarted the nc pods to make sure it would behave okay after restarts when the postStart ran again
    • seems okay so far, no files owned by root/other created and no permissions errors yet
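To illustrate what that postStartCommand actually does to /etc/passwd, here is a hypothetical standalone demo of the same sed rewrite, run against a scratch copy rather than the real file (1028:100 are my Synology uid:gid, substitute your own):

```shell
# Safe demo of the /etc/passwd rewrite performed by the postStartCommand.
tmp=$(mktemp)
echo 'www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin' > "$tmp"
# Same substitution as the hook: swap www-data's 33:33 for 1028:100
sed -i 's/33:33/1028:100/g' "$tmp"
cat "$tmp"   # -> www-data:x:1028:100:www-data:/var/www:/usr/sbin/nologin
```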

As mentioned, this helps my use case but I hope it helps some others too. If anyone wants to see my yaml files for redis setup or other things, I can also provide that.

pchang388 commented 1 year ago

With the synology NAS situation I described above, I still had trouble with the docker image method due to the docker entrypoint script.

Dockerfile I'm using to fix permissions for synology:

# https://stackoverflow.com/questions/60450479/using-arg-and-env-in-dockerfile
ARG PLATFORM=linux/amd64
ARG IMAGE=nextcloud
ARG TAG=25.0.3-apache

## Due to M1 MBpro issues, build it for amd64 linux instead of arm64 which M1 mbpro uses
## I guess by default it will offer the image subversion that fits your cpu arch
## https://stackoverflow.com/questions/73398714/docker-fails-when-building-on-m1-macs-exec-usr-local-bin-docker-entrypoint-sh
FROM --platform=${PLATFORM} ${IMAGE}:${TAG}

RUN usermod -g users www-data && \
    usermod --uid 1028 www-data

RUN mkdir -p /var/www/html/config && \
    chown -R 1028:100 /var/www && \
    chmod -R 750 /var/www

The entrypoint that nextcloud uses creates and populates the /var/www/html/config folder as the root user. This means we end up with this scenario even after the fix:

root@nextcloud-79855df575-9pjr8:/var/www/html# ls -lrt
total 160
drwxrwsr-x  2 www-data www-data  4096 Feb  7 08:57 custom_apps
drwxrwsr-x  2 root     www-data  4096 Feb  7 08:57 config

root@nextcloud-79855df575-zmkdn:/var/www/html/config# ls -lrt
total 32
-rw-r--r-- 1 root www-data  158 Feb  7 10:37 trusted-domains.config.php
-rw-r--r-- 1 root www-data  668 Feb  7 10:37 smtp.config.php
-rw-r--r-- 1 root www-data  329 Feb  7 10:37 redis.config.php
- As discussed in my previous comment, initContainers probably won't work here since the volume is populated by the entrypoint itself, and init containers run before the main nc container
- postStartHook is not consistent enough
- securityContext settings like fsGroup won't work either, since the entrypoint changes the volume after it is mounted.

Nextcloud will fail from what appears to be another permission issue:

"Hint":"Configuration was not read or initialized correctly, not overwriting /var/www/html/config/config.php"

So for Synology NAS NFS users, I think the best way to do this (especially for those who want simplicity/uptime) is to simply use docker/docker compose (where you can specify the user and group ids) and mount the NFS share on the hosts themselves. Having to keep adding workarounds and adjusting other systems to fit the limitations of this docker image/entrypoint (not being able to provide a uid/gid) in a k8s environment is not sustainable in the long run, I think.
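For reference, a minimal docker compose sketch of that approach might look like this (image tag, host mount path, and the 1028:100 uid:gid are assumptions from my setup; note that running the official image with a fixed `user:` skips some of the entrypoint's root-only logic):

```yaml
# docker-compose.yml sketch: NFS share is mounted on the host
# (e.g. via /etc/fstab) at /mnt/nextcloud before starting the container.
services:
  nextcloud:
    image: nextcloud:25.0.3-apache
    user: "1028:100"          # Synology uid:gid, so files get the right owner
    ports:
      - "8080:80"
    volumes:
      - /mnt/nextcloud:/var/www/html
```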

The issue with the nextcloud image appears to be that the entrypoint does the bulk of the initializing for nextcloud. Providing a uid/gid during the docker build phase would probably solve it, but users would have to build the image locally every time. Preferred way: you could probably have env variables provided to the container (by the pod env spec) for UID/GID and have the entrypoint script use them properly, for example: https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh

## original
if [ "$(id -u)" = 0 ]; then
    rsync_options="-rlDog --chown $user:$group"
else
    rsync_options="-rlD"
fi

## proposed? - use a custom uid and gid when provided
if [[ -n "${CUSTOM_UID}" && -n "${CUSTOM_GUID}" ]]; then
    rsync_options="-rlDog --chown $CUSTOM_UID:$CUSTOM_GUID"
elif [ "$(id -u)" = 0 ]; then
    rsync_options="-rlDog --chown $user:$group"
else
    rsync_options="-rlD"
fi
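Here is that proposed logic as a standalone function so it can be exercised outside the image; CUSTOM_UID/CUSTOM_GUID are hypothetical variables, not something the official entrypoint reads today:

```shell
# Sketch of the proposed rsync-options selection, POSIX-sh compatible.
choose_rsync_options() {
    if [ -n "${CUSTOM_UID:-}" ] && [ -n "${CUSTOM_GUID:-}" ]; then
        # custom ids win even when running as root
        echo "-rlDog --chown ${CUSTOM_UID}:${CUSTOM_GUID}"
    elif [ "$(id -u)" = 0 ]; then
        echo "-rlDog --chown $user:$group"
    else
        echo "-rlD"
    fi
}

CUSTOM_UID=1028 CUSTOM_GUID=100 choose_rsync_options
# -> -rlDog --chown 1028:100
```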

I was hoping to have backend storage separate for the nc data and be able to scale workers in k8s easily, but I do not think it is ready/stable enough for Synology NAS users at this time. But if someone wants to give it a shot, I hope my information helps them.

BloodStainedCrow commented 1 year ago

Had the same issue spinning up a container on kubernetes with a pvc via NFS on a TrueNAS-Scale NAS. Setting the dataset in question to force async fixed the freezing.

asosnovsky commented 1 year ago

have the same issue with this running on k3s with a NAS export from TrueNAS Scale and using the nfs-subdir-external-provisioner. Debugging it further, it seems there are some rsync commands trying to copy around 500M that are taking way too long to finish.

[screenshot: process list showing three identical rsync processes]

not sure why the container has 3 instances of rsync that seem to do the same thing, maybe that can be causing the slowdown?

Chiloy commented 1 year ago

Had the same issue spinning up a container on kubernetes with a pvc via NFS on a TrueNAS-Scale NAS. Setting the dataset in question to force async fixed the freezing.

hi, I had the same issue and use TrueNAS. Can you tell me how to set async on TrueNAS? thanks

BloodStainedCrow commented 1 year ago

@Chiloy Find the dataset in your Datasets panel. On the right of Dataset Details there is an Edit button for the dataset. There you can find the setting Sync which I set to Disabled. Though if you don't have to, doing that is not advisable, since it can lead to lost data in case of sudden power loss.
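For anyone who prefers the CLI, the same TrueNAS setting maps to the ZFS `sync` property on the dataset (the dataset name below is a placeholder; the same data-loss caveat on sudden power loss applies):

```shell
# Disable synchronous writes on the dataset backing the NFS export
zfs set sync=disabled tank/nextcloud
# Verify the property took effect
zfs get sync tank/nextcloud
```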

Chiloy commented 1 year ago

@Chiloy Find the dataset in your Datasets panel. On the right of Dataset Details there is an Edit button for the dataset. There you can find the setting Sync which I set to Disabled. Though if you don't have to, doing that is not advisable, since it can lead to lost data in case of sudden power loss.

thanks, but I had a new problem. The prompt is 'Configuration was not read or initialized correctly, not overwriting /var/www/html/config/config.php'. It looks like a matter of permissions; I don't set podSecurityContext in the helm chart. Can you help me look at the problem?

truenas dir permissions:

      root@truenas[~]# ls -al /mnt/data
      total 34
      drwxr-xr-x  5 root     wheel      5 Apr 18 17:29 .
      drwxr-xr-x  3 root     wheel    128 Mar 28 17:08 ..
      drwxrwxrwx 25 k8s      k8s       32 Apr 20 18:46 k8s
      drwxr-xr-x  5 www-data www-data   5 Apr 19 18:08 share
      drwxr-xr-x  2 root     wheel      2 Mar 28 17:09 work
      root@truenas[~]# id k8s
      uid=1000(k8s) gid=1000(k8s) groups=1000(k8s),545(builtin_users),1001(www-data)

truenas nfs server: mapall user: root, mapall group: wheel

containerd logs:

      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html# ls -al
      total 231
      drwxrwxrwx 15 www-data www-data    30 Apr 21 01:47 .
      drwxrwxrwx  4 root     1000        4 Apr 21 01:46 ..
      -rw-r--r--  1 www-data www-data  3256 Apr 21 01:46 .htaccess
      -rw-r--r--  1 www-data www-data   101 Apr 21 01:46 .user.ini
      drwxr-xr-x 45 www-data www-data    52 Apr 21 01:46 3rdparty
      -rw-r--r--  1 www-data www-data 19327 Apr 21 01:46 AUTHORS
      -rw-r--r--  1 www-data www-data 34520 Apr 21 01:46 COPYING
      drwxr-xr-x 50 www-data www-data    50 Apr 21 01:47 apps
      drwxrwxrwx  2 root     1000       11 Apr 21 01:47 config
      -rw-r--r--  1 www-data www-data  4095 Apr 21 01:46 console.php
      drwxr-xr-x 24 www-data www-data    30 Apr 21 01:47 core
      -rw-r--r--  1 www-data www-data  6317 Apr 21 01:46 cron.php
      drwxrwxrwx  2 www-data www-data     2 Apr 21 01:46 custom_apps
      drwxrwxrwx  2 www-data www-data     3 Apr 21 01:47 data
      drwxr-xr-x  2 www-data www-data   168 Apr 21 01:47 dist
      -rw-r--r--  1 www-data www-data   156 Apr 21 01:46 index.html
      -rw-r--r--  1 www-data www-data  3456 Apr 21 01:46 index.php
      drwxr-xr-x  6 www-data www-data     9 Apr 21 01:47 lib
      -rw-r--r--  1 root     1000        0 Apr 21 01:46 nextcloud-init-sync.lock
      -rwxr-xr-x  1 www-data www-data   283 Apr 21 01:46 occ
      drwxr-xr-x  2 www-data www-data     3 Apr 21 01:47 ocm-provider
      drwxr-xr-x  2 www-data www-data     5 Apr 21 01:47 ocs
      drwxr-xr-x  2 www-data www-data     3 Apr 21 01:47 ocs-provider
      -rw-r--r--  1 www-data www-data  3139 Apr 21 01:46 public.php
      -rw-r--r--  1 www-data www-data  5549 Apr 21 01:46 remote.php
      drwxr-xr-x  4 www-data www-data     8 Apr 21 01:47 resources
      -rw-r--r--  1 www-data www-data    26 Apr 21 01:46 robots.txt
      -rw-r--r--  1 www-data www-data  2452 Apr 21 01:46 status.php
      drwxrwxrwx  3 www-data www-data     4 Apr 21 01:47 themes
      -rw-r--r--  1 www-data www-data   384 Apr 21 01:47 version.php
      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html# cd config/
      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html/config# ls -al
      total 50
      drwxrwxrwx  2 root     1000       11 Apr 21 01:47 .
      drwxrwxrwx 15 www-data www-data   30 Apr 21 01:47 ..
      -rw-r--r--  1 root     www-data  261 Apr 21 01:46 .htaccess
      -rw-r--r--  1 root     www-data   59 Apr 21 01:46 apache-pretty-urls.config.php
      -rw-r--r--  1 root     www-data   69 Apr 21 01:46 apcu.config.php
      -rw-r--r--  1 root     www-data  376 Apr 21 01:46 apps.config.php
      -rw-r--r--  1 root     www-data 1102 Apr 21 01:46 autoconfig.php
      -rw-r--r--  1 root     1000        0 Apr 21 01:47 config.php
      -rw-r--r--  1 root     www-data  329 Apr 21 01:46 redis.config.php
      -rw-r--r--  1 root     www-data   63 Apr 21 01:46 rewriter.config.php
      -rw-r--r--  1 root     www-data  668 Apr 21 01:46 smtp.config.php

Rehtard commented 1 year ago
  1. Since the uid/guid www-data cannot be changed in this style of deployment, I have to let it pass through with 33 into Synology or enable squashing. Squashing is an issue because of the chown and that's not going to work and you will see permission denied.

    • So how do I get 33 to map over properly?

I have the same issue, except that I am using a WD MyCloud EX2 where I can access the /etc/exports file and modify the guid and uid. I changed both to 33 but this didn't solve the problem. Do you maybe have any idea what else I could try?
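In case it helps, on a stock Linux NFS server the equivalent squash-everything-to-33 setup would be an /etc/exports entry like the one below (path and client range are placeholders, and WD's firmware may generate this file differently; run `exportfs -ra` after editing):

```
/nfs/nextcloud 192.168.1.0/24(rw,sync,no_subtree_check,all_squash,anonuid=33,anongid=33)
```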

Also the config folder and some other files still have root as the owner while almost every other file is owned by www-data.

mddeff commented 1 year ago

Just adding another datapoint

2023-08-05T01:04:36.724302556-04:00 Initializing nextcloud 27.0.1.2 ...
2023-08-05T01:09:08.543902901-04:00 New nextcloud instance

Running at about 4.5 minutes for me. NFS backed by 4 disk "raid 10" topology on zfs. Intel Datacenter Flash.