helm / charts

⚠️(OBSOLETE) Curated applications for Kubernetes

[stable/nextcloud] Image stuck at Initializing NextCloud... when PVC is attached #22920

Closed mikeyGlitz closed 3 years ago

mikeyGlitz commented 4 years ago

Describe the bug

When the helm chart is bringing up NextCloud, the application does not get past the log message

Initializing Nextcloud 17.0.7...

Version of Helm and Kubernetes:


helm: v3.2.1
kubernetes: v1.18.4+k3s1

Which chart:

stable/nextcloud

What happened:

  • Namespace is created
  • Helm creates the persistent volume claim
  • Helm instantiates MariaDB using the bitnami/mariadb chart
  • Helm instantiates the Nextcloud container
  • The Nextcloud container starts
  • The Nextcloud container does not get past

Initializing Nextcloud 17.0.7...

What you expected to happen:

Nextcloud was supposed to finish initialization. Nextcloud files were supposed to be copied with the correct permissions to the PVC.

How to reproduce it (as minimally and precisely as possible):

Initialize helm with the following:

helm install nfs stable/nfs-client-provisioner --namespace=nas \
  --set nfs.server=x.x.x.x --set nfs.path=/mnt/external

helm install files -f values.yaml stable/nextcloud --namespace=nextcloud

values.yaml

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: traefik
    cert-manager.io/cluster-issuer: cluster-issuer
    traefik.ingress.kubernetes.io/redirect-entry-point: https
    traefik.frontend.passHostHeader: "true"
  tls:
    - secretName: nextcloud-app-tls
      hosts:
        - files.haus.net
nextcloud:
  host: files.haus.net
  username: admin
  password: P@$$w0rd!
internalDatabase:
  enabled: false
mariadb:
  enabled: true
  password: P@$$w0rd!
  user: nextcloud
  name: nextcloud
persistence:
  enabled: true
  storageClass: nfs-client
  size: 1Ti

11jwolfe2 commented 4 years ago

I have also been trying to get this install to work with a PV and PVC, with no luck. If I do it without a PV and PVC it works; as soon as I enable the PV, it says the nextcloud directory isn't found, so I make the directory. Then it says "Error: failed to create subPath directory for volumeMount "nextcloud-data" of container "nextcloud"". Does anyone have any ideas about this?

derdrdirk commented 4 years ago

I am having the same issue. I also use nfs-client as the storageClass, which might be causing this bug? IIRC I used a manually created PV some time back and it worked.

Have you figured out how to make this work?

almahmoud commented 4 years ago

Not sure if we are having the same issue, but I will detail my investigation so far into using persistence.existingClaim, in case it helps people progress in their own investigations and/or the context helps someone more knowledgeable offer some help, as I have only worked with k8s for a year or so.

From what I could see, the container creation process errors out with:

Error: failed to start container "nextcloud": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/49c19090-14d6-4bee-b774-ca24b0ddd259/volume-subpaths/jun30third-nextcloud-data-pv/nextcloud/0\\\" to rootfs \\\"/var/lib/docker/overlay2/40dca10bcad3a57d61d35d40d0bd897f6d2322c3a5d9f615d2a90a38d7fe4cd5/merged\\\" at \\\"/var/lib/docker/overlay2/40dca10bcad3a57d61d35d40d0bd897f6d2322c3a5d9f615d2a90a38d7fe4cd5/merged/var/www\\\" caused \\\"no such file or directory\\\"\"": unknown

I've looked on the node during the time of the directory creation, and some things to note:

The only lead I've found so far for why this might be happening is https://github.com/kubernetes/kubernetes/issues/61545#issuecomment-465887014 and the comment that follows it, which links to https://github.com/kubernetes/kubernetes/issues/61563#issuecomment-428364190. My guess is that this is related to the second issue in the last comment (i.e. https://github.com/kubernetes/kubernetes/issues/61545), given that the config mounts are nested inside the directory mount. However, since the error is on subPath /nextcloud/0 of the container (which I have verified is the root subPath), this might not be true, but it is my best lead so far.

I'm currently poking by manually changing specifications to see if any configuration works (i.e. trying different variations of the mountpaths nesting to see if I can get it to start up manually before figuring out how to correct the chart), but in the meantime if anyone else finds a solution and/or if it seems I'm going down the wrong trail, please let me know!

Update: it is not the configmap causing this in my case, it's the nested mounts: https://github.com/helm/charts/blob/master/stable/nextcloud/templates/deployment.yaml#L289. Additionally, the problem only appears after the first restart (it seems that the first time it can do the mounting, but once things get written to the volumes and the container restarts, the bind mounts fail for the new container with the above error). This problem might be specific to our storage class (we're using an RClone CSI which fuse-mounts an S3 bucket) and different from yours, although I haven't tried it with an NFS layer on top yet to confirm. This does seem to be different from what you're seeing though... (sorry for hijacking your issue).

In case this comes up for anyone else: the current workaround is keeping only the root directory mount (which is enough to back up everything else, as they are nested inside it), and that seems to fix the problem.
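
To make the workaround concrete, the difference between the two mount layouts looks roughly like this (a paraphrased sketch, not copied from the chart; the subPath names are inferred from the directory listing further down in this thread):

# Nested layout: several subPaths of the same volume mounted inside one another
volumeMounts:
  - name: nextcloud-data
    mountPath: /var/www/
    subPath: root
  - name: nextcloud-data
    mountPath: /var/www/html
    subPath: html
  - name: nextcloud-data
    mountPath: /var/www/html/config
    subPath: config
  # ...plus similar mounts for data, custom_apps, themes, tmp

# Workaround: keep only the outermost mount; the nested directories
# already live on the same volume underneath it
volumeMounts:
  - name: nextcloud-data
    mountPath: /var/www/
    subPath: root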

11jwolfe2 commented 4 years ago

Okay, I got it working! I am using an OpenMediaVault NFS share for all of my persistent volumes. I set them up with the following settings, and it now works without any issues when using the regular helm install, no extra steps required.

Settings for nfs share.

almahmoud commented 4 years ago

It also works with nfs-server-provisioner (https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner) with expected values.

Specific values we're using

helm install nfs-provisioner stable/nfs-server-provisioner \
    --namespace myns \
    --set persistence.enabled=true \
    --set persistence.storageClass="ebs" \
    --set persistence.size=100Gi \
    --set storageClass.create=true \
    --set storageClass.reclaimPolicy="Delete" \
    --set storageClass.allowVolumeExpansion=true

and NextCloud snippet:

persistence:
  enabled: true
  storageClass: nfs
  accessMode: "ReadWriteMany"

I'll open a separate issue for the existingClaim problem

mikeyGlitz commented 4 years ago

Okay, I got it working! I am using an OpenMediaVault NFS share for all of my persistent volumes. I set them up with the following settings, and it now works without any issues when using the regular helm install, no extra steps required.

Settings for nfs share.

  • rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000

Tried changing the line in my /etc/exports and it didn't fix the problem.

mikeyGlitz commented 4 years ago

Using the following snippets:

nfs-client-provisioner.values.yaml

nfs:
  mountOptions:
    - nfsvers=4
  server: 172.16.0.1
  path: /mnt/external

I updated my nextcloud values with the new value persistence.accessMode=ReadWriteMany.
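
For reference, the resulting nextcloud persistence block would look roughly like this (reconstructed from the values earlier in this issue, not copied from the original report):

persistence:
  enabled: true
  storageClass: nfs-client
  accessMode: ReadWriteMany
  size: 1Ti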

Also didn't work.

I have the following directories in my volume:

drwxrwxrwx 9 root     root 4096 Jul  4 01:04 ./
drwxr-xr-x 7 root     root 4096 Jul  4 01:09 ../
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 config/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 custom_apps/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 data/
drwxrwxrwx 8 www-data root 4096 Jul  4 01:08 html/
drwxrwxrwx 4 root     root 4096 Jul  4 01:04 root/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 themes/
drwxrwxrwx 2 root     root 4096 Jul  4 01:04 tmp/
stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

tomhouweling1987 commented 4 years ago

Got the same problem. Tested with versions 17.0.0-apache and 19.0.1-apache. Also seeing that the dirs are root:root. When we deploy without a PVC, the installation works.

jesussancheztellomm commented 4 years ago

Using nfs-client-provisioner works, but the main problem is that the initial rsync takes around 5 minutes to complete (at least in my tests using GCP Filestore). You can look at the entrypoint.sh file.

rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/

If you disable the readiness and liveness probes in the values, it works.

❯ k logs nextcloud-5756597dbc-nhg5m
Initializing nextcloud 17.0.8.1 ...
Initializing finished
New nextcloud instance
Installing with PostgreSQL database
starting nextcloud installation
Nextcloud was successfully installed
setting trusted domains…
System config value trusted_domains => 1 set to string XXXXXXXX
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.192.149.41. Set the 'ServerName' directive globally to suppress this message
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 10.192.149.41. Set the 'ServerName' directive globally to suppress this message
[Tue Aug 11 08:54:50.097547 2020] [mpm_prefork:notice] [pid 1] AH00163: Apache/2.4.38 (Debian) PHP/7.3.21 configured -- resuming normal operations
[Tue Aug 11 08:54:50.097621 2020] [core:notice] [pid 1] AH00094: Command line: 'apache2 -D FOREGROUND'
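
For reference, turning both probes off in the chart values would look roughly like this (a minimal sketch, assuming the chart gates the probes on enabled flags under livenessProbe and readinessProbe):

livenessProbe:
  enabled: false   # assumption: the chart only renders the probe when this is true
readinessProbe:
  enabled: false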

I've tried some alternatives to that rsync, but since there are a lot of small files to copy I haven't found any improvement.

Any ideas?

timtorChen commented 4 years ago

The log looks stuck at Initializing Nextcloud 17.0.7... because the rsync process is extremely slow (for my local NFS it is about 1.5 MB/s; you can show the progress with rsync --info=progress2). Worse, the liveness probe will continuously fail and the pod eventually ends up in CrashLoopBackOff.

As a workaround, like jesussancheztellomm, I disable the liveness probe for the first installation and enable it again after the installation finishes.
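
If you'd rather not run without a liveness probe at all, a possible alternative (untested in this thread; assuming the chart passes these standard probe fields through to the pod spec) is to give the first rsync more headroom instead:

livenessProbe:
  enabled: true
  initialDelaySeconds: 600   # assumption: long enough to cover the initial rsync over NFS
  failureThreshold: 10
readinessProbe:
  enabled: true
  initialDelaySeconds: 600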

Maybe we can refer to nextcloud/docker#968. It will not solve the slow NFS transfer speed (I still have no idea why...), but a stateless application may remove the rsync step.

tomhouweling1987 commented 4 years ago

@timtorChen I can confirm: when I disabled the liveness probe it took 11 minutes to sync. I also tried it with an S3 storage backend, and it took just seconds to sync.

So I looked deeper into my NFS setup, and we are using sync instead of async because we do not want to lose any data. I didn't test it with an async export.

billimek commented 4 years ago

The nextcloud chart has migrated to a new repo. Can you please raise the issue over there? https://github.com/nextcloud/helm

somerandow commented 4 years ago

Opened an incident over on the new repo. Tried to summarize some of the info from this discussion.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue is being automatically closed due to inactivity.