nextcloud / helm

A community maintained helm chart for deploying Nextcloud on Kubernetes.
GNU Affero General Public License v3.0

Stuck at "Initializing Nextcloud..." when attached to NFS PVC #10

Open somerandow opened 4 years ago

somerandow commented 4 years ago

Doing my best to carry helm/charts#22920 over to the new repo, as I am experiencing this issue as well. I have refined the details a bit, as the issue appears to be specific to NFS-backed storage.

Describe the bug

When bringing up the nextcloud pod via the helm chart, the logs show the pod as being stuck at:

2020-08-31T19:00:42.054297154Z Configuring Redis as session handler
2020-08-31T19:00:42.098305129Z Initializing nextcloud 19.0.1.1 ...

Even backing the liveness/readiness probes out to over 5 minutes does not help. If I instead switch the PVC to my storageClass for Rancher Longhorn (iSCSI), for example, the Nextcloud install initializes in seconds.

Version of Helm and Kubernetes:

helm: v3.3.0
kubernetes: v1.18.6

Which chart:

nextcloud/helm

What happened:

Nextcloud never finishes initializing when its persistence volume is backed by NFS (see the logs above).

What you expected to happen:

- Nextcloud finishes initialization
- Nextcloud files appear with correct permissions on the NFS volume

How to reproduce it (as minimally and precisely as possible):

Set up an NFS provisioner:

helm install nfs stable/nfs-client-provisioner \
  --set nfs.server=x.x.x.x --set nfs.path=<path>

OR Configure an NFS PV and PVC manually

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-data
  labels:
    app: cloud
    type: data
spec:
  capacity:
    storage: 100Ti
  nfs:
    path: <path>
    server: <server>
  mountOptions:
    - async
    - nfsvers=4.2
    - noatime
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs-manual
  volumeMode: Filesystem
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nextcloud-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Ti
  storageClassName: nfs-manual
  volumeMode: Filesystem
  selector:
    matchLabels:
      app: cloud
      type: data

Install nextcloud: helm install nextcloud nextcloud/nextcloud -f values.yaml --namespace=nextcloud

values.yaml:

image:
  repository: nextcloud
  tag: 19
readinessProbe:
  initialDelaySeconds: 560
livenessProbe:
  initialDelaySeconds: 560
resources:
  requests:
    cpu: 200m
    memory: 500Mi
  limits:
    cpu: 2
    memory: 1Gi
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: acme
    kubernetes.io/ingress.class: nginx
    # nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  hosts:
    - "cloud.myhost.com"
  tls:
    - hosts:
        - "cloud.myhost.com"
      secretName: prod-cert
  path: /
nextcloud:
  username: admin
  password: admin1
  # datadir: /mnt/data
  host: "cloud.myhost.com"
internalDatabase:
  enabled: true
externalDatabase:
  enabled: false
persistence:
  enabled: true
  # accessMode: ReadWriteMany
  # storageClass: nfs-client if creating via provisioner
  existingClaim: nextcloud-data # comment out if creating new PVC via provisioner
somerandow commented 4 years ago

I will add as well that my example PV above includes:

  mountOptions:
    - async
    - nfsvers=4.2
    - noatime

These do not appear to affect (or improve) NFS performance at all in this case. Compared with my other deployments that use NFS, this seems odd.

thunerbl commented 4 years ago

Hello there,

I've got the same issue: an NFS PVC works well with Nextcloud v17, but, like @WojoInc, with Nextcloud v19 I'm stuck at "Initializing Nextcloud...".

Even though the installation seems to fail and the pod loops on restarts, my NFS volume does get written with Nextcloud v19 data. I'm now trying to get more verbosity about it.

Have a nice time :)

Scizoo88 commented 4 years ago

Hi,

I faced the same problem. I logged in to the physical node and watched the docker logs. There I saw that Nextcloud tried to connect via HTTP to the defined host. I have HAProxy (OPNsense) in front of Kubernetes and redirect all HTTP to HTTPS, and this was the issue. For the Nextcloud init process, I temporarily added an HTTP rule for it, and the process completed without problems.

Maybe you have a similar setup?

BR Scizoo

thunerbl commented 4 years ago

Hello @Scizoo88,

Thanks for sharing your experience. I don't think I have that setup, because my Nextcloud 19 pod (without an NFS PVC for now) is accessible via both HTTP and HTTPS.

In my case, the only difference between a working and a non-working setup is that I've enabled data persistence (when I choose Nextcloud v19). Persistence worked great on Nextcloud 17 with the same Kubernetes network setup, though.

Have a nice day,

thunerbl commented 4 years ago

Okay, I've managed to connect with an externalDB; Nextcloud 19 seems to install and function pretty well with the PVC enabled. Maybe this error is SQLite-related.

chrisingenhaag commented 4 years ago

Hi guys, I already checked this. We're using a fixed fsGroup for the apache and the nginx containers. Because Nextcloud copies files around via rsync on startup, it relies on valid permissions on the volumes.

But in my case the user id and groups on my NFS client mount are different. My logs show permission denied errors.

I see two possible solutions.

For the moment I would tend to go for the sidecar possibility, so that you can handle volume permissions yourselves.

Best

somerandow commented 4 years ago

I seem to run into permission errors even when the NFS mount is owned by www-data. I have tried manually editing the securityContext to set fsGroupChangePolicy, and this didn't seem to resolve the issue either. I'll dive in a bit more and test whether a sidecar or init container could set the permissions correctly.
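For reference, the knobs being discussed here are standard pod-level securityContext fields. A hedged sketch of the kind of configuration being tried (assumptions: `fsGroupChangePolicy` is only available behind a feature gate before Kubernetes v1.20, and kubelet-driven fsGroup ownership changes generally do not take effect on NFS volumes, since the NFS server applies its own ID mapping):

```yaml
# Sketch only: standard Kubernetes pod securityContext fields. Where these
# land in the chart's values.yaml depends on the chart version.
securityContext:
  fsGroup: 33                            # www-data in the official nextcloud image
  fsGroupChangePolicy: "OnRootMismatch"  # beta from Kubernetes v1.20
```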

somerandow commented 4 years ago

I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and the way rsync checks for copied files. cp was slightly faster, but the real fix was enabling the async option on the NFS export, at least for the initial install (I had only been adding this option to the PersistentVolume). This took the time to initialize Nextcloud down from >15 minutes to just under 10 seconds.

I plan to test next whether the permissions are still an issue.

J3m5 commented 4 years ago

I'm experiencing the same problem, I tried to change the securityContext params but that didn't solve the problem...

davad commented 4 years ago

I think I'm having the same issue:

  1. the container is being periodically restarted
  2. the only output to the log is "Initializing nextcloud 19.0.3.1 ..."
  3. the PVC is automatically created from my NFS storage class

I'll try adding the async option to the host and PV, then report back.

Edit: having trouble adding async to my NFS server because of the storage class provider I'm using.

unixfox commented 4 years ago

@WojoInc Could you explain how you changed the NFS export options?

sOblivionsCall commented 4 years ago

also looking for guidance here; I'm seeing a permission issue that I'm not sure has an easy fix, as I'm also using an nfs-provisioner

kubectl logs nextcloud-7969756654-7j9xh --tail 50 -f
Initializing nextcloud 19.0.4.2 ...
Upgrading nextcloud from 17.0.0.9 ...
Initializing finished
Console has to be executed with the user that owns the file config/config.php
Current user: www-data
Owner of config.php: root
Try adding 'sudo -u root ' to the beginning of the command (without the single quotes)
If running with 'docker exec' try adding the option '-u root' to the docker command (without the single quotes)

I would go change the default permissions on the NFS export, but all other pods using NFS would then run into issues. Previously you discussed options to change the storage owner via a sidecar or fsGroupChangePolicy. Can you please expand on how this is accomplished?
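One concrete shape of the sidecar/init-container idea asked about here is an init container that chowns the volume before Nextcloud's rsync runs. A hedged sketch (the volume name `nextcloud-data` is a placeholder; take the real name from your rendered pod spec):

```yaml
initContainers:
  - name: fix-volume-perms
    image: busybox:1.36
    # 33:33 is www-data in the official nextcloud image
    command: ["sh", "-c", "chown -R 33:33 /var/www/html"]
    volumeMounts:
      - name: nextcloud-data   # placeholder; match your pod spec
        mountPath: /var/www/html
```

Note that this only works if the NFS export lets root chown files, i.e. it is exported with no_root_squash.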

sundowndev commented 4 years ago

I have the same issue, and the container does not contain any log file. Any workaround for this?

EDIT: the issue appears to come from the livenessProbe delay being too low; the initialization does not have time to finish. Disabling both livenessProbe and readinessProbe worked for me (Nextcloud 19-apache):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
Janl1 commented 3 years ago

> I seem to have resolved the performance issues around the use of NFS. Rsync was being forced to use synchronous writes due to NFS default behavior and how rsync checks for copied files. CP was slightly faster, but the real fix was enabling the async option on the NFS export (I had only been adding this option to the Persistent Volume), at least for the initial install. This took the time to initialize nextcloud down from >15 mins to just under 10 seconds.
>
> I plan to test the permissions if the permissions are still an issue now.

@WojoInc Are you using the nextcloud helm chart with replication set to e.g. 3?

mikeyGlitz commented 3 years ago

I'm using the following configuration on the helm chart using terraform to set up the release:

resource "kubernetes_namespace" "ns_files" {
  metadata {
    name = "files"
  }
}

resource "helm_release" "rel_files_cloud" {
  repository = "https://nextcloud.github.io/helm/"
  name="cloudfiles"
  chart = "nextcloud"
  namespace="files"

  values = [
      <<YAML
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: traefik
            cert-manager.io/cluster-issuer: cluster-issuer
            traefik.ingress.kubernetes.io/redirect-entry-point: https
            traefik.frontend.passHostHeader: "true"
          tls:
            - hosts:
              - files.haus.net
              secretName: nextcloud-app-tls
      YAML
   ]

  set {
    name = "nextcloud.host"
    value = "files.haus.net"
  }

  set {
      name = "nextcloud.username"
      value = "vault:secret/data/nextcloud/app/credentials#app_user"
  }
  set {
      name = "nextcloud.password"
      value = "vault:secret/data/nextcloud/app/credentials#app_password"
  }
  set {
      name = "mariadb.enabled"
      value = "true"
  }
  set {
      name = "mariadb.db.password"
      value = "vault:secret/data/nextcloud/db/credentials#db_password"
  }
  set {
      name = "mariadb.db.user"
      value = "vault:secret/data/nextcloud/db/credentials#db_user"
  }
  set {
      name = "mariadb.master.persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "mariadb.master.annotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
  set {
      name = "persistence.enabled"
      value = "true"
  }
  set {
      name = "persistence.storageClass"
      value = "nfs-client"
  }
  set {
      name = "persistence.size"
      value = "2.5Ti"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-addr"
      value = "https://vault.vault-system:8200"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-tls-secret"
      value = "vault-cert-tls"
  }
  set {
      name = "podAnnotations.vault\\.security\\.banzaicloud\\.io/vault-role"
      value = "default"
  }
}

I end up with the following log for nextcloud:

time="2020-12-15T23:02:34Z" level=info msg="received new Vault token" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="initial Vault token arrived" app=vault-env
time="2020-12-15T23:02:35Z" level=info msg="spawning process: [/entrypoint.sh apache2-foreground]" app=vault-env
Initializing nextcloud 19.0.3.1 ...

I checked the nfs-client-provisioner and noticed that the folders have the following permissions:

/mnt/external/files-cloudfiles-nextcloud-nextcloud-pvc-646eb797-7470-4dd3-94cc-590b9ca5a074# ll
total 36
drwxrwxrwx  9 root     root 4096 Dec 15 22:47 ./
drwxr-xr-x 13 root     root 4096 Dec 15 23:07 ../
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 config/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 custom_apps/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 data/
drwxrwxrwx  8 www-data root 4096 Dec 15 23:02 html/
drwxrwxrwx  4 root     root 4096 Dec 15 22:47 root/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 themes/
drwxrwxrwx  2 root     root 4096 Dec 15 22:47 tmp/

My /etc/exports has the following configuration

/mnt/external 192.168.0.120/32(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 172.16.0.0/29(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000) 10.42.0.0/16(rw,no_root_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)
immanuelfodor commented 3 years ago

I'm not using the Helm chart; I've just manually created a Deployment for NC with an nfs-client-provisioner volume, but I experience the same issue. In my case I moved a previous NC install to k8s, so my log output consists of the initializing line, then an upgrading line, and then it is stuck forever. Execing into the pod and running top, it seems an rsync command runs forever.

immanuelfodor commented 3 years ago

What's most disturbing is that the S and D statuses mean sleep and uninterruptible sleep, so it seems all the syncs are not doing anything. Also tried setting fsGroup to 33 but nothing changes, and the existing files are at the right permission from the previous non-k8s install I think.

root@nextcloud-55c6cb7cbd-d9cmv:/var/www/html# ps aux --width 200              
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND     
root           1  0.0  0.0   2388  1444 ?        Ss   18:45   0:00 /bin/sh /entrypoint.sh /usr/bin/supervisord -c /supervisord.conf                           
root          32  0.0  0.1 114460 12568 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          33  0.0  0.1 126596  8372 ?        S    18:45   0:00 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          34  0.2  0.0 114620  3796 ?        D    18:45   0:01 rsync -rlDog --chown www-data:root --delete --exclude-from=/upgrade.exclude /usr/src/nextcloud/ /var/www/html/                                                            
root          63  0.0  0.0   4000  3076 pts/0    Ss   18:53   0:00 bash        
root          72  0.0  0.0   7640  2664 pts/0    R+   18:54   0:00 ps aux --width 200
maxirus commented 3 years ago

I am having the same issue with v20.0.4.

immanuelfodor commented 3 years ago

I was using NFSv4.2 in my previous try. When I pinned the version to NFSv3, it got further, but then also got stuck, on an occ PHP command. The symptoms are the same: deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with the NFS and FUSE features enabled, backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I saw a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably afterwards; only restarting the whole host stabilized it again. This is both absurd and funny at once: starting NC in k8s brings down the whole hypervisor, lol 😃

unixfox commented 3 years ago

> I was using NFSv4.2 in my previous try. When I fixed the version to NFSv3, it went further, but then also stuck with an occ PHP command. Symptoms are the same, deep sleep of the PHP thread. My NFS server is in a privileged CentOS Stream LXC container in Proxmox with NFS and FUSE feature enabled and backed by a bind mount from the host. When NC was stuck on both NFSv3/v4, I've seen a kernel panic with NFS logs in it on the host, and I couldn't use NFS reliably further. Restarting the whole host helped only to stabilize it. This is both absurd and funny at once: starting NC in k8s collapsing the whole hypervisor, lol 😃

Can you also replicate huge load average numbers when running Nextcloud with NFS in a k8s cluster for at least one week? I also have to restart the node, because the load average reaches 100 for some unknown reason due to Nextcloud.

immanuelfodor commented 3 years ago

It never started up far enough to reach the web interface; it just got stuck in either the rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

unixfox commented 3 years ago

> I has never started up to get to the web interface, just stuck in either rsync phase (NFSv4.2) or an occ command (NFSv3). The CPU usage was minuscule (~1m).

CPU usage is not the only thing to check; look at the load average in htop while it's doing its thing.

immanuelfodor commented 3 years ago

It really didn't do anything; all threads were sleeping in top (S and D flags).

dentropy commented 3 years ago

I have had the same error and was able to resolve it by fixing /etc/exports. I was also using the nfs-provisioner.

My previous /etc/exports file was

/mnt/nfsdir -async,no_subtree_check *(rw,insecure,sync,no_subtree_check,no_root_squash)

I changed it to the rancher /etc/exports example and I was able to deploy nextcloud successfully.

/mnt/nfsdir    *(rw,sync,no_subtree_check,no_root_squash)
jonkerj commented 3 years ago

I've been having this issue as well. I think it's caused by a few things:

When I look at my nfsd stats (grafana/prometheus/node-exporter), there is a lot (+/- 50% of the IOPS) of GetAttr (caused by lstat syscalls) going on during the rsync. When using block-based volumes, these are served from local cache, which is magnitudes quicker.

Sure, async,noatime will improve things, and maybe even throw in NFS3, but in the end you're rsyncing a truckload of files onto an NFS share, and that's not very efficient.

I'd suggest to enable the startupProbe, and tweak the periodSeconds and failureThreshold. This is probably better than tweaking/disabling the readiness/liveness probes.
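In chart values, that suggestion looks roughly like this (a sketch; the numbers are illustrative, and the pod gets up to periodSeconds × failureThreshold to finish initializing before the liveness probe can kill it):

```yaml
startupProbe:
  enabled: true
  periodSeconds: 20
  failureThreshold: 90   # 20s x 90 = up to 30 minutes to initialize
# liveness/readiness can keep their defaults; they only start
# counting once the startup probe has succeeded
```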

danielvandenberg95 commented 3 years ago

Same issue with a Kadalu backend. I set the initial delay to a day; let's see what happens...

Edit: it took two hours to initialize.

dcardellino commented 3 years ago

Is there any solution for this issue?

I tried every suggestion with no success :(

lknite commented 2 years ago

Don't believe anyone when they tell you NFS or CIFS works with file locking. Inevitably you will experience data corruption. I recommend a solution such as Longhorn or similar in a Kubernetes environment. It will use local storage on each worker node and iSCSI behind the scenes as needed to create your PVCs.

We all start out using NFS in the Linux world, but it just doesn't support full file locking. iSCSI takes some time to learn; you might be better off using something like Longhorn and letting it do the iSCSI for you. Seriously, abandon NFS; don't waste any more of your life trying to get it to work.

I can't even begin to tell you how fast and flawlessly everything works with iSCSI, and how nice it is to have the slowness and inevitable bizarre failures of NFS behind me. Make the change. Do it, do it now. (Or just buy a network storage device that uses iSCSI.)

https://forums.plex.tv/t/roadmap-to-allow-network-share-for-configuration-data/761162

** Update: I wanted to note that I've since learned that with the right NFS-specific hardware, NFS can perform as quickly as iSCSI. Also, VMware adds some sort of protection to its NFS shares, so those actually do support full file locking. And Longhorn isn't perfect either: it uses NFS for its RWX shares (sigh), though RWO with Longhorn works. I think I'm going to switch to Rook/Ceph.

jonkerj commented 2 years ago

Locking is not the issue here, it's the fact that lstat is not served by a local FS or cache.

I think both NFS and block based solutions have their place, even in a Kubernetes context, and both come with their unique advantages and problems. In this (specific) case I totally agree with you: a block based solution will not have this problem.
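For those staying on NFS, client-side attribute caching can blunt the GETATTR/lstat load somewhat. A hedged sketch of PV mountOptions (`actimeo` is a standard Linux NFS client mount option; higher values trade attribute freshness for fewer round-trips):

```yaml
mountOptions:
  - noatime
  - nfsvers=4.2
  - actimeo=60   # cache file attributes for 60s, cutting lstat-driven GETATTR traffic
```

This does not change the server-side sync/async behavior discussed elsewhere in this thread.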

devent commented 2 years ago

It's a permission issue I think. The pod fails with:

rsync: [receiver] chown "/var/www/html/resources/config/.mimetypealiases.dist.json.bYpaGG" failed: Operation not permitted (1)
rsync: [receiver] chown "/var/www/html/resources/config/.mimetypemapping.dist.json.ChHk9F" failed: Operation not permitted (1)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]

All files are synced, but because rsync can't do the chown it returns a non-zero exit code.

Y0ngg4n commented 2 years ago

Initialization took 30 minutes for me :(

pcgeek86 commented 2 years ago

Same problem here. I tried to deploy Nextcloud as a container running on AWS Fargate. I attached an Amazon Elastic Filesystem (EFS) mount to the Fargate task, and mounted it to /var/www/html/. According to CloudWatch Logs, the only output from the container is:

Initializing nextcloud 21.0.8.3 ...


I was hoping to connect Nextcloud to an S3-compatible storage provider, so I could share out some images through a user-friendly web front-end. Never used Nextcloud before, so kinda disappointed that it doesn't "just work." Granted, it's free, so I'm thankful for that. :)

Note: Since I'm using Fargate, I'm obviously not using Kubernetes / Helm to deploy.

💥 Edit: pfffffft ... of course, literally right as I posted this, initialization completed. It took almost 7 minutes! Yikes, lol. But at least it is running now!


Y0ngg4n commented 2 years ago

New highscore....No initialization after 4 days....seems to not work on nfs

legolego621 commented 2 years ago

Hello everybody! I had this problem and resolved it by enabling async mode on the NFS server.

# cat /etc/exports
/mnt/nfs-storage   192.168.2.0/23(rw,async,no_subtree_check,no_root_squash)

# exportfs -a

As a result, the initialization took one and a half minutes, against 15-20 minutes with sync mode enabled.

ATTENTION! As far as I understand, the difference between the NFS server's sync and async modes is that with sync the server must commit each write to stable storage before replying, while with async it replies immediately and writes the data out later.

Because Nextcloud generates a very large number of small files, this is a problem in sync mode.

As a result, I gained performance even after initialization; my download speed was 40% faster.

Y0ngg4n commented 2 years ago

@legolego621 that's a good workaround to know about, but it is not advised to use the async option on ZFS, for example.

legolego621 commented 2 years ago

> @legolego621 thats a good to know workaround but it is not advised to use async option on zfs for example.

Yes; this is because data corruption can theoretically occur in transit. This must be taken into account.

Y0ngg4n commented 2 years ago

@legolego621 exactly

5cat commented 2 years ago

I think the issue here is that NFS gets bottlenecked by the large number of files rsync is copying. I tried to debug it, since the log output is kinda useless; the startup is stuck at this line in entrypoint.sh, and I have confirmed it with ps aux inside the container:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www-data      1  0.0  0.0   2420  1552 ?        Ss   04:46   0:00 /bin/sh /entrypoint.sh apache2-foreground
www-data     20  0.5  0.0  76012  9884 ?        S    04:46   0:00 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr
www-data     21  1.4  0.0 102172  3996 ?        S    04:46   0:01 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr
www-data     22  3.2  0.0 148916  5076 ?        D    04:46   0:02 rsync -rlD --delete --exclude-from=/upgrade.exclude /usr/sr

So I can confirm @immanuelfodor's findings here.

I have two storage classes from my volume provisioner: one for large, slow HDD storage, which I initially tried to use, and another for small, fast SSD storage.

The issue was fixed for me when I switched to mounting the volume on the SSDs over NFS. It is a known issue that rsync can be slow for a lot of small files over NFS, since it needs to do a lot of synced iops. I was going to spend time trying to optimize that rsync line in entrypoint.sh, but luckily the SSDs saved me the time.

kashapovd commented 2 years ago

Hi guys! I recently faced the same problem. I tried to deploy Nextcloud with replicas > 1, but it seems that all the initializing Nextcloud replicas together create a big load on the storage. The solution was to deploy Nextcloud with only one replica and, once it had successfully started, increase replicas to the desired amount. You can also disable the probes on the first start (while it is initializing):

--set livenessProbe.enabled=false --set readinessProbe.enabled=false
piecko commented 1 year ago

Hi, AFAIK the only permanent solution would be to use an init container for updating/migrating the database and everything else. Probes do not apply to initContainers, and k8s will not kill them automatically. This should avoid the problem in all cases, without the need for temporary workarounds.
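A minimal sketch of that idea, assuming only the file-sync step (the rsync quoted earlier in this thread) is moved into an init container; the real entrypoint also does version checks and upgrade handling, so this is not a complete replacement:

```yaml
initContainers:
  - name: pre-sync
    image: nextcloud:19-apache   # must match the main container's image
    # the same rsync the entrypoint runs; with the files already in place,
    # the main container's own sync then finishes quickly
    command: ["rsync", "-rlD", "--delete", "--exclude-from=/upgrade.exclude",
              "/usr/src/nextcloud/", "/var/www/html/"]
    volumeMounts:
      - name: nextcloud-data   # placeholder; match your rendered pod spec
        mountPath: /var/www/html
```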

jessebot commented 1 year ago

Perhaps it would be helpful if someone created a PR explaining this in a section of the README? I or someone else with permissions would be happy to review it! :) There are a few things that come up around NFS, so it would be nice to improve our documentation on that to help others.

pchang388 commented 1 year ago

I stumbled upon this thread and got things working (so far; it's early still) with multiple replicas (3), and fixed the permission issue for the most part with some small workarounds.

My backend storage is a Synology NAS (my first NAS, purchased very recently) running an NFS server, and one of the issues I see in here is similar to what I'm seeing: permissions. Synology only has very cookie-cutter options for NFS; from what I've learned so far, it doesn't allow plain/simple mappings of UID/GID or any real customization. I hear it's possible through other means (Kerberos), but I didn't get that far yet.

So in order to run multiple replicas with this specific helm chart, from what I understand you need:

  1. Redis Cluster
  2. RWX backend storage for at least the data dir
  3. External DB (Postgres for me)
  4. LB proxy in front of all the replicas (k8 service should do this in default round robin I believe)

The permission issues arise (at least in my case) due to:

  1. the www-data user running as UID/GID 33 and also doing an rsync with the chown flag, as mentioned a few times in this thread
    • So essentially it is copying over some init files for it to run, and chowning them
  2. Synology only allowing basic options like "map to admin/guest uid and gid" and "no mapping" (essentially, don't squash the UID/GID into admin or guest)
    • This is the big hindrance for this specific deployment method (k8s) of Nextcloud. Docker and docker-compose let you specify a user (e.g. docker run --user, or the user: key in docker-compose.yaml). Kubernetes doesn't really have the same option, I believe; most similar issues are solved in k8s by initContainers, lifecycle hooks, or securityContexts, or by setting a UID/GID via env (if the image supports it), or some combination of these.
  3. Since the UID/GID of www-data cannot be changed in this style of deployment, I have to let it pass through as 33 into Synology, or enable squashing. Squashing is an issue because of the chown: it's not going to work, and you will see permission denied.
    • So how do I get 33 to map over properly?

As mentioned, this works for me due to Synology NAS NFS being a key part of the problem so YMMV. It also works partially because rsync does not work on NFS well so initializing phase can take a long time.

NOTE: the original fix (further down) was not consistent enough for me during restarts/updates, so I am going to try the alternative route mentioned in that section until the docker image/entrypoint allows us to specify the UID/GID directly via env vars or similar means. I will cross it out but leave it up in case anyone wants to take it further than I did.**

The alternative I am going to try is similar to this: https://github.com/nextcloud/docker/issues/359#issuecomment-1154170151 but slightly modified, since GID 100 already exists in the nextcloud image. I just have to assign www-data GID 100 as its primary group instead of changing the group itself, example: **

FROM nextcloud:23.0.5-fpm-alpine
# note: usermod comes from the shadow package, which the alpine base image does not ship
RUN apk add --no-cache shadow && \
    usermod -g users www-data && \
    usermod --uid 1028 www-data

Section below that kind of worked, but it was not consistent enough and causes issues; leaving it up for anyone who wants to give it a read anyway.

What I ended up doing was:

  1. (Optional) In Synology, create a user to own the folders in the NFS share we will create
  2. In Synology, create a shared folder and enable NFS share in it
    • I have two NFS shares enabled: one that has squashing and another that does not. In this case, this one should not have any squashing. In the mappings you should leave it as "no mapping". I also enabled async, and the other checkmark options for my specific environment. Also ensure you give permissions via IP/hostname in the settings.
  3. Give the user permissions to r/w to the NFS share
  4. ssh into the Synology NAS and become root
  5. Create a subdirectory in the NFS share, example (nextcloud dir): /volume1/k3s-nosquash-share/nextcloud
  6. Change the owner of the folder to the user you created; cat /etc/passwd, then: chown -R 1028:100 nextcloud (100 is the users group, which every user you create is in by default). Replace the UID with that of the user you created.
  7. I'm using the nfs-subdir provisioner and configured a specific storage class just for nextcloud, example:

     nfs:
       server: something.something
       path: /volume1/k3s-nosquash-share/nextcloud
       mountOptions:
         - nfsvers=4.1 # version synology runs on (at least mine does)
       volumeName: nfs-subdir-nosquash-nextcloud
       # Reclaim policy for the main nfs volume
       reclaimPolicy: Retain
  8. Reference that storage class in the values.yaml file
  9. Add a postStartHook. I don't believe initContainers would help with the permission issues in my specific scenario with Synology. You could try to modify the UID/GID with an initContainer somehow, by mounting /etc/passwd and /etc/group of the nextcloud pod, but I opted for this approach for simplicity, to not mess with the chart too much:

     # Allow configuration of lifecycle hooks
     # ref: https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
     lifecycle:
       # not guaranteed to run before the entrypoint starts/finishes (it's async), but let's see? It should, due to the rsync job in entrypoint.sh
       postStartCommand: ["/bin/bash","-c","sed -i 's/33:33/1028:100/g' /etc/passwd && sed -i 's/:33:/:100:/g' /etc/group"]

    • Basically a simple replacement of the UID with the one that maps to the Synology user I created
    • This enabled me to run multi-replica, so far
  10. Turn on the startupProbe and give it 2+ minutes of delay, depending on your env; this gives rsync time to finish without the liveness/readiness probes starting until that time has elapsed:

      startupProbe:
        enabled: true
        initialDelaySeconds: 120  # change this
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 30
        successThreshold: 1

I typed a lot already, so I'm just going to add my unfiltered notes, which add more detail. I tried A LOT of things; the other viable option, which is kind of annoying but I think simple, is to build your own image on top that modifies the UID/GID, example: https://github.com/nextcloud/docker/issues/359#issuecomment-1154170151

NOTES:

  • What about postStartHook?
    • the entrypoint runs an rsync, so it could take some time; postStartHook starts async when the container is done initializing but not yet running
    • so it could run before the entrypoint does anything??
    • And instead of changing the user's group to a new groupid, we can change it to the existing 100 (users) group
          testt@6fa28ea33de4:/var/www/html$ id
          uid=1003(testt) gid=100(users) groups=100(users)
    • Not sure this will work, you need to start a new shell for uid/gid changes to reflect?
      postStartCommand: ["/bin/sh","-c","usermod -g users www-data && usermod --uid 1028 www-data"]
      • Yep, it failed due to not reflecting (a user usually needs to log out and back in for it to take effect, i.e. a new shell is needed)
      • What about changing /etc/passwd and /etc/group?
        postStartCommand: ["/bin/bash","-c","sed -i 's/33:33/1028:100/g' /etc/passwd && sed -i 's/:33:/:100:/g' /etc/group"]
    • It worked and is instant. The only issue is that a few initial folders got created during the rsync before the change went through
    • I ssh into NAS and did a chown -R for all subdirs
      ash-4.4# chown -R 1028:100 nextcloud/
      ash-4.4# cd nextcloud/nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650/
      ash-4.4# ls -lrt
      total 0
      drwxrwxrwx 1 k3s-nextcloud users 76 Feb  6 03:45 nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650
      ash-4.4# cd nextcloud-nextcloud-nextcloud-pvc-59b09ff6-09f7-4e9f-bfc5-0aa9119dc650/
      ash-4.4# ls -lrt
      total 0
      drwxrwxrwx 1 k3s-nextcloud users   0 Feb  6 03:45 custom_apps
      drwxrwxrwx 1 k3s-nextcloud users   0 Feb  6 03:45 tmp
      drwxrwxrwx 1 k3s-nextcloud users  14 Feb  6 03:45 root
      drwxrwxrwx 1 k3s-nextcloud users  26 Feb  6 03:48 themes
      drwxrwxrwx 1 k3s-nextcloud users 468 Feb  6 03:48 html
      drwxrwxrwx 1 k3s-nextcloud users 346 Feb  6 03:48 config
      drwxrwxrwx 1 k3s-nextcloud users 152 Feb  6 03:48 data
    • I restarted the nc pods to make sure it would behave okay after restarts when the postStart ran again
    • seems okay so far, no files owned by root/other created and no permissions errors yet
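To illustrate what that postStartCommand actually does to /etc/passwd, here is a hypothetical standalone demo of the same sed rewrite, run against a scratch copy rather than the real file (1028:100 are my Synology uid:gid, substitute your own):

```shell
# Safe demo of the /etc/passwd rewrite performed by the postStartCommand.
tmp=$(mktemp)
echo 'www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin' > "$tmp"
# Same substitution as the hook: swap www-data's 33:33 for 1028:100
sed -i 's/33:33/1028:100/g' "$tmp"
cat "$tmp"   # -> www-data:x:1028:100:www-data:/var/www:/usr/sbin/nologin
```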

As mentioned, this helps my use case but I hope it helps some others too. If anyone wants to see my yaml files for redis setup or other things, I can also provide that.

pchang388 commented 1 year ago

With the synology NAS situation I described above, I still had trouble with the docker image method due to the docker entrypoint script.

Dockerfile I'm using to fix permissions for synology:

# https://stackoverflow.com/questions/60450479/using-arg-and-env-in-dockerfile
ARG PLATFORM=linux/amd64
ARG IMAGE=nextcloud
ARG TAG=25.0.3-apache

## Due to M1 MBpro issues, build it for amd64 linux instead of arm64 which M1 mbpro uses
## I guess by default it will offer the image subversion that fits your cpu arch
## https://stackoverflow.com/questions/73398714/docker-fails-when-building-on-m1-macs-exec-usr-local-bin-docker-entrypoint-sh
FROM --platform=${PLATFORM} ${IMAGE}:${TAG}

RUN usermod -g users www-data && \
    usermod --uid 1028 www-data

RUN mkdir -p /var/www/html/config && \
    chown -R 1028:100 /var/www && \
    chmod -R 750 /var/www

The entrypoint that nextcloud uses creates and populates the /var/www/html/config folder as the root user. This means we end up with this scenario even after the fix:

root@nextcloud-79855df575-9pjr8:/var/www/html# ls -lrt
total 160
drwxrwsr-x  2 www-data www-data  4096 Feb  7 08:57 custom_apps
drwxrwsr-x  2 root     www-data  4096 Feb  7 08:57 config

root@nextcloud-79855df575-zmkdn:/var/www/html/config# ls -lrt
total 32
-rw-r--r-- 1 root www-data  158 Feb  7 10:37 trusted-domains.config.php
-rw-r--r-- 1 root www-data  668 Feb  7 10:37 smtp.config.php
-rw-r--r-- 1 root www-data  329 Feb  7 10:37 redis.config.php
- As discussed in my previous comment, initContainers probably won't work here since the volume is populated by the entrypoint itself, and init containers run before the main nc container
- postStartHook is not consistent enough
- securityContext settings like fsGroup won't work either, since the entrypoint changes the volume after it is mounted.

Nextcloud will fail from what appears to be another permission issue:

"Hint":"Configuration was not read or initialized correctly, not overwriting /var/www/html/config/config.php"

So for Synology NAS NFS users, I think the best way to do this (especially for those who want simplicity/uptime) is to simply use docker/docker compose (where you can specify the user and group ids) and mount the NFS share on the hosts themselves. Having to keep adding workarounds and adjusting other systems to fit the limitations of this docker image/entrypoint (not being able to provide a uid/gid) in a k8s environment is not sustainable in the long run, I think.
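For reference, a minimal docker compose sketch of that approach might look like this (image tag, host mount path, and the 1028:100 uid:gid are assumptions from my setup; note that running the official image with a fixed `user:` skips some of the entrypoint's root-only logic):

```yaml
# docker-compose.yml sketch: NFS share is mounted on the host
# (e.g. via /etc/fstab) at /mnt/nextcloud before starting the container.
services:
  nextcloud:
    image: nextcloud:25.0.3-apache
    user: "1028:100"          # Synology uid:gid, so files get the right owner
    ports:
      - "8080:80"
    volumes:
      - /mnt/nextcloud:/var/www/html
```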

The issue with the nextcloud image appears to be that the entrypoint does the bulk of the initializing for nextcloud. Providing a uid/gid during the docker build phase would probably solve it, but users would have to build the image locally every time. Preferred way: you could probably have env variables provided to the container (by the pod env spec) for UID/GID and have the entrypoint script use them properly, for example: https://github.com/nextcloud/docker/blob/master/25/apache/entrypoint.sh

## original
if [ "$(id -u)" = 0 ]; then
    rsync_options="-rlDog --chown $user:$group"
else
    rsync_options="-rlD"
fi

## proposed? - use a custom uid and gid when provided
if [[ -n "${CUSTOM_UID}" && -n "${CUSTOM_GUID}" ]]; then
    rsync_options="-rlDog --chown $CUSTOM_UID:$CUSTOM_GUID"
elif [ "$(id -u)" = 0 ]; then
    rsync_options="-rlDog --chown $user:$group"
else
    rsync_options="-rlD"
fi
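Here is that proposed logic as a standalone function so it can be exercised outside the image; CUSTOM_UID/CUSTOM_GUID are hypothetical variables, not something the official entrypoint reads today:

```shell
# Sketch of the proposed rsync-options selection, POSIX-sh compatible.
choose_rsync_options() {
    if [ -n "${CUSTOM_UID:-}" ] && [ -n "${CUSTOM_GUID:-}" ]; then
        # custom ids win even when running as root
        echo "-rlDog --chown ${CUSTOM_UID}:${CUSTOM_GUID}"
    elif [ "$(id -u)" = 0 ]; then
        echo "-rlDog --chown $user:$group"
    else
        echo "-rlD"
    fi
}

CUSTOM_UID=1028 CUSTOM_GUID=100 choose_rsync_options
# -> -rlDog --chown 1028:100
```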

I was hoping to have backend storage separate for the nc data and be able to scale workers in k8s easily, but I do not think it is ready/stable enough for Synology NAS users at this time. But if someone wants to give it a shot, I hope my information helps them.

BloodStainedCrow commented 1 year ago

Had the same issue spinning up a container on kubernetes with a pvc via NFS on a TrueNAS-Scale NAS. Setting the dataset in question to force async fixed the freezing.

asosnovsky commented 1 year ago

have the same issue with this running on k3s with a NAS export from TrueNAS Scale and using the nfs-subdir-external-provisioner. Debugging it further, it seems there are some rsync commands trying to copy around 500M that are taking way too long to finish.

[screenshot: process list showing three identical rsync processes]

not sure why the container has 3 instances of rsync that seem to do the same thing, maybe that can be causing the slowdown?

Chiloy commented 1 year ago

Had the same issue spinning up a container on kubernetes with a pvc via NFS on a TrueNAS-Scale NAS. Setting the dataset in question to force async fixed the freezing.

hi, I had the same issue and use TrueNAS. Can you tell me how to set async on TrueNAS? thanks

BloodStainedCrow commented 1 year ago

@Chiloy Find the dataset in your Datasets panel. On the right of Dataset Details there is an Edit button for the dataset. There you can find the setting Sync which I set to Disabled. Though if you don't have to, doing that is not advisable, since it can lead to lost data in case of sudden power loss.
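For anyone who prefers the CLI, the same TrueNAS setting maps to the ZFS `sync` property on the dataset (the dataset name below is a placeholder; the same data-loss caveat on sudden power loss applies):

```shell
# Disable synchronous writes on the dataset backing the NFS export
zfs set sync=disabled tank/nextcloud
# Verify the property took effect
zfs get sync tank/nextcloud
```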

Chiloy commented 1 year ago

@Chiloy Find the dataset in your Datasets panel. On the right of Dataset Details there is an Edit button for the dataset. There you can find the setting Sync which I set to Disabled. Though if you don't have to, doing that is not advisable, since it can lead to lost data in case of sudden power loss.

thanks, but I had a new problem. The prompt is 'Configuration was not read or initialized correctly, not overwriting /var/www/html/config/config.php'. It looks like a matter of permissions; I don't set podSecurityContext in the helm chart. Can you help me look at the problem?

truenas dir permissions:

      root@truenas[~]# ls -al /mnt/data
      total 34
      drwxr-xr-x  5 root     wheel      5 Apr 18 17:29 .
      drwxr-xr-x  3 root     wheel    128 Mar 28 17:08 ..
      drwxrwxrwx 25 k8s      k8s       32 Apr 20 18:46 k8s
      drwxr-xr-x  5 www-data www-data   5 Apr 19 18:08 share
      drwxr-xr-x  2 root     wheel      2 Mar 28 17:09 work
      root@truenas[~]# id k8s
      uid=1000(k8s) gid=1000(k8s) groups=1000(k8s),545(builtin_users),1001(www-data)

truenas nfs server: mapall user: root, mapall group: wheel

containerd logs:

      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html# ls -al
      total 231
      drwxrwxrwx 15 www-data www-data    30 Apr 21 01:47 .
      drwxrwxrwx  4 root     1000        4 Apr 21 01:46 ..
      -rw-r--r--  1 www-data www-data  3256 Apr 21 01:46 .htaccess
      -rw-r--r--  1 www-data www-data   101 Apr 21 01:46 .user.ini
      drwxr-xr-x 45 www-data www-data    52 Apr 21 01:46 3rdparty
      -rw-r--r--  1 www-data www-data 19327 Apr 21 01:46 AUTHORS
      -rw-r--r--  1 www-data www-data 34520 Apr 21 01:46 COPYING
      drwxr-xr-x 50 www-data www-data    50 Apr 21 01:47 apps
      drwxrwxrwx  2 root     1000       11 Apr 21 01:47 config
      -rw-r--r--  1 www-data www-data  4095 Apr 21 01:46 console.php
      drwxr-xr-x 24 www-data www-data    30 Apr 21 01:47 core
      -rw-r--r--  1 www-data www-data  6317 Apr 21 01:46 cron.php
      drwxrwxrwx  2 www-data www-data     2 Apr 21 01:46 custom_apps
      drwxrwxrwx  2 www-data www-data     3 Apr 21 01:47 data
      drwxr-xr-x  2 www-data www-data   168 Apr 21 01:47 dist
      -rw-r--r--  1 www-data www-data   156 Apr 21 01:46 index.html
      -rw-r--r--  1 www-data www-data  3456 Apr 21 01:46 index.php
      drwxr-xr-x  6 www-data www-data     9 Apr 21 01:47 lib
      -rw-r--r--  1 root     1000        0 Apr 21 01:46 nextcloud-init-sync.lock
      -rwxr-xr-x  1 www-data www-data   283 Apr 21 01:46 occ
      drwxr-xr-x  2 www-data www-data     3 Apr 21 01:47 ocm-provider
      drwxr-xr-x  2 www-data www-data     5 Apr 21 01:47 ocs
      drwxr-xr-x  2 www-data www-data     3 Apr 21 01:47 ocs-provider
      -rw-r--r--  1 www-data www-data  3139 Apr 21 01:46 public.php
      -rw-r--r--  1 www-data www-data  5549 Apr 21 01:46 remote.php
      drwxr-xr-x  4 www-data www-data     8 Apr 21 01:47 resources
      -rw-r--r--  1 www-data www-data    26 Apr 21 01:46 robots.txt
      -rw-r--r--  1 www-data www-data  2452 Apr 21 01:46 status.php
      drwxrwxrwx  3 www-data www-data     4 Apr 21 01:47 themes
      -rw-r--r--  1 www-data www-data   384 Apr 21 01:47 version.php
      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html# cd config/
      root@nextcloud-ops-prod-696fbb9d68-5zwl2:/var/www/html/config# ls -al
      total 50
      drwxrwxrwx  2 root     1000       11 Apr 21 01:47 .
      drwxrwxrwx 15 www-data www-data   30 Apr 21 01:47 ..
      -rw-r--r--  1 root     www-data  261 Apr 21 01:46 .htaccess
      -rw-r--r--  1 root     www-data   59 Apr 21 01:46 apache-pretty-urls.config.php
      -rw-r--r--  1 root     www-data   69 Apr 21 01:46 apcu.config.php
      -rw-r--r--  1 root     www-data  376 Apr 21 01:46 apps.config.php
      -rw-r--r--  1 root     www-data 1102 Apr 21 01:46 autoconfig.php
      -rw-r--r--  1 root     1000        0 Apr 21 01:47 config.php
      -rw-r--r--  1 root     www-data  329 Apr 21 01:46 redis.config.php
      -rw-r--r--  1 root     www-data   63 Apr 21 01:46 rewriter.config.php
      -rw-r--r--  1 root     www-data  668 Apr 21 01:46 smtp.config.php

Rehtard commented 1 year ago
  1. Since the uid/guid www-data cannot be changed in this style of deployment, I have to let it pass through with 33 into Synology or enable squashing. Squashing is an issue because of the chown and that's not going to work and you will see permission denied.

    • So how do I get 33 to map over properly?

I have the same issue, except that I am using a WD MyCloud EX2 where I can access the /etc/exports file and modify the guid and uid. I changed both to 33 but this didn't solve the problem. Do you maybe have any idea what else I could try?
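In case it helps, on a stock Linux NFS server the equivalent squash-everything-to-33 setup would be an /etc/exports entry like the one below (path and client range are placeholders, and WD's firmware may generate this file differently; run `exportfs -ra` after editing):

```
/nfs/nextcloud 192.168.1.0/24(rw,sync,no_subtree_check,all_squash,anonuid=33,anongid=33)
```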

Also the config folder and some other files still have root as the owner while almost every other file is owned by www-data.

mddeff commented 1 year ago

Just adding another datapoint

2023-08-05T01:04:36.724302556-04:00 Initializing nextcloud 27.0.1.2 ...
2023-08-05T01:09:08.543902901-04:00 New nextcloud instance

Running at about 4.5 minutes for me. NFS backed by 4 disk "raid 10" topology on zfs. Intel Datacenter Flash.