nextcloud / helm

A community maintained helm chart for deploying Nextcloud on Kubernetes.

Nextcloud fails to initialize when using a claim with "accessMode: ReadWriteMany" for primary persistence. #399

Open ScionOfDesign opened 1 year ago

ScionOfDesign commented 1 year ago

Describe your Issue

When trying to enable persistence for Nextcloud, the container hangs when attempting to use my own existing PVCs.

Logs and Errors

The container hangs with the following log:

Configuring Redis as session handler
Initializing nextcloud 26.0.1.1 ...

Describe your Environment

My persistence section:

  persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    existingClaim: nextcloud-html-data-claim

    nextcloudData:
      enabled: true
      existingClaim: nextcloud-user-data-claim

Additional context, if any

It works fine if I comment out the usage of existing claims.

ScionOfDesign commented 1 year ago

The same issue apparently occurs if I try to use my own storage classes and configuration.

persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    #existingClaim: nextcloud-html-data-claim
    storageClass: longhorn-nvme
    accessMode: ReadWriteMany
    size: 8Gi

    nextcloudData:
      enabled: true
      storageClass: longhorn-block
      accessMode: ReadWriteMany
      size: 10Gi
      #existingClaim: nextcloud-user-data-claim
ScionOfDesign commented 1 year ago

It seems that the issue is with the accessMode of the primary persistent volume. It cannot be ReadWriteMany. This works:

  persistence:
    # Nextcloud Data (/var/www/html)
    enabled: true
    #existingClaim: nextcloud-html-data-claim
    storageClass: longhorn-block
    #accessMode: ReadWriteOnce
    #size: 8Gi

    nextcloudData:
      enabled: true
      storageClass: longhorn-block
      accessMode: ReadWriteMany
      #size: 10Gi
      #existingClaim: nextcloud-user-data-claim
ScionOfDesign commented 1 year ago

The issue seems to be related to https://github.com/nextcloud/helm/issues/10. Disabling the probes worked. I will continue to investigate.
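
For reference, disabling the probes in values.yaml looks roughly like this (just a sketch; the key names are taken from the chart's defaults, so double-check them against the chart's values.yaml):

livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
startupProbe:
  enabled: false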

jgrossmac commented 1 year ago

Having the same issue using my own storage class as well. Disabling probes doesn't help, as the server seems to have trouble starting.

boomam commented 1 year ago

Same issue for me too. The only difference disabling the probes made for me is that the pod is now 'running', but the same issue persists: Initializing nextcloud 26.0.2.1 ...

jessebot commented 1 year ago

hmmm, for those having this issue, could you let me know if there are any Events listed when you do a:

# replace $NEXTCLOUD_POD with your actual pod name
kubectl describe pod $NEXTCLOUD_POD

Similarly, do the existing claims have any Events when you run a describe?

# replace $NEXTCLOUD_PVC with your actual pvc name
kubectl describe pvc $NEXTCLOUD_PVC

Also, does the status show Pending there for the PVC?

@tvories or @provokateurin have you tried using ReadWriteMany PVCs with the nextcloud container before? I tried to use longhorn at one point, but couldn't get it working, and assumed it was because I misconfigured longhorn, so I gave up and went back to the local path with k3s 🤔 We'd need to have ReadWriteMany working in order to support multiple pod replicas across multiple nodes accessing the same PVC, but I'm unsure what's currently blocking that....

Anyone else in the community who has knowledge on this is also welcome to give input :)

provokateurin commented 1 year ago

I have used this chart with Longhorn in the past, but it was RWO IIRC.

tvories commented 1 year ago

I am using ReadWriteMany on an NFS mount for my primary Nextcloud storage and have been for a very long time:

@ScionOfDesign can you paste your PVC values? My guess is that this is a Longhorn configuration issue.

I am using existingClaim for my pvc rather than having the chart create it.

# pvc.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-nfs-config
spec:
  storageClassName: nextcloud-nfs-config
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /mnt/fatguys/k8s/nextcloud
    server: ${SECRET_NAS1}
  mountOptions:
    - nfsvers=4.1
    - tcp
    - intr
    - hard
    - noatime
    - nodiratime
    - rsize=1048576
    - wsize=1048576

# values.yaml
    persistence:
      enabled: true
      accessMode: ReadWriteMany
      size: 1Mi
      existingClaim: nextcloud-nfs-config
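
The PersistentVolumeClaim that existingClaim points at isn't shown above; a minimal sketch of one that would bind to this PV (the claim name matches the value above, the namespace is an assumption) could look like:

# pvc-claim.yaml (sketch; name/namespace assumed)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-nfs-config
  namespace: nextcloud
spec:
  storageClassName: nextcloud-nfs-config
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
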
lorenzo-w commented 1 year ago

Facing the same issue with the newest version of the Helm chart. Nextcloud container is stuck at Initializing nextcloud 27.0.0.8 ... when using a Longhorn RWX volume (served via NFS).

jessebot commented 1 year ago

I'm using Longhorn now on k3s, having followed this guide, and although it's definitely slower than local path (due to longhorn's abstraction layer), it seems to be working for me right now. I'm not using NFS though :/

Here's my current values.yaml. I'm using existing claims for both nextcloud files and postgres. Here's nextcloud's pvc.

One thing I keep seeing is that the users in this thread with the failure to initialize are using two PVCs for nextcloud via persistence.nextcloudData.enabled: true. Could anyone having the issue verify a couple of things for us?

  1. does this happen if you use longhorn without NFS?
  2. does this happen if you use only one PVC (i.e. persistence.nextcloudData.enabled: false)?

To clarify, this should work with two PVCs and I'm not suggesting we don't support that. I'm just trying to narrow down the exact issue. 🤔

edit: updated links to point to a specific commit in time

christensenjairus commented 1 year ago

@ScionOfDesign Your answer worked for me. Try adding the following to your values.yaml. If I remember correctly, this allows the livenessProbe and readinessProbe more time so they don't restart the container when it's taking a while to install. If you need longer, you can raise these values.

startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50

It'll still take a while to install. I think I saw somewhere that it took nearly two hours for some poor guy. For me, though, it usually takes 10-20 minutes.
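
As a rough rule of thumb, the startup probe allows about initialDelaySeconds + failureThreshold × periodSeconds before the container gets restarted; assuming the default periodSeconds of 10, the values above give roughly 120 + 50 × 10 = 620 seconds, i.e. a bit over 10 minutes.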

jessebot commented 1 year ago

That's such a long install time though :o

mueller-tobias commented 8 months ago

Any update on this issue? We're facing the same issue and configured the startupProbe to be extremely long. But like @christensenjairus wrote, updates take 10-20 minutes.

From my POV it has something to do with the rsync that copies the files to /var/www/html, but I don't get why it is so slow in the nextcloud container during init. I have other containers using Longhorn and RWX volumes without these performance problems.

How can we help to get this issue fixed?
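
One generic way to see how fast the init copy is actually progressing (the pod name is a placeholder) is to watch the size of /var/www/html from another terminal while the pod is initializing:

# replace $NEXTCLOUD_POD with your actual pod name
kubectl exec $NEXTCLOUD_POD -- du -sh /var/www/html
# run it again a minute or two later and compare the sizes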

pfaelzerchen commented 8 months ago

It is not a Longhorn-only issue. I switched from Longhorn to rook-ceph and saw similar issues. Last weekend, I wanted to upgrade from Nextcloud 27 to Nextcloud 28. The whole process took >10 minutes, so I just disabled the probes and re-enabled them later.

During this, I saw three rsync processes working. This is rather strange, as I can normally rsync several GiB in the same time when doing backups with the same storage backend.

pfaelzerchen commented 8 months ago

> 1. does this happen if you use longhorn without NFS?

When I was using Longhorn, this problem did not appear with RWO PVCs. It was an NFS-only problem. But as I stated: I didn't have this problem with rook-ceph, except with the major release upgrade from 27 to 28.

> 2. does this happen if you use only one PVC (i.e. `persistence.nextcloudData.enabled: false`)?

Yes. I am using only one PVC; it happened with Longhorn RWX, and for the release upgrade from 27 to 28 it happened on rook-ceph with just one PVC.

MohammedNoureldin commented 7 months ago

Did anyone find the real cause of this? I mean, why does it take so long only with RWX? Or did anyone find any other (better) solution for this?

Would we have any performance issues when using RWX, or is it only an initialization issue?

I have been waiting for more than 20 minutes after adding the startupProbe, but still nothing new is shown :/

Tim-herbie commented 6 months ago

I tried to set the startupProbe like @christensenjairus, and both pods were running after a few minutes. But the performance is very poor compared to before. After a few clicks, I got an Internal Server Error and it seems that the data is broken.

For me, I don't see a way to deploy Nextcloud highly available right now. Or am I wrong about that?

My Config:
replicaCount: 2

startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50

persistence:
  enabled: true
  accessMode: ReadWriteMany
  size: 8Gi

  nextcloudData:
    enabled: true
    accessMode: ReadWriteMany
    size: 8Gi
MohammedNoureldin commented 4 months ago

@Tim-herbie, is this still the case? I mean, ignoring that after a few clicks you get an internal error.

Using Longhorn with PVC RWX (NFS) takes 20-30 minutes to initialize Nextcloud. The performance is poor.

Has anyone figured out how to resolve it by fine-tuning some magic variables?

Tim-herbie commented 4 months ago

@MohammedNoureldin

I didn't find a solution to use more than one replica. I'm using it right now for private purposes with only one.

jessebot commented 4 months ago

More than 1 replica causing issues is a different issue, and if that's the case, please search the issues for one about that, or open a second issue. This issue is specifically about accessMode. Can anyone still working on this confirm whether disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.

mueller-tobias commented 4 months ago

I think he's referring to the fact that to use more than 1 replica he needs an RWX volume, which currently isn't working well because the init/upgrade process takes so long.

jessebot commented 4 months ago

@mueller-tobias that makes sense; however, more than one replica is still a separate issue, as this issue implies that it would break even with one replica. Multiple replicas causing issues is a known issue, but I can't seem to find the last time it was brought up 🤔

MohammedNoureldin commented 4 months ago

You are right, @jessebot, I was particularly talking about RWX on NFS. Even with 1 replica on RWX NFS, the whole initialization and performance are poor.

@mueller-tobias exactly, thank you.

The issue here is obvious: creating/copying files to the RWX volume takes too long. If you observe the volume during the initialization, you will see that on every page refresh the size increases by ~2MB. So you can imagine how long it will take to reach the ~2.5GB (the estimated final initialization size).

Though the cause of the issue is not clear to me. I am not sure whether this is an issue in Nextcloud or in NFS itself; I mean, should the solution be implemented in Nextcloud, or by adapting the NFS configs? I saw people talk about turning off NFS sync, with a small risk of losing some data. Losing data is not good, which is why I am still looking for another, safer solution.

jessebot commented 4 months ago

Ok, we're on the same page then. To be sure, did you try the suggestions in https://github.com/nextcloud/helm/issues/399#issuecomment-1623875028 ? If those don't work, maybe tagging tvories to troubleshoot would be helpful.

Sorry for not being more helpful. I don't run NFS personally, but I've added an NFS label, as NFS comes up frequently enough in the issues, and I'm going to start grouping them all together for easier searching as I come by them.

MohammedNoureldin commented 4 months ago

Hi, @tvories, may I ask for your support?

I am trying to improve the very poor performance and initialization time when using RWX with Nextcloud. @jessebot suggested checking the configuration you posted.

I am using Longhorn with NFS-common installed on all nodes.

I created a custom StorageClass and added the same configuration as you showed:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-nfs-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  nfsOptions: "nfsvers=4.2,tcp,intr,hard,noatime,nodiratime,rsize=1048576,wsize=1048576"

Still I see the same horrible performance.

I noticed that at the beginning the initialization was quick enough, but after the first 200 MB it dropped to probably less than 1 MB/s, and it keeps getting slower and slower...

Do you have any suggestions for how to debug this, please?

pfaelzerchen commented 4 months ago

> Can anyone still working on this confirm whether disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.

I can confirm that disabling the probes is a functional workaround. But I would consider tweaking the startupProbe a better solution:

startupProbe:
  enabled: true
  initialDelaySeconds: 120 #30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 50 #30
  successThreshold: 1

This keeps Kubernetes waiting long enough to get even major upgrades done. One probably has to play around with initialDelaySeconds or failureThreshold depending on whether the performance is better or worse.

MohammedNoureldin commented 4 months ago

> > Can anyone still working on this confirm whether disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.
>
> I can confirm that disabling the probes is a functional workaround. But I would consider tweaking the startupProbe a better solution:
>
> startupProbe:
>   enabled: true
>   initialDelaySeconds: 120 #30
>   periodSeconds: 10
>   timeoutSeconds: 5
>   failureThreshold: 50 #30
>   successThreshold: 1
>
> This keeps Kubernetes waiting long enough to get even major upgrades done. One probably has to play around with initialDelaySeconds or failureThreshold depending on whether the performance is better or worse.

That is just a workaround for the real issue: the whole PV initialization, and even the general performance of Nextcloud with RWX, is horrible.

I understand that delaying the probes helps get the software running, but we should try to find a proper solution, maybe by fine-tuning the NFS options, or in some other way I don't know of; any suggestion would be great and helpful.

MohammedNoureldin commented 4 months ago

@tvories @jessebot I rechecked and can confirm what I mentioned in the comment above https://github.com/nextcloud/helm/issues/399#issuecomment-2142214767

Initializing Nextcloud on NFS starts at a good speed and the PV gets filled really quickly, I would say at more than 25 MB/s, but it gradually slows down until about 200 MB of the PV is used; at that point it becomes horribly slow, almost 0.1 MB/s.

What could the cause be?

tvories commented 4 months ago

@MohammedNoureldin I see you have some NFS settings defined in your NFS StorageClass. I'm assuming it has something to do with how you are hosting your NFS share or some configuration there. Do you have NFS v4.2 enabled on your NFS server? Have you tried adjusting some of your NFS settings to see if it makes a difference?

It's going to be hard to troubleshoot without knowing all of the details of your network and storage situation.
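
Two generic checks that might help separate the storage layer from Nextcloud itself (the pod name is a placeholder):

# replace $NEXTCLOUD_POD with your actual pod name
# show the NFS mount options that actually ended up on the volume
kubectl exec $NEXTCLOUD_POD -- sh -c 'grep nfs /proc/mounts'
# rough raw write test on the same volume (writes and then removes a 100 MB file)
kubectl exec $NEXTCLOUD_POD -- sh -c 'dd if=/dev/zero of=/var/www/html/ddtest bs=1M count=100 conv=fsync && rm /var/www/html/ddtest'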

You could rule out NFS as the culprit by trying a different storage class and seeing if it works better.
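
For example, a quick comparison run could point the chart at whatever RWO class the cluster already has (the class name below is just an example; k3s ships local-path):

persistence:
  enabled: true
  storageClass: local-path   # any available RWO class, just for the comparison
  accessMode: ReadWriteOnce
  size: 8Gi

  nextcloudData:
    enabled: false   # a single PVC keeps the comparison simple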