ScionOfDesign opened 1 year ago
It apparently has the same issue if I try to use my own storage classes and configuration.
```yaml
persistence:
  # Nextcloud Data (/var/www/html)
  enabled: true
  #existingClaim: nextcloud-html-data-claim
  storageClass: longhorn-nvme
  accessMode: ReadWriteMany
  size: 8Gi
  nextcloudData:
    enabled: true
    storageClass: longhorn-block
    accessMode: ReadWriteMany
    size: 10Gi
    #existingClaim: nextcloud-user-data-claim
```
It seems that the issue is with the `accessMode` of the primary persistent volume. It cannot be `ReadWriteMany`.
This works:
```yaml
persistence:
  # Nextcloud Data (/var/www/html)
  enabled: true
  #existingClaim: nextcloud-html-data-claim
  storageClass: longhorn-block
  #accessMode: ReadWriteOnce
  #size: 8Gi
  nextcloudData:
    enabled: true
    storageClass: longhorn-block
    accessMode: ReadWriteMany
    #size: 10Gi
    #existingClaim: nextcloud-user-data-claim
```
The issue seems to be related to: https://github.com/nextcloud/helm/issues/10 Disabling the probes worked. I will continue to investigate.
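For anyone looking for the exact values, disabling the probes in this chart should look roughly like this in `values.yaml` (a sketch based on the chart's standard probe toggles; check your chart version's defaults):

```yaml
# Sketch: turn off all health probes so Kubernetes never
# restarts the pod while the long initialization runs.
livenessProbe:
  enabled: false
readinessProbe:
  enabled: false
startupProbe:
  enabled: false
```

Remember to re-enable them once the install or upgrade has finished, since without probes a genuinely wedged pod will never be restarted.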
Having the same issue using my own storage class as well. Disabling probes doesn't help, as the server seems to have trouble starting.
Same issue for me too.
The only difference disabling probes made for me is that the pod is now 'running', but the same issue persists - `Initializing nextcloud 26.0.2.1 ...`
hmmm, for those having this issue, could you let me know if there are any Events listed when you do a:

```shell
# replace $NEXTCLOUD_POD with your actual pod name
kubectl describe pod $NEXTCLOUD_POD
```

Similarly, do the existing claims have any Events when you run a describe?

```shell
# replace $NEXTCLOUD_PVC with your actual pvc name
kubectl describe pvc $NEXTCLOUD_PVC
```
Also, does the status show Pending there for the PVC?
@tvories or @provokateurin have you tried using `ReadWriteMany` PVCs with the nextcloud container before? I tried to use longhorn at one point, but couldn't get it working, and assumed it was because I misconfigured longhorn, so I gave up and went back to the local path with k3s 🤔 We'd need to have `ReadWriteMany` working in order to support multiple pod replicas across multiple nodes accessing the same PVC, but I'm unsure what's currently blocking that....
Anyone else in the community who has knowledge on this is also welcome to give input :)
I have used this chart with longhorn in the past, but it was RWO, IIRC.
I am using ReadWriteMany on an NFS mount for my primary Nextcloud storage and have been for a very long time:
@ScionOfDesign can you paste your PVC values? My guess is that this is a Longhorn configuration issue.
I am using `existingClaim` for my PVC rather than having the chart create it.
```yaml
# pvc.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nextcloud-nfs-config
spec:
  storageClassName: nextcloud-nfs-config
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /mnt/fatguys/k8s/nextcloud
    server: ${SECRET_NAS1}
  mountOptions:
    - nfsvers=4.1
    - tcp
    - intr
    - hard
    - noatime
    - nodiratime
    - rsize=1048576
    - wsize=1048576
```

```yaml
# values.yaml
persistence:
  enabled: true
  accessMode: ReadWriteMany
  size: 1Mi
  existingClaim: nextcloud-nfs-config
```
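For completeness, `existingClaim` refers to a PVC rather than the PV itself; a claim binding to that PV would presumably look something like this (a sketch, with names assumed to match the manifests above):

```yaml
# Hypothetical PVC that binds to the PV via the matching storageClassName
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nextcloud-nfs-config
spec:
  storageClassName: nextcloud-nfs-config
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
```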
Facing the same issue with the newest version of the Helm chart. The Nextcloud container is stuck at `Initializing nextcloud 27.0.0.8 ...` when using a Longhorn RWX volume (served via NFS).
I'm using Longhorn now on k3s having followed this guide, and although it's definitely slower than local-path (due to Longhorn's abstraction layer), it seems to be working for me right now. I'm not using NFS though :/
Here's my current `values.yaml`. I'm using existing claims for both nextcloud files and postgres. Here's nextcloud's PVC.
I noticed that one of the things I keep seeing is that the users in this thread with the failure to initialize are using two PVCs for nextcloud by setting `persistence.nextcloudData.enabled: true`. Could anyone having the issue verify a couple of things for us?

1. does this happen if you use longhorn without NFS?
2. does this happen if you use only one PVC (i.e. `persistence.nextcloudData.enabled: false`)?

To clarify, this should work with two PVCs and I'm not suggesting we don't support that. I'm just trying to narrow down the exact issue. 🤔
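For the second question, a minimal values override for the single-PVC test might look like this (a sketch; only the relevant keys shown):

```yaml
persistence:
  enabled: true
  # single PVC: /var/www/html, including the data directory
  nextcloudData:
    enabled: false
```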
edit: updated links to point to a specific commit in time
@ScionOfDesign Your answer worked for me. Try adding the following to your `values.yaml`. If I remember correctly, this allows the `livenessProbe` and `readinessProbe` more time so they don't restart the container while it's taking a while to install. If you need longer, you can raise these values.
```yaml
startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50
```
It'll take a while to install still. I think I saw somewhere that it took nearly two hours for some poor guy. For me though it usually takes 10-20 minutes.
That's such a long install time though :o
Any update on this issue? We're facing the same issue and configured the startupProbe to be extremely long. But as @christensenjairus wrote, updates take 10-20 minutes.
From my PoV it has something to do with the rsync that copies the files to `/var/www/html`. But I don't get why it is so slow in the nextcloud container during init?
I have other containers using Longhorn and RWX volumes without those performance problems.
How can we help to get this issue fixed?
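One way to see whether the storage layer itself is the bottleneck (rather than Nextcloud's rsync specifically) is a quick sequential-write check on the mounted volume. This is just a sketch: the `MOUNT` variable is an assumption and defaults to `/tmp` so it runs anywhere, but you would point it at your PVC's mount path (e.g. `/var/www/html`) when running it inside the pod via `kubectl exec`:

```shell
# Rough sequential-write throughput check on a mounted volume.
# MOUNT is a placeholder: set it to your PVC's mount path
# (e.g. /var/www/html) when running inside the nextcloud pod.
MOUNT="${MOUNT:-/tmp}"
# fsync before exiting so the number reflects writes actually
# reaching the backend, not just the page cache
dd if=/dev/zero of="$MOUNT/nc-bench" bs=1M count=64 conv=fsync 2>&1 | tail -n 1
rm -f "$MOUNT/nc-bench"
```

Comparing the reported MB/s between an RWX (NFS) mount and an RWO mount on the same cluster would show whether the slowdown is in the volume layer at all.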
It is not a Longhorn-only issue. I switched from Longhorn to rook-ceph and saw similar issues. Last weekend, I wanted to upgrade from Nextcloud 27 to Nextcloud 28. The whole process took >10 minutes. So I just disabled the probes and re-enabled them later.
During this, I saw three rsync processes working. This is rather strange, as I can normally rsync several GiB in the same time when doing backups with the same storage backend.
> 1. does this happen if you use longhorn without NFS?

When I was using Longhorn, this problem did not appear with RWO PVCs. It was an NFS-only problem. But as I stated: I didn't have this problem with rook-ceph, except with the major release upgrade from 27 to 28.
> 2. does this happen if you use only one PVC (i.e. `persistence.nextcloudData.enabled: false`)?

Yes. I am using one PVC only, and it happened with Longhorn RWX; for the release upgrade from 27 to 28 it happened on rook-ceph with just one PVC.
Did anyone find the real cause of this? I mean, why does it take so long only with RWX? Or did anyone find any other (better) solution for this?
Would we have a performance issue when using RWX, or is it an initialization issue only?
I have been waiting for more than 20 minutes after adding the startupProbe, but still nothing new is shown :/
I tried to set the startupProbe like @christensenjairus did, and both pods were running after a few minutes. But the performance is very poor compared to before. After a few clicks I got an Internal Server Error, and it seems that the data is broken.
For me, I don't see a possibility to deploy Nextcloud with high availability right now. Or am I wrong about that?
My config:

```yaml
replicaCount: 2
startupProbe:
  enabled: true
  initialDelaySeconds: 120
  failureThreshold: 50
persistence:
  enabled: true
  accessMode: ReadWriteMany
  size: 8Gi
  nextcloudData:
    enabled: true
    accessMode: ReadWriteMany
    size: 8Gi
```
@Tim-herbie, is it still the case? I mean ignoring that after a few clicks you get internal error.
Using Longhorn with an RWX (NFS) PVC, it takes 20-30 minutes to initialize Nextcloud, and the performance is poor.
Has anyone figured out how to resolve this by fine-tuning some magic variables?
@MohammedNoureldin I didn't find a solution to use more than one replica. I'm using it right now for private purposes with only one.
More than 1 replica causing issues is a different issue, and if that's the case, please search the issues for an issue about that, or open a second issue. This issue is specifically about accessMode. Can anyone still working on this confirm if disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.
I think he's referring to the fact that when he wants to use more than 1 replica, he needs an RWX volume, which isn't currently working well because the init/upgrade process takes so long.
@mueller-tobias that makes sense; however, more than one replica is still a separate issue, as this issue implies it would break even with one replica. Multiple replicas causing issues is a known issue, but I can't seem to find the last time it was brought up 🤔
You are right, @jessebot, I was particularly talking about RWX on NFS. Even with 1 replica on RWX NFS, the whole initialization and performance are poor.
@mueller-tobias exactly, thank you.
The issue here is obvious: creating/copying files to the RWX volume takes too long. If you observe the volume during initialization, you will see that on every page refresh the size increases by ~2MB. So you can imagine how long it will take to reach the 2.5GB (the estimated final initialization size).
Though the cause of the issue is not clear to me. I am not sure if this is an issue in Nextcloud or in NFS itself. I mean, should the solution be implemented by Nextcloud, or by adapting the NFS configs? I saw people talk about turning off NFS sync, with a small risk of losing some data. Losing data is not good. That is why I am still looking for a safer solution.
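For reference, the sync/async trade-off mentioned above lives on the NFS server side in `/etc/exports`. This is a hypothetical export line (path and subnet are placeholders): `async` can speed up small-file workloads like this initialization considerably, but it acknowledges writes before they reach disk, which is exactly the data-loss risk discussed.

```
# /etc/exports — hypothetical example; path and subnet are placeholders.
# "async" trades write durability for speed; "sync" is the safe default.
/srv/nextcloud  10.0.0.0/24(rw,no_subtree_check,async)
```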
Ok, we're on the same page then. To be sure, did you try the suggestions in https://github.com/nextcloud/helm/issues/399#issuecomment-1623875028 ? If those don't work, tagging tvories to troubleshoot may be helpful.
Sorry for not being more helpful. I don't run NFS personally, but I've added an NFS label, as NFS comes up frequently enough in the issues that I'm going to start grouping them together for easier searching as I come by them.
Hi, @tvories, may I ask for your support?
I am trying to improve the very poor performance and initialization time when using RWX with Nextcloud. @jessebot suggested checking the configuration you posted.
I am using Longhorn with NFS-common installed on all nodes.
I created a custom `StorageClass` and added the same configuration as you showed:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-nfs-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  nfsOptions: "nfsvers=4.2,tcp,intr,hard,noatime,nodiratime,rsize=1048576,wsize=1048576"
```
Still, I see the same horrible performance.
I noticed that at the beginning the initialization was quick enough, but after the first 200 MB it dropped to probably less than 1 MB/s, and it keeps getting slower and slower...
Do you have any suggestion to debug this please?
> Can anyone still working on this confirm if disabling all the probes helps? If it does, we'll close this as having a workaround. If not, please let us know what errors you're getting.
I can confirm that disabling the probes is a functional workaround. But I would consider tweaking the `startupProbe` a better solution:
```yaml
startupProbe:
  enabled: true
  initialDelaySeconds: 120 #30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 50 #30
  successThreshold: 1
```
This keeps Kubernetes waiting long enough to get even major upgrades done. One probably has to play around with `initialDelaySeconds` or `failureThreshold` depending on whether the performance is better or worse.
That is just a workaround for the real issue: the whole PV initialization and even the performance of Nextcloud with RWX are horrible.
I understand that delaying the probes helps to run the software, but we should try to find a proper solution, maybe by fine-tuning the NFS options; I don't know how. Any suggestion would be great and helpful.
@tvories @jessebot I rechecked and can confirm what I mentioned in the comment above https://github.com/nextcloud/helm/issues/399#issuecomment-2142214767
Initializing Nextcloud on NFS starts at a good speed; the PV gets filled really quickly, at more than 25 MB/s. Then it gradually slows down until about 200 MB of the PV is used, at which point it becomes horribly slow, almost 0.1 MB/s.
What could the cause be?
@MohammedNoureldin I see you have some NFS settings defined in your NFS StorageClass. I'm assuming it has something to do with how you are hosting your NFS share or some configuration there. Do you have NFS v4.2 enabled on your NFS server? Have you tried adjusting some of your NFS settings to see if it makes a difference?
It's going to be hard to troubleshoot without knowing all of the details of your network and storage situation.
You could eliminate NFS as the culprit by trying a different storage class and seeing if it works better.
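As a concrete way to run that comparison, you could temporarily point the chart at a different class in `values.yaml` (a sketch; `local-path` is just an example class that happens to ship with k3s, so substitute whatever non-NFS class your cluster has):

```yaml
persistence:
  enabled: true
  # hypothetical comparison run against a non-NFS class;
  # local-path only supports a single node, hence RWO
  storageClass: local-path
  accessMode: ReadWriteOnce
  size: 8Gi
```

If initialization is fast with the non-NFS class and slow with the NFS-backed one, that narrows the problem down to the NFS layer rather than the chart or the image.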
Describe your Issue
When trying to enable persistence for Nextcloud, the container hangs when attempting to use my own existing PVCs.
Logs and Errors
The container hangs with the following log:
Describe your Environment
- Kubernetes distribution: rke2
- Helm Version (or App that manages helm): v3.11.3
- Helm Chart Version: 3.5.12
My persistence section:
Additional context, if any
It works fine if I comment out the usage of existing claims.