@Gentoli could you please upgrade to oCIS 5.0.0 by using the latest chart version from the main
branch (e.g. https://github.com/owncloud/ocis-charts/commit/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a).
You'd want to either enable the postprocessing restart cronjob: https://github.com/owncloud/ocis-charts/blob/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a/charts/ocis/values.yaml#L1414-L1420
or run this command in a postprocessing pod: `ocis postprocessing restart -s finished`
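For the manual variant, a minimal sketch of running that command from outside the pod; the deployment and namespace names are placeholders, only the `ocis postprocessing restart -s finished` part is taken from this thread:

```shell
# Run the restart command inside a running postprocessing pod
# ("ocis-postprocessing" / "ocis" are assumed names, adjust to your release):
kubectl -n ocis exec deploy/ocis-postprocessing -- ocis postprocessing restart -s finished
```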
@wkloucek
Looks like that did the trick. Do you know what I can set up to capture what caused postprocessing to stop?
Here are some logs from postprocessing.

Initial upload:

```
Mar 24, 2024, 05:47:24.918 {"level":"error","service":"postprocessing","error":"Error publishing message to topic: nats: timeout","time":"2024-03-24T09:47:20Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:178","message":"unable to publish event"}
Mar 24, 2024, 06:20:38.533 nats: slow consumer, messages dropped on connection [59] for subscription on "main-queue"
```

After restart:

```
Mar 26, 2024, 02:52:03.987 {"level":"error","service":"postprocessing","uploadID":"5d65701d-7f7d-4d4a-bee8-24be5b72403a","error":"Failed to delete data: nats: timeout","time":"2024-03-26T06:51:54Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:155","message":"cannot delete upload"}
```
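To capture more detail around such failures, one option (my assumption, not something confirmed in this thread) is to raise the log level of the postprocessing service and inspect the events stream with the NATS CLI; the resource names below are placeholders, and the stream name is taken from the `main-queue` subscription seen in the logs:

```shell
# Raise log verbosity for the postprocessing service (or set OCIS_LOG_LEVEL=debug globally):
kubectl -n ocis set env deploy/ocis-postprocessing POSTPROCESSING_LOG_LEVEL=debug

# Inspect the events stream and its consumers with the NATS CLI,
# e.g. from the nats-box pod shipped with the official NATS chart:
kubectl -n ocis exec -it deploy/nats-box -- nats stream info main-queue
kubectl -n ocis exec -it deploy/nats-box -- nats consumer report main-queue
```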
@Gentoli I also experienced a setup where creating / modifying streams via NACK (the NATS JetStream Kubernetes operator) and starting oCIS at the same time led to open connections to NATS that were somehow not functioning. Sadly I haven't found the time yet to build a reproducer outside that special environment. You probably ran into the same issue. Do you use NACK by any chance, too? (see also https://github.com/owncloud/ocis-charts/blob/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a/deployments/ocis-nats/helmfile.yaml#L51-L70)
@Gentoli are you sure the uploads were really stuck in postprocessing? If there is some "bottleneck" in postprocessing (e.g. virusscan), events might only be handled sequentially. So if one service needs to work off 16,000 events and takes 2 seconds to finish each one, this will take a long time (16,000 × 2 s ≈ 32,000 s, roughly 9 hours). Meanwhile other uploads (also new ones) will look stuck as they are waiting for antivirus to scan them. If that is the case, the system will simply heal itself after it has finished processing all uploads.
@kobergj
> are you sure the uploads were really stuck in postprocessing?
I think so; they are ready for postprocessing but nothing is done. The whole service uses less than 100m CPU after it got stuck, and I don't have any additional postprocessing steps enabled manually. Even just before enabling the cronjob (a day after the initial upload), new uploads were still failing.
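As a side note (an assumption on my side, not something used in this thread): oCIS 5.x ships an upload sessions command in the storage-users service that shows whether sessions are still marked as processing, which can help distinguish a stalled queue from one that is slowly draining:

```shell
# List current upload sessions and their state from inside a storage-users pod
# (deployment and namespace names are placeholders for your release):
kubectl -n ocis exec deploy/ocis-storageusers -- ocis storage-users uploads sessions

# The same subcommand also offers filters and a restart option in 5.x; check
# `ocis storage-users uploads sessions --help` for the exact flags.
```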
@wkloucek
Oh, I didn't know there was an example for using external NATS.
I didn't find any documentation on how NATS should be set up for oCIS, so I used the official chart for a cluster. I don't have NACK or any NATS k8s CRDs installed.
Here are the Helm values I used:
```yaml
cluster:
  enabled: true
jetstream:
  enabled: true
  fileStore:
    pvc:
      enabled: true
      size: 200Gi
      storageClassName: ceph-fs-replicated-ssd
  memoryStore:
    enabled: true
    maxSize: 4Gi
```
Your values look good. The example is also using the official NATS chart.
Do I get it right that you recreated all oCIS pods (e.g. `kubectl rollout restart deploy`) and uploads are still not available immediately?
@wkloucek
Yes, that's what I did for the restart. I have also done a rolling restart of NATS.
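For reference, a minimal sketch of such a full restart; namespace and resource names are assumptions, not taken from this thread:

```shell
# Restart all oCIS deployments in the release namespace, then the NATS statefulset:
kubectl -n ocis rollout restart deploy
kubectl -n ocis rollout restart statefulset/nats
kubectl -n ocis rollout status statefulset/nats
```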
Does the problem still persist with a newer version? With 5.0.6 we are not noticing any problems on larger installations.
Let's close this here. I'm not aware of any new reports of this problem.
Describe the bug
After creating a new oCIS deployment on k8s and uploading a folder (~16,000 files, 17 GiB) to a space with the Windows client, part of the uploaded folder and any new uploads get stuck in processing.
Steps to reproduce
Expected behavior
Uploads do not break and can be downloaded.
Actual behavior
Any upload gets stuck in processing (425 Too Early).
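As an illustration of the symptom (a hypothetical request; host, credentials, space ID and file path are placeholders), fetching an affected file over the spaces WebDAV endpoint returns the 425 status instead of the content:

```shell
# Expect "425" while the file is still marked as processing:
curl -sk -u alice:password -o /dev/null -w '%{http_code}\n' \
  'https://ocis.example.com/remote.php/dav/spaces/<space-id>/<path-to-file>'
```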
Setup
`owncloud/ocis:5.0.0-rc.2`, default Helm settings with:
- PVCs for all persistence
- `nats:4222` for all NATS addresses
- security context GID/UID overridden to `1001190000` (required by OKD/OpenShift)
- s3ng with Ceph RGW
- tracing targeting in-memory Jaeger (added for debugging after uploads stopped working)
Additional context
oCIS is stuck as-is; I'm happy to run any commands or tools to troubleshoot this.
Some observations:
- `upload not found` errors in storageusers during the initial upload of 16,000 files (e.g. for `/var/lib/ocis/storage/users/spaces/4b/b99030-b757-45f7-8541-2dc888287070/nodes/4b/b9/90/30/-b757-45f7-8541-2dc888287070/install.sh`); no error logs for new uploads after that. Not sure if this is related to https://github.com/owncloud/ocis/issues/7026, since with only this message I was still able to download newly uploaded files.