owncloud / ocis

ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

All new uploads fail to process (425 Too Early) #8720

Gentoli closed this issue 2 months ago

Gentoli commented 7 months ago

Describe the bug

After creating a new ocis deployment on k8s and uploading a folder (~16,000 files, 17 GiB) to a space with the Windows client, part of the uploaded folder, as well as any new upload, gets stuck in processing.

Steps to reproduce

  1. create ocis with external NATS cluster
  2. create space
  3. copy/upload 16,000 files to the space (mostly files smaller than 2 MB); a sketch for generating a comparable test set follows this list
  4. upload any new file
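
A minimal shell sketch for generating a comparable test set; the file count, sizes, and paths are illustrative assumptions, not taken from the report:

# generate ~16,000 files, mostly smaller than 2 MB, with random sizes
mkdir -p repro
for i in $(seq 1 16000); do
  size_kb=$(( (RANDOM % 2000) + 1 ))
  head -c "$(( size_kb * 1024 ))" /dev/urandom > "repro/file_${i}.bin"
done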

Expected behavior

Uploads do not break and the uploaded files can be downloaded

Actual behavior

Any upload gets stuck in processing (425 Too Early); screenshot attached.

Setup

`owncloud/ocis:5.0.0-rc.2` default helm settings with:

  - PVCs for all persistence
  - `nats:4222` for all NATS addresses
  - security context gid/uid overridden to `1001190000` (required by OKD/OpenShift)
  - s3ng with Ceph RGW
  - tracing targeting an in-memory Jaeger (added for debugging after uploads stopped working)

Additional context

ocis is stuck as-is; I'm happy to run any commands or tools to troubleshoot this.

Some observations:

wkloucek commented 7 months ago

@Gentoli could you please upgrade to oCIS 5.0.0 by using the latest chart version from the main branch (e.g. https://github.com/owncloud/ocis-charts/commit/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a).

You'd either want to enable the postprocessing restart cronjob: https://github.com/owncloud/ocis-charts/blob/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a/charts/ocis/values.yaml#L1414-L1420

or run this command in a postprocessing pod: `ocis postprocessing restart -s finished`
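
A minimal sketch of running that command via kubectl; the namespace and deployment name are assumptions and depend on your release:

# locate the postprocessing pod (namespace and name pattern are assumptions)
kubectl -n ocis get pods | grep postprocessing
# run the restart command mentioned above inside the postprocessing deployment
kubectl -n ocis exec deploy/postprocessing -- ocis postprocessing restart -s finished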

Gentoli commented 7 months ago

@wkloucek

Looks like that did the trick. Do you know what I can set up to capture what caused postprocessing to stop?

Here are some logs from postprocessing. Initial upload:

Mar 24, 2024, 05:47:24.918 {"level":"error","service":"postprocessing","error":"Error publishing message to topic: nats: timeout","time":"2024-03-24T09:47:20Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:178","message":"unable to publish event"}
Mar 24, 2024, 06:20:38.533 nats: slow consumer, messages dropped on connection [59] for subscription on "main-queue"

After the restart:

Mar 26, 2024, 02:52:03.987 {"level":"error","service":"postprocessing","uploadID":"5d65701d-7f7d-4d4a-bee8-24be5b72403a","error":"Failed to delete data: nats: timeout","time":"2024-03-26T06:51:54Z","line":"github.com/owncloud/ocis/v2/services/postprocessing/pkg/service/service.go:155","message":"cannot delete upload"}
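
Given the nats timeout and slow-consumer errors above, one way to inspect the JetStream side is the nats CLI run from inside the cluster (for example from the NATS chart's nats-box pod); the server address matches the setup described above, the stream name is a placeholder:

nats --server nats://nats:4222 stream ls                        # list the JetStream streams ocis created
nats --server nats://nats:4222 stream report                    # per-stream message and consumer statistics
nats --server nats://nats:4222 consumer report <stream-name>    # ack-pending / redelivery counts per consumer
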
wkloucek commented 7 months ago

@Gentoli I also experienced a setup where creating / modifying streams via NACK (NATS Controllers for Kubernetes) and starting oCIS at the same time led to open connections to NATS that were somehow not functioning. Sadly I haven't found the time yet to build a reproducer outside that special environment. Probably you ran into the same issue. Do you use NACK by any chance, too? (see also https://github.com/owncloud/ocis-charts/blob/5ca20867637b3a5a4cc0a7ba5d9f47cb25cea28a/deployments/ocis-nats/helmfile.yaml#L51-L70)

kobergj commented 7 months ago

@Gentoli are you sure the uploads were really stuck in postprocessing? If there is some "bottleneck" in postprocessing (e.g. virus scan), events might only be handled sequentially. So if one service needs to work off 16,000 events and needs 2 seconds to finish each one, this will take a long time. Meanwhile other uploads (also new ones) will look stuck as they are waiting for antivirus to scan them. If that is the case, the system will simply heal itself after it has finished processing all uploads.
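
For a sense of scale under those assumed numbers: 16,000 events at ~2 seconds each is about 32,000 seconds, i.e. roughly 9 hours of purely sequential postprocessing, during which new uploads would also appear stuck.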

Gentoli commented 7 months ago

@kobergj

are you sure the uploads were really stuck in postprocessing?

I think so; they are ready for postprocessing but nothing is done. The whole service uses less than 100m CPU after it got stuck and I don't have anything enabled manually. Even just before enabling the cronjob (a day after the initial upload), new uploads still failed (screenshot attached).

Gentoli commented 7 months ago

@wkloucek

Oh, I didn't know there was an example for using an external NATS.

I didn't find any documentation for how NATS should be set up for ocis, so I used the official chart for a cluster. I don't have NACK or any NATS k8s CRDs.

Here are the helm values I used:

cluster:
  enabled: true
jetstream:
  enabled: true
  fileStore:
    pvc:
      enabled: true
      size: 200Gi
      storageClassName: ceph-fs-replicated-ssd
  memoryStore:
    enabled: true
    maxSize: 4Gi
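
As a side note, with values like these one way to double-check that JetStream is actually enabled and clustered is to exec into the nats-box pod that the official NATS chart deploys; the deployment name and namespace below are assumptions:

kubectl -n nats exec -it deploy/nats-box -- nats account info   # should report JetStream as enabled
kubectl -n nats exec -it deploy/nats-box -- nats stream ls      # streams created by ocis, once it is running
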
wkloucek commented 7 months ago

Your values look good. The example is also using the official NATS chart.

Do I understand correctly that you recreated all oCIS pods (e.g. `kubectl rollout restart deploy`) and uploads are still not available immediately?
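
A sketch of that restart plus a quick verification, assuming the oCIS release lives in a namespace called ocis (the deployment name below is an assumption):

kubectl -n ocis rollout restart deploy                  # recreate all oCIS deployments
kubectl -n ocis rollout status deploy/postprocessing    # wait for the postprocessing rollout to finish
kubectl -n ocis get pods                                # confirm everything is back to Running/Ready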

Gentoli commented 7 months ago

@wkloucek

Yes, that's what I did for the restart. I have also rolled NATS.

wkloucek commented 2 months ago

Does the problem still persist with a newer version? With 5.0.6 we are not noticing any problems on larger installations.

micbar commented 2 months ago

Let us close this here. I'm not aware of any new reports of this problem.