owncloud / ocis


When uploading: error doing PATCH request to data gateway #10156

Open yuri-becker opened 3 days ago

yuri-becker commented 3 days ago

Describe the bug

File uploads via the web UI fail, and the server shows high CPU usage even without any user activity.

Steps to reproduce

Try to upload a file via the web UI.

Expected behavior

The upload succeeds and the file is available.

Actual behavior

Upload fails with Unknown Error. For other users, the upload seems to have succeeded, but the file is grayed out and not available. The log then contains entries like this:

{"level":"error","service":"ocdav","name":"com.owncloud.web.ocdav","traceid":"8ff9c83be64eddf7f24fc1f19a3c2769","request-id":"6dde09b8-4611-4743-a38e-bd8605657e00","path":"/public/POcnTlUcNrxpUXb/<redacted>","error":"Patch \"https://cloud.catboy.house/data/<redacted>\": context canceled","time":"2024-09-25T06:49:43Z","message":"error doing PATCH request to data gateway"}

Also, this log entry keeps popping up:

{"level":"error","service":"postprocessing","uploadID":"da362552-3dfb-4952-a535-89226a610a0f","error":"Failed to delete data: nats: timeout","time":"2024-09-25T07:27:18Z","message":"cannot delete upload"}

Setup

The server is managed via Ansible. The container setup is this:

```yaml
- name: 🚀 Starts container
  community.docker.docker_container:
    name: "{{ owncloud_container_name }}"
    image: owncloud/ocis
    state: started
    restart_policy: unless-stopped
    ports:
      - "{{ docker_gateway }}:{{ owncloud_service_port }}:9200"
    entrypoint:
      - /bin/sh
    command: ["-c", "ocis init || true; ocis server"]
    env:
      OCIS_INSECURE: "false"
      OCIS_URL: "https://{{ owncloud_domain }}"
      PROXY_HTTP_ADDR: "0.0.0.0:9200"
      PROXY_TLS: "false"
      OCIS_LOG_LEVEL: "warn"
      OCIS_ADD_RUN_SERVICES: "notifications"
      NOTIFICATIONS_SMTP_HOST: "{{ smtp_server }}"
      NOTIFICATIONS_SMTP_PORT: "{{ smtp_port }}"
      NOTIFICATIONS_SMTP_SENDER: "{{ owncloud_smtp_sender }}"
      NOTIFICATIONS_SMTP_USERNAME: "{{ owncloud_smtp_user }}"
      NOTIFICATIONS_SMTP_PASSWORD: "{{ owncloud_smtp_password }}"
      NOTIFICATIONS_SMTP_INSECURE: "false"
    user: "{{ owncloud_uid }}:{{ app_gid }}"
    volumes:
      - "{{ owncloud_volume_config }}:/etc/ocis"
      - "{{ owncloud_volume_data }}:/var/lib/ocis"
```

The reverse proxy (Caddy) has this setup for ownCloud:

```caddy
cloud.catboy.house {
    reverse_proxy 172.17.0.1:28946 {
        request_buffers 256k
        response_buffers 256k
    }
}
```

The complete setup is here: https://github.com/yuri-becker/catboy-house/tree/main/roles/owncloud

Additional context

I assume this is some kind of permission issue. Is OCIS having issues with changing the user? Uploading via the client seems to work, or at least seemed to have worked.

rhafer commented 3 days ago

@yuri-becker I guess we need more information

I assume this is some kind of permission issue. Is OCIS having issues with changing the user?

If that user has write permissions on the volumes/directories, it shouldn't.

yuri-becker commented 3 days ago

@rhafer Thanks for your reply.

rhafer commented 3 days ago

Hm, there are tons of nats-related errors in the logs (timeouts, warnings about dropped messages). Something seems to be really weird there. Is that server under constant high CPU load, or only when you're trying to upload?

Also, can you check what ocis storage-users uploads sessions says? Are there any sessions stuck? Maybe it helps to clean them up. (See https://doc.owncloud.com/ocis/5.0/deployment/services/s-list/storage-users.html#sessions-command for details).
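In case it helps, roughly something like this (assuming you run it inside the running ocis container, e.g. via docker exec):

```sh
# List the upload sessions storage-users currently knows about:
ocis storage-users uploads sessions

# Remove stale sessions (the --clean flag mentioned further down):
ocis storage-users uploads sessions --clean
```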

Could you by chance try if a clean setup of ocis 6.4.0 (using separate volumes) shows the same behavior on your system?

BTW, was the behavior broken from the beginning? Or could you successfully upload files at some point before?

yuri-becker commented 3 days ago

Yeah, OCIS is producing a high CPU load constantly without user interaction, possibly because it keeps retrying to delete uploads over and over. ocis storage-users uploads sessions outputs a ton of sessions. Clearing them didn't help. On a fresh install, the upload worked after a few tries.

rhafer commented 3 days ago

Yeah, OCIS is producing a high CPU load constantly without user interaction, possibly because it keeps retrying to delete uploads over and over. ocis storage-users uploads sessions outputs a ton of sessions. Clearing them didn't help.

Does that mean the sessions are still visible after ocis storage-users uploads sessions --clean?

To get the nats service back into a sane state, you could try to clear the nats folder in your data volume while ocis is stopped, and then restart ocis; that will re-initialize the nats directory.
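Roughly like this (a sketch only — I'm assuming the nats state lives in a nats subdirectory of the data volume, and the container name/host path below are just the placeholders from your Ansible variables):

```sh
# Stop ocis, wipe the nats state from the data volume on the host
# (the directory mounted to /var/lib/ocis), then start ocis again:
docker stop <owncloud_container_name>
rm -rf <owncloud_volume_data>/nats
docker start <owncloud_container_name>
```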

On a fresh install, the upload worked after a few tries.

What does "after a few tries" mean here? Which release is that fresh install using?

Also, what kind of machine is this running on? How much RAM, what kind of CPU, storage, etc.?

yuri-becker commented 3 days ago

Does that mean the sessions are still visible after ocis storage-users uploads sessions --clean?

The table was empty afterwards.

To get the nats service back into a sane state, you could try to clear the nats folder in your data volume while ocis is stopped, and then restart ocis; that will re-initialize the nats directory.

That seemed to have fixed it temporarily. Once I try to upload a "larger" file (19 MB), it goes back to being broken again.

What does "after a few tries" mean here? Which release is that fresh install using?

It fails twice, then the upload succeeds on the third try. Then it fails twice again and succeeds on the third try once more. That was only a <1 MB file, though.

Also, what kind of machine is this running on? How much RAM, what kind of CPU, storage, etc.?

I have 43 TB of free storage, 126 GB of RAM (of which 14 GB is used), and a Xeon E5-1650 v3 (6 cores at 3.8 GHz).

EDIT: The fresh install was using the same release; I'm running OCIS in Docker and it should pull the latest version daily.

rhafer commented 3 days ago

EDIT: The fresh install was using the same release; I'm running OCIS in Docker and it should pull the latest version daily.

We've recently introduced so-called rolling releases, which we publish roughly every 3 weeks (see https://owncloud.dev/ocis/release_roadmap/#release-types). They go to a different Docker repo: https://hub.docker.com/r/owncloud/ocis-rolling. It might be helpful to check whether the issue still appears with the latest release from there (which is 6.4.0).
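For your setup that would basically just mean pointing the image at owncloud/ocis-rolling; as a quick manual check (assuming the 6.4.0 tag is published there):

```sh
# Pull the rolling image instead of owncloud/ocis:
docker pull owncloud/ocis-rolling:6.4.0
```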

yuri-becker commented 3 days ago

The issue still persists on 6.4.0.

rhafer commented 3 days ago

Ok, thanks for trying.

Could you please set OCIS_LOG_LEVEL=debug, start a clean ocis (ideally 6.4.0) instance, reproduce the upload issue, and then attach the full log of that?
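Something along these lines should do for a throwaway instance (just a sketch — the volume names are examples, and the port/env values mirror your existing Ansible task and Caddy config, so adjust as needed):

```sh
# Clean throwaway ocis instance on fresh volumes, with debug logging,
# using the same entrypoint trick as the Ansible task above:
docker run --rm \
  -p 172.17.0.1:28946:9200 \
  -e OCIS_INSECURE=false \
  -e OCIS_URL=https://cloud.catboy.house \
  -e PROXY_HTTP_ADDR=0.0.0.0:9200 \
  -e PROXY_TLS=false \
  -e OCIS_LOG_LEVEL=debug \
  -v ocis-debug-config:/etc/ocis \
  -v ocis-debug-data:/var/lib/ocis \
  --entrypoint /bin/sh \
  owncloud/ocis-rolling:6.4.0 \
  -c 'ocis init || true; ocis server' 2>&1 | tee ocis-debug.log
```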

yuri-becker commented 3 days ago

Done. Funnily enough, on this install I wasn't able to upload even small files. owncloud_logs_fresh_install.txt

rhafer commented 2 days ago

Ok, I think that gets us a bit further. I can see a couple of successful uploads, but there is also at least one upload failing:

2024-09-25T15:16:38Z INF access-log | service=proxy proto=HTTP/1.1 request-id=a1c6abd7-29be-454b-9a52-3d26fac34430 traceid=b48f7713426e8f72d7e28fb5f67ef253 remote-addr=93.208.197.202 method=POST status=502 path=/remote.php/dav/spaces/c6d8a02e-77f9-409c-93e2-12b72659fa41$b69dbc42-e8bd-4e24-a626-4d6aae354d64 duration=56.566584 bytes=0 line=github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/accesslog.go:34 
2024-09-25T15:16:38Z DBG Decomposedfs: built session info | service=storage-users pkg=rgrpc traceid=b48f7713426e8f72d7e28fb5f67ef253 session={} line=github.com/cs3org/reva/v2@v2.24.1/pkg/storage/utils/decomposedfs/upload.go:305 
2024-09-25T15:16:38Z INF file upload | service=storage-users pkg=rgrpc traceid=b48f7713426e8f72d7e28fb5f67ef253 data-server=http://localhost:9158/data/simple/26ae9dd0-ea47-4ef3-bbeb-0538c052aa56 fn=./small_file.stl xs=map[md5:100 unset:1000] line=github.com/cs3org/reva/v2@v2.24.1/internal/grpc/services/storageprovider/storageprovider.go:470 
2024-09-25T15:16:38Z INF file upload | service=storage-users pkg=rgrpc traceid=b48f7713426e8f72d7e28fb5f67ef253 data-server=http://localhost:9158/data/tus/26ae9dd0-ea47-4ef3-bbeb-0538c052aa56 fn=./small_file.stl xs=map[md5:100 unset:1000] line=github.com/cs3org/reva/v2@v2.24.1/internal/grpc/services/storageprovider/storageprovider.go:470 
2024-09-25T15:16:38Z DBG unary | service=storage-users pkg=rgrpc traceid=b48f7713426e8f72d7e28fb5f67ef253 user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 from=tcp://127.0.0.1:49880 uri=/cs3.storage.provider.v1beta1.ProviderAPI/InitiateFileUpload start=25/Sep/2024:15:16:38 +0000 end=25/Sep/2024:15:16:38 +0000 time_ns=61016562 code=OK line=github.com/cs3org/reva/v2@v2.24.1/internal/grpc/interceptors/log/log.go:69 
2024-09-25T15:16:38Z DBG unary | service=gateway pkg=rgrpc traceid=b48f7713426e8f72d7e28fb5f67ef253 user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 from=tcp://127.0.0.1:51788 uri=/cs3.gateway.v1beta1.GatewayAPI/InitiateFileUpload start=25/Sep/2024:15:16:38 +0000 end=25/Sep/2024:15:16:38 +0000 time_ns=62587837 code=OK line=github.com/cs3org/reva/v2@v2.24.1/internal/grpc/interceptors/log/log.go:69 
2024-09-25T15:16:38Z ERR error doing PATCH request to data gateway | service=ocdav name=com.owncloud.web.ocdav traceid=b48f7713426e8f72d7e28fb5f67ef253 request-id=a1c6abd7-29be-454b-9a52-3d26fac34430 spaceid=c6d8a02e-77f9-409c-93e2-12b72659fa41$b69dbc42-e8bd-4e24-a626-4d6aae354d64 path=/ filename=small_file.stl error=Patch "https://cloud.catboy.house/data/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJyZXZhIiwiZXhwIjoxNzI3MzYzNzk4LCJpYXQiOjE3MjcyNzczOTgsInRhcmdldCI6Imh0dHA6Ly9sb2NhbGhvc3Q6OTE1OC9kYXRhL3R1cy8yNmFlOWRkMC1lYTQ3LTRlZjMtYmJlYi0wNTM4YzA1MmFhNTYifQ.25taQcy58gE_SF8KzQzInVeAeaptFimE2u_gWLHoJy8": context canceled line=github.com/cs3org/reva/v2@v2.24.1/internal/http/services/owncloud/ocdav/tus.go:298

As part of the TUS upload processing, ocis needs to send a PATCH request to the so-called data gateway. It's currently sending that to the public ocis URL, so that request will go through your reverse proxy (Caddy, I believe). For some (yet unknown) reason, the request is never received by ocis for this specific upload; at least there is no correlating access-log message for the PATCH request with a matching request-id. So it somehow got lost between ocis and the reverse proxy.
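If you want to poke at that directly, something like the following might help (hypothetical sketch — take the signed Patch URL from a fresh failing upload in the debug log): it should show whether a PATCH on the /data path makes it through Caddy to ocis at all, i.e. whether a matching access-log line shows up on the ocis side.

```sh
# Replace <signed-data-url> with the Patch URL from a current
# "error doing PATCH request to data gateway" log entry; the headers
# are the standard tus ones the web client sends.
curl -v -X PATCH "<signed-data-url>" \
  -H "Tus-Resumable: 1.0.0" \
  -H "Upload-Offset: 0" \
  -H "Content-Type: application/offset+octet-stream" \
  --data-binary @small_file.stl
```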

Do you have any logs from your reverse proxy from around that time? If not, you might wanna increase the log level of the reverse proxy and reproduce the issue once more.

yuri-becker commented 2 days ago

I hadn't been logging, but I've enabled logging and retried the uploads. Here you go. Thanks again for all your help. cloud.catboy.house-access.log