owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0
1.4k stars 183 forks source link

Finding user home space is flaky #3749

Closed pascalwengerter closed 2 years ago

pascalwengerter commented 2 years ago

Describe the bug

A clear and concise description of what the bug is.

Steps to reproduce

Steps to reproduce the behavior:

  1. Log in as any user
  2. Loading personal space fails unreliably

Expected behavior

Solid loading of personal space

Actual behavior

Flakiness, both when navigating and when doing full-page reloads

Logs

Success case

{"level":"info","service":"gateway","pkg":"rgrpc","traceid":"00000000000000000000000000000000","time":"2022-05-09T18:00:18+02:00","message":"listing public shares"}
{"level":"info","service":"sharing","pkg":"rgrpc","traceid":"00000000000000000000000000000000","publicshareprovider":"list","time":"2022-05-09T18:00:18+02:00","message":"list public share"}
{"level":"debug","service":"sharing","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0","from":"tcp://127.0.0.1:62003","uri":"/cs3.sharing.link.v1beta1.LinkAPI/ListPublicShares","start":"09/May/2022:18:00:18 +0200","end":"09/May/2022:18:00:18 +0200","time_ns":845542,"code":"OK","time":"2022-05-09T18:00:18+02:00","message":"unary"}
{"level":"debug","service":"gateway","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0","from":"tcp://127.0.0.1:61919","uri":"/cs3.gateway.v1beta1.GatewayAPI/ListPublicShares","start":"09/May/2022:18:00:18 +0200","end":"09/May/2022:18:00:18 +0200","time_ns":2089375,"code":"OK","time":"2022-05-09T18:00:18+02:00","message":"unary"}
{"level":"debug","service":"ocdav","name":"com.owncloud.web.ocdav","traceid":"00000000000000000000000000000000","md":{"type":2,"id":{"storage_id":"1284d238-aa92-42ce-bdc4-0b0000009157$058bff95-6708-4fe5-91e4-9ea3d377588b","opaque_id":"058bff95-6708-4fe5-91e4-9ea3d377588b"},"etag":"\"6c6e688fc3d18187aa9cc3c409027ea0\"","mime_type":"httpd/unix-directory","mtime":{"seconds":1652111982,"nanos":543652874},"path":"/1284d238-aa92-42ce-bdc4-0b0000009157$058bff95-6708-4fe5-91e4-9ea3d377588b","permission_set":{"add_grant":true,"create_container":true,"delete":true,"get_path":true,"get_quota":true,"initiate_file_download":true,"initiate_file_upload":true,"list_grants":true,"list_container":true,"list_file_versions":true,"list_recycle":true,"move":true,"remove_grant":true,"purge_recycle":true,"restore_file_version":true,"restore_recycle_item":true,"stat":true,"update_grant":true},"owner":{"idp":"https://localhost:9200","opaque_id":"058bff95-6708-4fe5-91e4-9ea3d377588b","type":1},"arbitrary_metadata":{"metadata":{"http://owncloud.org/ns/favorite":""}}},"ns":"","role":{"Name":"manager"},"dav-permissions":"RDNVCK","time":"2022-05-09T18:00:18+02:00","message":"converted PermissionSet"}
{"level":"info","service":"proxy","proto":"HTTP/1.1","request":"fe47dac0-3a5a-42e4-b33d-1f50e000fc46","remote-addr":"127.0.0.1:61924","method":"PROPFIND","status":207,"path":"/remote.php/dav/spaces/1284d238-aa92-42ce-bdc4-0b0000009157$058bff95-6708-4fe5-91e4-9ea3d377588b","duration":15.399459,"bytes":1083,"time":"2022-05-09T18:00:18+02:00","message":"access-log"}

Failure case


{"level":"debug","service":"proxy","claims":{"email":"moss@example.org","email_verified":false,"family_name":"Moss","given_name":"Maurice","name":"Maurice Moss","preferred_username":"moss","sub":"PBcG4gQTJZl-U92i@5HvWa9OJl9iiQyTS8ov76TiCiyp0Oray0MJKf-d4yPge5mDBA_x21oXJtHoXbY3U--Qeeg"},"user":{"id":{"idp":"https://localhost:9200","opaque_id":"058bff95-6708-4fe5-91e4-9ea3d377588b","type":1},"username":"moss","mail":"moss@example.org","display_name":"Maurice Moss","groups":["509a9dcd-bb37-4f4f-a01a-19dca27d9cfa"],"opaque":{"map":{"roles":{"decoder":"json","value":"WyI3MTg4MTg4My0xNzY4LTQ2YmQtYTI0ZC1hMzU2YTJhZmRmN2YiXQ=="}}},"uid_number":99,"gid_number":99},"time":"2022-05-09T18:00:20+02:00","message":"associated claims with user"}
{"level":"debug","service":"gateway","pkg":"rgrpc","traceid":"00000000000000000000000000000000","method":"/cs3.gateway.v1beta1.GatewayAPI/CreateHome","time":"2022-05-09T18:00:20+02:00","message":"skipping auth"}
{"level":"debug","service":"gateway","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.46.0","from":"tcp://127.0.0.1:61958","uri":"/cs3.storage.registry.v1beta1.RegistryAPI/GetStorageProviders","start":"09/May/2022:18:00:20 +0200","end":"09/May/2022:18:00:20 +0200","time_ns":70209,"code":"OK","time":"2022-05-09T18:00:20+02:00","message":"unary"}
{"level":"debug","service":"storage-users","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.46.0","from":"tcp://127.0.0.1:61959","uri":"/cs3.storage.provider.v1beta1.ProviderAPI/CreateStorageSpace","start":"09/May/2022:18:00:20 +0200","end":"09/May/2022:18:00:20 +0200","time_ns":312750,"code":"OK","time":"2022-05-09T18:00:20+02:00","message":"unary"}
{"level":"debug","service":"gateway","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.46.0","from":"tcp://127.0.0.1:61917","uri":"/cs3.gateway.v1beta1.GatewayAPI/CreateHome","start":"09/May/2022:18:00:20 +0200","end":"09/May/2022:18:00:20 +0200","time_ns":775625,"code":"OK","time":"2022-05-09T18:00:20+02:00","message":"unary"}
{"level":"debug","service":"proxy","policy":"ocis","method":"PROPFIND","prefix":"/remote.php/","path":"/remote.php/dav/spaces/1284d238-aa92-42ce-bdc4-0b0000009157$058bff95-6708-4fe5-91e4-9ea3d377588b","routeType":"prefix","time":"2022-05-09T18:00:20+02:00","message":"director found"}
{"level":"debug","service":"idm","scope":"ldap","time":"2022-05-09T18:00:20+02:00","message":"http: proxy error: dial tcp 127.0.0.1:38525: connect: connection refused"}
{"level":"info","service":"proxy","proto":"HTTP/1.1","request":"072b9b89-e029-40ee-beb7-7581dd95f25e","remote-addr":"127.0.0.1:61924","method":"PROPFIND","status":502,"path":"/remote.php/dav/spaces/1284d238-aa92-42ce-bdc4-0b0000009157$058bff95-6708-4fe5-91e4-9ea3d377588b","duration":4.445,"bytes":0,"time":"2022-05-09T18:00:20+02:00","message":"access-log"}

Setup

oCIS on curren master (today 13:00)

pascalwengerter commented 2 years ago

~Hint: Seems to fail for Golang version <1.18~ initial suspect was that a Go version bump would resolve it, for me (and others) it still fails on Go v1.18.X

micbar commented 2 years ago

The problem happens here https://github.com/cs3org/reva/blob/edge/pkg/storage/utils/filelocks/filelocks.go#L144

We need to find a solution to suppress the error.

butonic commented 2 years ago

Do we need to always supress the error, or only if we tried to delete a file that does no longer exist?

pascalwengerter commented 2 years ago

Do we need to always supress the error, or only if we tried to delete a file that does no longer exist?

From what I witness and the way I understand it the problem occurs when trying to fetch the personal space or upload files into it (also initially when no files are present, but not limited to that)

butonic commented 2 years ago

Hm, I can trigger

2022-05-10T07:08:17Z ERR failed to list storage spaces error="remove /home/vscode/.ocis/storage/users/spaces/so/me-admin-user-id-0000-000000000000/nodes/so/me/-a/dm/in-user-id-0000-000000000000.flock: no such file or directory" filters=[{"Term":{"Id":{"opaque_id":"some-admin-user-id-0000-000000000000!some-admin-user-id-0000-000000000000"}},"type":2}] pkg=rgrpc service=storage-users status={"code":15,"message":"error listing spaces","trace":"00000000000000000000000000000000"} traceid=00000000000000000000000000000000

by sending 50-100 concurrent PROPFINDS to a spaces endpoint.

for f in `seq 1 100`; do curl -o /dev/null -s -w "%{http_code}\n" -k -X PROPFIND 'https://cloud.ocis.test/dav/spaces/1284d238-aa92-42ce-bdc4-0b0000009157$some-admin-user-id-0000-000000000000' -u 'admin:admin' & ; done

Which is the locking issue mentioned in https://github.com/owncloud/ocis/issues/3749#issuecomment-1121408177

But in your case I was stumbling over http: proxy error: dial tcp 127.0.0.1:38525: connect: connection refused . @pascalwengerter do you see .flock: no such file or directory in your log when listing fails?

pascalwengerter commented 2 years ago

I posted my logs above and I don't remember seeing any .flock logs there, happy to help with debugging again later today? @dragotin and @lookacat can also reproduce it IIRC

lookacat commented 2 years ago

yes I get this error with the latest master and latest build for osx. For me it fails every time sadly :/

lookacat commented 2 years ago

First load of the page after login

2022-05-10T10:00:24+02:00 ERR error initializing metadata client error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:9215: socket: too many open files\"" service=ocis
2022-05-10T10:00:24+02:00 ERR failed to list storage spaces error="xattrs: Unable to lock file for read: could not acquire lock after wait" filters=[{"Term":{"Id":{"opaque_id":"bda94c72-625e-4a4b-80ca-d9c046534c50!bda94c72-625e-4a4b-80ca-d9c046534c50"}},"type":2}] pkg=rgrpc service=storage-users status={"code":15,"message":"error listing spaces","trace":"00000000000000000000000000000000"} traceid=00000000000000000000000000000000
2022-05-10T10:00:24+02:00 ERR error sending get quota grpc request service=graph
2022-05-10T10:00:28+02:00 ERR Could not add default role error="{\"id\":\"go.micro.client\",\"code\":503,\"detail\":\"connection error: desc = \\\"transport: Error while dialing dial tcp 127.0.0.1:9191: socket: too many open files\\\"\",\"status\":\"Service Unavailable\"}" service=proxy
2022-05-10T10:00:28+02:00 ERR failed to get roles for user error="{\"id\":\"go.micro.client\",\"code\":503,\"detail\":\"connection error: desc = \\\"transport: Error while dialing dial tcp 127.0.0.1:9191: socket: too many open files\\\"\",\"status\":\"Service Unavailable\"}" service=ocs userid=bda94c72-625e-4a4b-80ca-d9c046534c50
2022-05-10T10:00:29+02:00 ERR Could not load roles error="{\"id\":\"go.micro.client\",\"code\":503,\"detail\":\"connection error: desc = \\\"transport: Error while dialing dial tcp 127.0.0.1:9191: socket: too many open files\\\"\",\"status\":\"Service Unavailable\"}" service=proxy
2022-05-10T10:00:29+02:00 ERR Could not add default role error="{\"id\":\"com.owncloud.api.settings\",\"code\":502,\"detail\":\"circuit breaker is open\",\"status\":\"Bad Gateway\"}" service=proxy
2022-05-10T10:00:29+02:00 ERR Could not load roles error="{\"id\":\"com.owncloud.api.settings\",\"code\":502,\"detail\":\"circuit breaker is open\",\"status\":\"Bad Gateway\"}" service=proxy

this happens after i reload a couple of times

kobergj commented 2 years ago

Reason is the flock issue. Closing this in favor of https://github.com/owncloud/ocis/issues/3757

butonic commented 2 years ago

@lookacat your logs indicate a problem with Error while dialing dial tcp 127.0.0.1:9215: socket: too many open files

That is a different root cause. could you open a new issue, as https://github.com/owncloud/ocis/issues/222 is closed.

To check the current number of open file descriptors you can use the command in https://github.com/owncloud/ocis/issues/268#issuecomment-690970585

The solution might be as easy as raising the limit of open file descriptors. The microservice architecture opens quite a lot of tcp connections between different ports.

@mmattel I cannot find something on file descriptors / ulimits