owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0
1.4k stars 183 forks source link

CreateHome fails #2995

Closed kulmann closed 2 years ago

kulmann commented 2 years ago

Describe the bug

Sometimes ™️ the CreateHome grpc request fails on first user login. As a result the logged in user sees a Resource not found page, which is yielded by a 404 from PROPFIND on the user home. What I noticed in the filesystem is that the following folder structure was created, with a symlink to a non-existing folder:

.
├── blobs
├── nodes
│   └── root
│       └── ddc2004c-0977-11eb-9d3f-a793888cd0f8 -> ../ddc2004c-0977-11eb-9d3f-a793888cd0f8
├── spaces
│   ├── personal
│   └── share
├── trash
└── uploads

Steps to reproduce

Steps to reproduce the behavior:

  1. unknown, seemingly random

A hunch that I thought I had, but still couldn't reproduce it in a stable way:

  1. delete ~/.ocis
  2. start ocis single binary (in my case it was with a WEB_ASSET_PATH set to a locally build web): WEB_ASSET_PATH=/path/to/web/dist OCIS_INSECURE=true ./bin/ocis server
  3. try to login via https://localhost:9200 (failing)
  4. have a 2022/01/20 10:23:35 [ERR] mdns: Failed to bind to udp port: listen udp4 0.0.0.0:0: socket: too many open files listen udp6 [::]:0: socket: too many open files
  5. restart ocis with same command as above
  6. try to login again (succeeding)
  7. see the Resource not found page as described above
  8. see the ocis error log as quoted below

Again, I don't

Expected behavior

User home gets provisioned reliably

Actual behavior

User home sometimes doesn't get provisioned

Setup

oCIS single binary from current master at fdb2f55b77532c1ea52452664d8f7577e2bf8805

Additional context

oCIS error log

2022-01-18T10:48:44+01:00 ERR failed to create storage space error="Decomposedfs: root has no parent" pkg=rgrpc request={"name":"Admin","opaque":{"map":{"space_id":{"decoder":"plain","value":"ZGRjMjAwNGMtMDk3Ny0xMWViLTlkM2YtYTc5Mzg4OGNkMGY4"}}},"owner":{"display_name":"Admin","gid_number":30000,"groups":["users"],"id":{"idp":"https://localhost:9200","opaque_id":"ddc2004c-0977-11eb-9d3f-a793888cd0f8","type":1},"mail":"admin@example.org","mail_verified":true,"opaque":{"map":{"roles":{"decoder":"json","value":"WyI3MTg4MTg4My0xNzY4LTQ2YmQtYTI0ZC1hMzU2YTJhZmRmN2YiXQ=="}}},"uid_number":20004,"username":"admin"},"type":"personal"} service=storage status={"code":15,"message":"error listing spaces","trace":"00000000000000000000000000000000"} traceid=00000000000000000000000000000000
2022-01-18T10:48:44+01:00 ERR error when calling Createhome error="gateway: grpc failed with code CODE_INTERNAL" service=proxy

OS

macOS 11.5.2 I only have 256 file descriptors configured, but then again, I'm not even sure that the too many open files mdns error is related in any way. Of course oCIS is not fully functioning when I have the error, but I'm not sure if this is the thing that leads to the error with the failing CreateHome.

wkloucek commented 2 years ago

Also happens on continuously deployed demo instances (eg. https://ocis.ocis-web.latest.owncloud.works/).

I did a dump of the user storage (storage-users.zip), that is similar to @kulmann s example, that can be use to reproduce the state oCIS is in after the CreateHome didn't succeed... Though it doesn't bring us futher to the answer why this happens at all.

tree ~/.ocis/storage/users/
/home/kloucek/.ocis/storage/users/
├── blobs
├── nodes
│   ├── 9bde6eee-1537-4228-a699-0a2afb6accc7
│   ├── ddc2004c-0977-11eb-9d3f-a793888cd0f8
│   └── root
│       ├── 4c510ada-c86b-4815-8820-42cdf82c3d51 -> ../4c510ada-c86b-4815-8820-42cdf82c3d51
│       └── ddc2004c-0977-11eb-9d3f-a793888cd0f8 -> ../ddc2004c-0977-11eb-9d3f-a793888cd0f8
├── spaces
│   ├── personal
│   │   └── ddc2004c-0977-11eb-9d3f-a793888cd0f8 -> ../../nodes/ddc2004c-0977-11eb-9d3f-a793888cd0f8
│   └── share
├── trash
└── uploads

12 directories, 1 file
wkloucek commented 2 years ago

@butonic can you please have a look again?

kulmann commented 2 years ago

@wkloucek did you see any too many open files log lines on the demo deployments? Would like to get a feeling if that's related or unrelated.

wkloucek commented 2 years ago

Possible fix: https://github.com/cs3org/reva/pull/2457

pascalwengerter commented 2 years ago

Side note: Started the WOPI deployment example today and ran into this issue multiple times

wkloucek commented 2 years ago

Side note: Started the WOPI deployment example today and ran into this issue multiple times

which version of oCIS did you use? (if latest, could you please post your docker image digest?)

pascalwengerter commented 2 years ago

Side note: Started the WOPI deployment example today and ran into this issue multiple times

which version of oCIS did you use? (if latest, could you please post your docker image digest?)

sha256:55f2b4d71249d06d45241b83780a396eea83aa6c1d906fbbd1e5f90bf9e373e9

wkloucek commented 2 years ago

55f2b4d71249d06d45241b83780a396eea83aa6c1d906fbbd1e5f90bf9e373e9

did you start from scratch (no volumes present before) or did you have pre-existing volumes? You can check that by running docker inspect ocis_web_ocis-data (which should also be name of the volume on your side). It will stat the creation date in the CreatedAt field.

Background: It was potentially fixed in https://github.com/cs3org/reva/pull/2457. It is already included in your image but if you have a old volume, the bug will be persisted forever...

pascalwengerter commented 2 years ago

55f2b4d71249d06d45241b83780a396eea83aa6c1d906fbbd1e5f90bf9e373e9

did you start from scratch (no volumes present before) or did you have pre-existing volumes? You can check that by running docker inspect ocis_web_ocis-data (which should also be name of the volume on your side). It will stat the creation date in the CreatedAt field.

Background: It was potentially fixed in cs3org/reva#2457. It is already included in your image but if you have a old volume, the bug will be persisted forever...

I thought I had removed the volumes before starting the newly pulled images, but "CreatedAt": "2021-10-14T09:50:18+02:00" suggests otherwise 🤔 weirddd

pascalwengerter commented 2 years ago

55f2b4d71249d06d45241b83780a396eea83aa6c1d906fbbd1e5f90bf9e373e9

did you start from scratch (no volumes present before) or did you have pre-existing volumes? You can check that by running docker inspect ocis_web_ocis-data (which should also be name of the volume on your side). It will stat the creation date in the CreatedAt field.

Background: It was potentially fixed in cs3org/reva#2457. It is already included in your image but if you have a old volume, the bug will be persisted forever...

Had some issues because volumes seemed not to be deleted the way I expected them to...now they're all fresh but the problem persists, yes

pascalwengerter commented 2 years ago

Just set up a new machine (and both pulled fresh docker images & compiled oCIS locally), still an issue

rpocklin commented 2 years ago

I have the issue with current owncloud/ocis:latest build (image id: 70d9f96b99e6) - seems to occur inconsistency (eg every 3 or 4th time when you provision a new docker image, even when deleting volumes).

After creating the container and logging in, (I was using admin:admin), every 3rd or 4th time the API request fails with a 404 (proxy makes request to storage-frontend which fails).

Screenshot 153524529-30244b9d-ac53-4acc-be89-8c4f40fa002f

Error Log (in container)

{"level":"debug","service":"storage","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.44.1-dev","from":"tcp://127.0.0.1:48818","uri":"/cs3.storage.registry.v1beta1.RegistryAPI/GetStorageProviders","start":"11/Feb/2022:01:30:07 +0000","end":"11/Feb/2022:01:30:07 +0000","time_ns":103300,"code":"OK","time":"2022-02-11T01:30:07Z","message":"unary"}

{"level":"error","service":"storage","pkg":"rgrpc","traceid":"00000000000000000000000000000000","error":"Decomposedfs: root has no parent","status":{"code":15,"message":"error creating space","trace":"00000000000000000000000000000000"},"request":{"opaque":{"map":{"space_id":{"decoder":"plain","value":"ZGRjMjAwNGMtMDk3Ny0xMWViLTlkM2YtYTc5Mzg4OGNkMGY4"}}},"owner":{"id":{"idp":"https://ocis.owncloud.test","opaque_id":"ddc2004c-0977-11eb-9d3f-a793888cd0f8","type":1},"username":"admin","mail":"admin@example.org","mail_verified":true,"display_name":"Admin","groups":["users"],"opaque":{"map":{"roles":{"decoder":"json","value":"WyI3MTg4MTg4My0xNzY4LTQ2YmQtYTI0ZC1hMzU2YTJhZmRmN2YiXQ=="}}},"uid_number":20004,"gid_number":30000},"type":"personal","name":"Admin"},"time":"2022-02-11T01:30:07Z","message":"failed to create storage space"}

{"level":"debug","service":"storage","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.44.1-dev","from":"tcp://127.0.0.1:43778","uri":"/cs3.storage.provider.v1beta1.ProviderAPI/CreateStorageSpace","start":"11/Feb/2022:01:30:07 +0000","end":"11/Feb/2022:01:30:07 +0000","time_ns":127600,"code":"OK","time":"2022-02-11T01:30:07Z","message":"unary"}

{"level":"debug","service":"storage","pkg":"rgrpc","traceid":"00000000000000000000000000000000","user-agent":"grpc-go/1.44.1-dev","from":"tcp://127.0.0.1:48750","uri":"/cs3.gateway.v1beta1.GatewayAPI/CreateHome","start":"11/Feb/2022:01:30:07 +0000","end":"11/Feb/2022:01:30:07 +0000","time_ns":1073000,"code":"OK","time":"2022-02-11T01:30:07Z","message":"unary"}

2022-02-11T01:30:07Z ERR error when calling Createhome error="gateway: grpc failed with code CODE_INTERNAL" service=proxy

2022-02-11T01:30:07Z DBG director found path=/remote.php/webdav/ policy=ocis prefix=/remote.php/ routeType=prefix service=proxy

{"level":"debug","service":"storage","pkg":"rhttp","traceid":"00000000000000000000000000000000","time":"2022-02-11T01:30:07Z","message":"access token is already provided"}

I did not see any error relating to too many open files in my case (running in Ubuntu under WSL2 in Windows 10).

See https://github.com/owncloud/ocis/issues/3152 for original (probable duplicate) bug report.

nealey commented 2 years ago

I seem to be able to reproduce this 100% of the time if the backing filesystem is a tmpfs.

For example, with the following argument to docker run

-v /tmp/ocis/data:/var/lib/ocis
butonic commented 2 years ago

@nealey IIRC tmpfs does not support extended attributes in the user.* namespace, which is required by the decomposed filesystem ocis uses by default: https://cateee.net/lkddb/web-lkddb/TMPFS_XATTR.html Try using a different filesystem, or use a docker volume.

butonic commented 2 years ago

Reading this thread it seems to either be the default (low) ulimit on macos or conservative oses, or a filesystem without extattr support. There have also been a lot of fixes with regards to file locking and atomicity, so this should no longer be a problem. Closing. Please open a new issue if you think you are still encountering this.