owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

Upgrading from 3.0.0-rc3 to 3.0.0 causes loss of all Spaces and users don't have access to their own personal files #6518

Closed flamingm0e closed 1 year ago

flamingm0e commented 1 year ago

Describe the bug

Upgraded from 3.0.0-rc3 to 3.0.0 to see if a bug was fixed that caused multiple folders to show up after creating one.

After running the new docker container with the same settings as before, my personal user has lost access to all personal files, and all Spaces have been deleted.

Steps to reproduce

Steps to reproduce the behavior:

  1. Running 3.0.0-rc3
  2. deploy new 3.0.0 container
  3. login

Expected behavior

I expect all files to still be present

Actual behavior

All Spaces were deleted, and users have no access to files.

Setup

I am using basic OCIS docker config behind Caddy 2

Additional context

Rolling back to rc3 allows me to access my user files again, but all Spaces are still missing.

kobergj commented 1 year ago

@flamingm0e I could not reproduce the behaviour. Updates from 3.0.0-rc.3 to 3.0.0 work fine.

I can, however, recreate the same problem when downgrading from 3.0.0 to 3.0.0-rc3. The reason is a breaking change in the metadata backend. See the release notes:

Note

The metadata store in the DecomposedFS has changed

When you upgrade from 2.0.0 to 3.0.0-rc.1 or later and if you didn't set OCIS_DECOMPOSEDFS_METADATA_BACKEND manually, ocis will change the storage of the file metadata from using extended attributes (xattrs) to messagepack (messagepack).

This decision was made because extended attributes are limited and have some issues using shared filesystems. Messagepack is a straightforward binary format.

At least in my test, 3.0.0-rc3 still uses xattrs, while 3.0.0 uses messagepack. I can see errors like

"xattr.get /.ocis/storage/users/spaces/dd/dfd866-7656-4601-86da-9146131faaae/nodes/dd/df/d8/66/-7656-4601-86da-9146131faaae user.ocis.name: no data available"

in the ocis logs after the downgrade.

Maybe you have a similar issue? Do you have the envvar OCIS_DECOMPOSEDFS_METADATA_BACKEND set in your environment?
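One way to check whether a given node still carries xattr metadata is to read the attribute named in the error above. The following is a minimal sketch, assuming a Linux host with direct access to the storage directory; the helper name `read_xattr` is hypothetical and not part of ocis. A missing attribute is consistent with the metadata living in the messagepack backend, but it does not prove it.

```python
import os


def read_xattr(path, name="user.ocis.name"):
    """Return the raw value of an extended attribute, or None.

    None is returned when the attribute is missing, the path does not
    exist, the filesystem does not support xattrs, or the platform has
    no os.getxattr (it is only available on Linux).
    """
    getxattr = getattr(os, "getxattr", None)
    if getxattr is None:
        return None
    try:
        return getxattr(path, name)
    except OSError:
        # Covers "no data available", unsupported filesystems,
        # and nonexistent paths alike.
        return None
```

Calling `read_xattr(node_path)` on a node directory from the error message would return the stored name as bytes if the xattr is present, and `None` otherwise.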

flamingm0e commented 1 year ago

Those errors do exist in my logs.

I do not have any of the 3 envvars that are listed in the upgrade documentation. I guess that since rc3 was still using xattrs, I need to go through the same steps as if I were upgrading from 2.0 to 3.0?

kobergj commented 1 year ago

Shouldn't be necessary if you had a running 3.0.0-rc.3. But I'm wondering why it failed on the upgrade for you.

Could you try running 3.0.0 with OCIS_DECOMPOSEDFS_METADATA_BACKEND="messagepack"?

flamingm0e commented 1 year ago

Tested with the instructions from the upgrade documentation and with OCIS_DECOMPOSEDFS_METADATA_BACKEND="messagepack", and I get the same problem. When trying to log in from the app on my Android device, it never authorizes, and says "it was not found". The server logs state:

{"level":"error","service":"idp","error":"ldap identifier backend get user error: user does not exist or too many entries returned","time":"2023-06-14T14:44:13.172535732Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/log/logrus_wrapper.go:50","message":"IdentifierIdentityManager: fetch failed to get user from userID"}

But I can log in from a web browser. Once logged in there, I have no personal files, no shares, and no spaces available.

{"level":"error","service":"proxy","error":"gateway: grpc failed with code CODE_INTERNAL","time":"2023-06-14T14:44:15.115376744Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/create_home.go:74","message":"error when calling Createhome"}

flamingm0e commented 1 year ago

WOW.

Reverted back to rc3 again....now EVERYTHING is gone, except my users. FML. I guess I get to restore user files from my RCLONE jobs now.

kobergj commented 1 year ago

You need to change OCIS_DECOMPOSEDFS_METADATA_BACKEND to xattrs if you want to use rc3 again. My guess is that something went wrong in metadata migration.

You can also try running 3.0.0 with the xattrs backend.

flamingm0e commented 1 year ago

and now it won't start.

kobergj commented 1 year ago

Which version didn't start? And in which configuration?

mmattel commented 1 year ago

The upgrade documentation strongly recommends doing a full backup first, which is necessary to avoid data loss when a revert is needed. Partial reverts are neither supported nor possible. Could you do a full restore from backup, follow the upgrade guide, and tell us where the issue occurs?

flamingm0e commented 1 year ago

I'm rolling back my zfs snapshot to last known good working configuration. I will begin testing from there.

flamingm0e commented 1 year ago

Rolling back to a ZFS snapshot has gotten me back to normal.

I will snapshot and proceed to try another upgrade, utilizing the ENVVARS as discussed.

kobergj commented 1 year ago

:+1: Probably something went wrong during the xattrs -> messagepack migration. It happens automatically when you start 3.0.0 for the first time. Keep an eye on the logs; maybe they will tell us what went wrong the first time.

flamingm0e commented 1 year ago

Same errors exist in the logs on the upgrade. Same problems with missing everything on user and Spaces.

At this point, it's probably better to just start over. This is frustrating.

kobergj commented 1 year ago

Which error are you talking about? What happens on initial start of 3.0.0 ? You should see a log like

"Migrating to messagepack metadata backend..."

probably followed by an error.

flamingm0e commented 1 year ago

Which error are you talking about? What happens on initial start of 3.0.0 ? You should see a log like

"Migrating to messagepack metadata backend..."

probably followed by an error.

The same errors I previously mentioned:

https://github.com/owncloud/ocis/issues/6518#issuecomment-1591371900

I never saw anything about messagepack, but that could be because as soon as it started, it started the other 2 errors and spammed everything.

kobergj commented 1 year ago

I never saw anything about messagepack, but that could be because as soon as it started, it started the other 2 errors and spammed everything.

Strange. It shouldn't try to get a user when nobody tries to log in.

{"level":"error","service":"idp","error":"ldap identifier backend get user error: user does not exist or too many entries returned","time":"2023-06-14T14:44:13.172535732Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/log/logrus_wrapper.go:50","message":"IdentifierIdentityManager: fetch failed to get user from userID"}

This is the important error, the other one is just a follow-up. Does it start to throw this error directly when you start ocis? I mean before you login?
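To see which errors dominate a noisy log, the JSON lines can be grouped by service and message. This is a minimal sketch that assumes the logs are one JSON object per line, as in the excerpts quoted in this thread; the function name `summarize_errors` is hypothetical.

```python
import json
from collections import Counter

# The two error lines quoted earlier in this thread.
SAMPLE_LOG = [
    '{"level":"error","service":"idp","error":"ldap identifier backend get user error: user does not exist or too many entries returned","time":"2023-06-14T14:44:13.172535732Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/log/logrus_wrapper.go:50","message":"IdentifierIdentityManager: fetch failed to get user from userID"}',
    '{"level":"error","service":"proxy","error":"gateway: grpc failed with code CODE_INTERNAL","time":"2023-06-14T14:44:15.115376744Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/create_home.go:74","message":"error when calling Createhome"}',
]


def summarize_errors(lines):
    """Count error-level log entries per (service, message) pair."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip anything that is not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "error":
            key = (entry.get("service", "?"), entry.get("message", ""))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (service, message), n in summarize_errors(SAMPLE_LOG).most_common():
        print(f"{n:6d}  {service}: {message}")
```

On a real log, the most frequent pairs show at a glance whether the idp error or the follow-up proxy error is the one flooding the output.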

flamingm0e commented 1 year ago

It is likely throwing the error immediately because I have an RCLONE webdav connection to my server, and I have the app on my phone, so multiple devices trying to login immediately when it comes back online. I would have to pull my devices off, wife, kid, etc, so that nothing is trying to login to quiet the logs. It's a huge coordinated effort, sadly. I'll see if I can find some time to run through the scenarios this weekend.

All I wanted was a self hosted Google Drive replacement...and I love the simplicity of configuring OCIS (and the speed without the overhead of all the different components), but man, when it fails, it fails.

kobergj commented 1 year ago

I see. This could be the problem. If someone logs in during xattrs -> messagepack migration it could potentially break the respective user.

I'll try to reproduce this tomorrow...

flamingm0e commented 1 year ago

I may have time tomorrow as well to try and get everyone logged out and try again.

kobergj commented 1 year ago

I could not reproduce it so far by spamming the server with requests. But I still think something goes wrong during the migration, because I can reproduce the exact same behaviour by just skipping the migration.

If you don't want to log out all your users, you can try running ocis with OCIS_RUN_SERVICES: "storage-users,nats". This starts only the storage-users service (which does the migration) and nats. Your clients won't be able to connect because the proxy is not running. Maybe we can see the migration logs then; they should tell us what went wrong.

If there are no errors, remove the envvar and restart ocis. If the problem is still there it was not the messagepack migration.

flamingm0e commented 1 year ago

OK. I configured that ENVVAR and started it up (after doing a snapshot with OCIS off, of course)

I finally see the migrating to messagepack message.

{"level":"info","root":"/var/lib/ocis/storage/users","time":"2023-06-16T12:21:49.441395153Z","caller":"github.com/cs3org/reva/v2@v2.14.0/pkg/storage/utils/decomposedfs/migrator/0003_switch_to_messagepack_metadata.go:45","message":"Migrating to messagepack metadata backend..."}

That seems like a good sign that I can actually see that, and not the errors spamming the logs. That's helpful, thank you.

How long would this process take? I only have a couple hundred GB of data in there, but lots of small files. Should I see a message when completed that it's done?

kobergj commented 1 year ago

Yes there should be a log like

{"level":"info","time":"2023-06-16T12:30:05.267986656Z","caller":"github.com/cs3org/reva/v2@v2.14.0/pkg/storage/utils/decomposedfs/migrator/0003_switch_to_messagepack_metadata.go:106","message":"done."}

Unfortunately I can't give you an ETA. But the size of the files doesn't matter. Only the amount is relevant as it needs to rewrite metadata for each.
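Since the migration prints one log line when it starts and another when it finishes, a log can be scanned for both to tell whether the migration is still running. The sketch below assumes the message texts and the `caller` path quoted in this thread; the names `parse_time` and `migration_window` are hypothetical. The two sample lines come from different runs, so the elapsed time they yield is purely illustrative.

```python
import json
from datetime import datetime, timezone

START_MSG = "Migrating to messagepack metadata backend..."
DONE_MSG = "done."
MIGRATOR_HINT = "switch_to_messagepack_metadata"  # part of the "caller" field

# Start line from flamingm0e's log, done line from kobergj's example.
SAMPLE_LOG = [
    '{"level":"info","root":"/var/lib/ocis/storage/users","time":"2023-06-16T12:21:49.441395153Z","caller":"github.com/cs3org/reva/v2@v2.14.0/pkg/storage/utils/decomposedfs/migrator/0003_switch_to_messagepack_metadata.go:45","message":"Migrating to messagepack metadata backend..."}',
    '{"level":"info","time":"2023-06-16T12:30:05.267986656Z","caller":"github.com/cs3org/reva/v2@v2.14.0/pkg/storage/utils/decomposedfs/migrator/0003_switch_to_messagepack_metadata.go:106","message":"done."}',
]


def parse_time(stamp):
    """Parse a UTC timestamp with nanoseconds, truncated to microseconds."""
    base, _, frac = stamp.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]
    parsed = datetime.strptime(f"{base}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def migration_window(lines):
    """Return (started, finished); either is None if not seen yet."""
    started = finished = None
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # not a JSON log line
        if not isinstance(entry, dict) or "time" not in entry:
            continue
        if MIGRATOR_HINT not in entry.get("caller", ""):
            continue  # only look at lines emitted by the migrator
        if entry.get("message") == START_MSG:
            started = parse_time(entry["time"])
        elif entry.get("message") == DONE_MSG:
            finished = parse_time(entry["time"])
    return started, finished


if __name__ == "__main__":
    started, finished = migration_window(SAMPLE_LOG)
    if started and not finished:
        print("migration still running since", started)
    elif started and finished:
        print("migration took", finished - started)
```

If `started` is set and `finished` is still `None`, the migration has not completed and ocis should be left running.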

flamingm0e commented 1 year ago

perfect. Thank you.

I will wait and see what happens.

flamingm0e commented 1 year ago

Thank you for all your assistance.

It appears the cause of my problems was that I didn't know when migration was working, or complete. After over 5 hours, it finally finished, I removed that ENVVAR and fired it back up normally. Everything is as it should be at this time. I thought I was going to have to start over.

mmattel commented 1 year ago

@flamingm0e thanks for your input. I will file an update to our upgrade guide with the info provided as soon as possible.

mmattel commented 1 year ago

One more thing: would you mind telling us how many files were affected by the upgrade?

flamingm0e commented 1 year ago

I have a ton of files. I don't know how many. Is there a quick way to figure that out?
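A rough count can be obtained by walking the storage directory. This is only a sketch: it assumes direct filesystem access to the storage root shown in the migration log (`/var/lib/ocis/storage/users`), and the helper name `count_entries` is hypothetical. How on-disk entries map to the nodes rewritten by the migration is not stated in this thread, so treat the result as an order-of-magnitude estimate.

```python
import os


def count_entries(root):
    """Return (files, dirs) found below root, without following symlinks."""
    files = dirs = 0
    for _path, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs


if __name__ == "__main__":
    # Path taken from the "root" field of the migration log line above.
    files, dirs = count_entries("/var/lib/ocis/storage/users")
    print(f"{files} files, {dirs} directories")
```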