owncloud / ocis

ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

Add metadata cache and propagation strategy on s3 #24

Closed: butonic closed this issue 3 years ago

butonic commented 5 years ago

To quickly answer which files changed we need to have an mtime and etag for directories. For s3 we cannot store metadata for keys that represent directories, because that metadata gets lost when adding a key to the prefix ... at least with minio that is the case. For local storage that supports extended attributes we can store the etag as an extended attribute. For local and s3 we need to do directory size accounting.

To enable stateless sync, mtime, etag and size need to be propagated up the tree. The data needs to be stored in the storage for persistence. A cache on top can then be used to improve query speed.
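
For the local storage case, a minimal sketch of what this propagation could look like on Linux, assuming extended attribute support; the attribute name user.ocis.etag, the paths and the helper propagateEtag are made up for illustration and are not the actual reva implementation:

    // Illustrative sketch only: propagate a fresh etag up to the storage root
    // by writing it as an extended attribute on every ancestor directory.
    package main

    import (
        "fmt"
        "path/filepath"
        "time"

        "golang.org/x/sys/unix"
    )

    func propagateEtag(root, dir string) error {
        for {
            // Derive a new etag from the current time; a real implementation
            // would also update mtime and the accounted directory size.
            etag := fmt.Sprintf("\"%d\"", time.Now().UnixNano())
            if err := unix.Setxattr(dir, "user.ocis.etag", []byte(etag), 0); err != nil {
                return err
            }
            if dir == root || dir == "/" {
                return nil
            }
            dir = filepath.Dir(dir)
        }
    }

    func main() {
        if err := propagateEtag("/var/ocis/data", "/var/ocis/data/alice/photos"); err != nil {
            fmt.Println("propagation failed:", err)
        }
    }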

This is related to being able to set arbitrary properties: https://github.com/owncloud/nexus/issues/28. Not all s3 implementations allow metadata (minio does not).

So a storage needs a metadata persistence strategy / implementation? Hm, what is the cs3 api for this? AFAICT it is implicit: when executing PROPFINDs, sync with the desktop clients will work if the etag changes ...

What about a propagation strategy? sync? async?

Tagging is modeled as a different service in cs3: https://github.com/cernbox/cs3apis/blob/master/cs3/tag/v0alpha/tag.proto AFAICT it needs an update to use CS3 References instead of filename strings.

As a cache, a k/v store like https://github.com/dgraph-io/badger makes sense. Can we split the actual storage metadata from the blob storage? That is kind of what would be necessary for s3 if we were to use it exclusively, anyway. For now, implement it for local and s3, then extract the common pieces?
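
A rough sketch of how a directory listing could then be answered from the cache using badger's prefix scans (assuming badger v1.6+ and a made-up key layout meta/<path>):

    // Illustrative sketch: list cached metadata entries below one directory
    // by iterating over all keys that share the assumed meta/<path> prefix.
    package main

    import (
        "fmt"
        "log"

        "github.com/dgraph-io/badger"
    )

    func main() {
        db, err := badger.Open(badger.DefaultOptions("/tmp/ocis-md-cache"))
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        prefix := []byte("meta/home/alice/photos/")
        err = db.View(func(txn *badger.Txn) error {
            it := txn.NewIterator(badger.DefaultIteratorOptions)
            defer it.Close()
            for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
                item := it.Item()
                if err := item.Value(func(v []byte) error {
                    fmt.Printf("%s -> %s\n", item.Key(), v)
                    return nil
                }); err != nil {
                    return err
                }
            }
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }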

butonic commented 5 years ago
// what is cached
// for localfs the acls / sharing permissions:
// - what did I share with whom
// - who shared what with me
// -> but this is for the share provider

// how often do we update the cache?

// what is the key?
// - the file id?
// - the path?

// do we need a fast fileid to path lookup?
// - for s3 only if we store the blobs by the fileid
// - for s3 how do we implement a tree in a kv store?
// - badger supports key iteration with prefix https://github.com/dgraph-io/badger#prefix-scans

// how can we make reva update metadata for a certain path?
// eos handles metadata itself, maybe ... what if we want to force an update?
// local/posix can use fsnotify
// s3 implementations vary:
// - minio has https://docs.min.io/docs/minio-bucket-notification-guide.html
// - aws has https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
// - ceph has http://docs.ceph.com/docs/master/radosgw/s3-notification-compatibility/

// in any case how does this affect the cache?
// - do we get all metadata to properly update the entry?
// - is it only an event that allows us to update the cache?
// -> AFAICT this is implementation specific:
//   - local only needs fsnotify to propagate the etag.
//     the fs dir entries can hold etag itself
//     (in contrast to s3 where we would have to introduce a dedicated namespace)
//     - etag as ext attr? or only for files? for folders in cache to prevent hot spot on disk?
//     - dirsum as ext attr? or only in cache?
//     - mtime for folders in cache?
//     - booting requires rebuilding cache? add a reva command for it?
//     - shares in cache? is a different service?
//     - tags as extended attributes?
//       - user defined tags vs system tags? system tags in kv store? but is a different service anyway
//     - comments? extended attributes too small
//       -> separate app that stores comments for a fileid
//       - everything is a file, store comments on filesystem so it can be eg geo distributed by eos or cephfs
//
//   - s3 is a different beast
//     - needs cache to list folders efficiently
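
To illustrate the local/posix notes above, a small sketch of how fsnotify events could trigger the cache/etag update for the affected directory; the watched path is a placeholder and nothing here is reva code:

    // Illustrative sketch: react to filesystem events and report which
    // directory would need etag/mtime propagation and a cache refresh.
    package main

    import (
        "log"
        "path/filepath"

        "github.com/fsnotify/fsnotify"
    )

    func main() {
        watcher, err := fsnotify.NewWatcher()
        if err != nil {
            log.Fatal(err)
        }
        defer watcher.Close()

        // fsnotify is not recursive; a real storage driver would have to add
        // a watch per directory.
        if err := watcher.Add("/var/ocis/data/alice"); err != nil {
            log.Fatal(err)
        }

        for {
            select {
            case event, ok := <-watcher.Events:
                if !ok {
                    return
                }
                if event.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove|fsnotify.Rename) != 0 {
                    log.Println("would propagate change for", filepath.Dir(event.Name))
                }
            case err, ok := <-watcher.Errors:
                if !ok {
                    return
                }
                log.Println("watch error:", err)
            }
        }
    }
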
butonic commented 5 years ago

Should we add the cache to the storageprovidersvc, or would that limit the integration possibilities with the actual storage implementation too much? Or would it make sense to configure the kv store as a standalone service and give storages access to it via an api, so that the actual kv store used can be changed, e.g. from an embedded kv to redis or quarkdb?

For now: a kv cache api can be added after we implement the cache for s3. That will tell us what calls we need in the first iteration.
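
A hypothetical sketch of what such a kv cache api could look like from the storage driver's point of view, so the backing store stays swappable; none of these names exist in reva:

    // Illustrative interface sketch for a swappable metadata cache backend.
    package metadata

    import (
        "context"
        "time"
    )

    // Entry holds the per-node metadata that has to be propagated for
    // stateless sync.
    type Entry struct {
        Etag  string
        Mtime time.Time
        Size  uint64
    }

    // Cache could be backed by an embedded kv store (badger), redis,
    // quarkdb, ...
    type Cache interface {
        Get(ctx context.Context, path string) (*Entry, error)
        Set(ctx context.Context, path string, e *Entry) error
        // Delete removes a single entry, e.g. after a key was removed in s3.
        Delete(ctx context.Context, path string) error
        // InvalidatePrefix marks a whole subtree as dirty so it is re-read
        // from the storage on the next access.
        InvalidatePrefix(ctx context.Context, prefix string) error
    }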

DeepDiver1975 commented 5 years ago

Will the kv store be persistent, or will it act as a cache only for faster access? Just asking for clearer understanding.

butonic commented 5 years ago

@DeepDiver1975 short answer: it depends.

Long answer: this is storage implementation dependent. The current s3 implementation for reva assumes the data in s3 adheres to a folder structure. I am planning to implement a persistent kv based cache for the metadata to get rid of constant metadata lookups. The current s3 implementation uses no cache. It has to invent mtimes and etags for folders and defaults to the 0 timestamp. This prevents the desktop client from constantly syncing the whole tree, while at the same time it allows using the web interface to navigate the s3 storage and to upload and download files. This is the basic storage functionality.
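
Roughly what inventing mtimes and etags for folders means when listing a prefix; a sketch assuming minio-go v7, with endpoint, bucket and credentials as placeholders:

    // Illustrative sketch: non-recursive listing of one "directory" prefix.
    // Common prefixes carry no metadata of their own, so the storage has to
    // fall back to an empty etag and the zero timestamp for folders.
    package main

    import (
        "context"
        "fmt"
        "log"
        "strings"
        "time"

        "github.com/minio/minio-go/v7"
        "github.com/minio/minio-go/v7/pkg/credentials"
    )

    func main() {
        client, err := minio.New("s3.example.com", &minio.Options{
            Creds:  credentials.NewStaticV4("ACCESSKEY", "SECRETKEY", ""),
            Secure: true,
        })
        if err != nil {
            log.Fatal(err)
        }

        objects := client.ListObjects(context.Background(), "ocis-bucket", minio.ListObjectsOptions{
            Prefix:    "home/alice/photos/",
            Recursive: false,
        })
        for obj := range objects {
            if obj.Err != nil {
                log.Fatal(obj.Err)
            }
            if strings.HasSuffix(obj.Key, "/") {
                fmt.Printf("dir  %s etag=%q mtime=%s\n", obj.Key, "", time.Time{})
                continue
            }
            fmt.Printf("file %s etag=%q mtime=%s\n", obj.Key, obj.ETag, obj.LastModified)
        }
    }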

The next level requires adding a cache to store the metadata and a way to update the cache. It is a real cache and can always be rebuilt from the s3 metadata. If the s3 product supports notifications we can update the cache and sync starts working. But that already is s3 product specific. A fallback might be a periodic scan, if the admin configures it and can afford the traffic (or does not have to pay for it).
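
For minio that could look roughly like the following, assuming minio-go v7's bucket notification listener; bucket, prefix and event names are placeholders:

    // Illustrative sketch: keep the metadata cache fresh from bucket
    // notifications instead of rescanning the whole bucket.
    package main

    import (
        "context"
        "log"

        "github.com/minio/minio-go/v7"
        "github.com/minio/minio-go/v7/pkg/credentials"
    )

    func main() {
        client, err := minio.New("s3.example.com", &minio.Options{
            Creds:  credentials.NewStaticV4("ACCESSKEY", "SECRETKEY", ""),
            Secure: true,
        })
        if err != nil {
            log.Fatal(err)
        }

        events := []string{"s3:ObjectCreated:*", "s3:ObjectRemoved:*"}
        for info := range client.ListenBucketNotification(context.Background(), "ocis-bucket", "home/alice/", "", events) {
            if info.Err != nil {
                log.Fatal(info.Err)
            }
            for _, record := range info.Records {
                // Here the cache entry for the changed object and its parent
                // directory would be invalidated or re-propagated.
                log.Println("cache update needed for", record.S3.Object.Key)
            }
        }
    }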

The next level would be an exclusive s3 storage where we only store the blobs in s3. Then the kv store would be the only place containing metadata. That would be the fastest solution, but now metadata and blob storage are kept separate. An option would be to store the metadata as objects in s3. That might be necessary for some s3 products to implement all capabilities; minio, e.g., does not support tagging.

Some more notes from my current dev branch:

    // first try the cache?
    // what to put there?
    // - metadata we need for propfind
    // - all we can reconstruct from the s3
    // when do we refresh?
    // - if the browser is used to get the files
    // - when the desktop polls we only use the cache
    // - when the browser checks, we go to the storage, and update the cache
    // - this needs the user agent from the original http request to be copied to the grpc request.
    // how can we manually update?
    // - a cli tool can stat a key / path in s3
    //  - if the etag is different than our cache we can propagate the change
    // - periodically scan all files?
    // should we respect cache-control headers?
    // - no ... how do we prevent requests from spamming the s3 api if someone scripts the requests and
    //   tries to ddos the service. -> rate limiting?
    // what about cache invalidation?
    // - 0 = unlimited, the default: we don't want automatic invalidation. it might cost money
    // - a day / week / month? configurable
    // - manual invalidation, so either admins or users can request a scan.
    //   - hm, that would lead to full scans, because we cannot mark a subtree as dirty ...
    // - it is rather how often do we want to update the metadata
    //   - a ttl, or
    //   - a manual update with a prefix that scans all keys with the prefix
    //     - this would allow subtrees to be updated.
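
If the answer is a ttl, badger can expire cache entries natively; a small sketch (badger v1.6+ assumed, key layout again made up):

    // Illustrative sketch: store a cache entry that expires on its own,
    // forcing a re-read from s3 on the next access. Omitting the ttl keeps
    // the entry forever, matching the "unlimited by default" idea above.
    package main

    import (
        "log"
        "time"

        "github.com/dgraph-io/badger"
    )

    func main() {
        db, err := badger.Open(badger.DefaultOptions("/tmp/ocis-md-cache"))
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        err = db.Update(func(txn *badger.Txn) error {
            e := badger.NewEntry([]byte("meta/home/alice/photos/cat.jpg"), []byte(`{"etag":"abc123"}`)).WithTTL(7 * 24 * time.Hour)
            return txn.SetEntry(e)
        })
        if err != nil {
            log.Fatal(err)
        }
    }
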
butonic commented 5 years ago

Some thoughts after initial implementation work:

butonic commented 5 years ago

minio recommends the AssumeRole API (or the relevant aws docs and the AWS access control overview) instead of object acls: https://github.com/minio/minio/issues/4496#issuecomment-417874753. Object acls seem to be a legacy way to specify permissions even on aws.
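
For reference, obtaining temporary credentials via the STS AssumeRole API with minio-go v7 looks roughly like this; endpoint and keys are placeholders, and whether this fits the reva storage driver is an open question:

    // Illustrative sketch: exchange static keys for temporary credentials via
    // STS AssumeRole instead of relying on object acls.
    package main

    import (
        "log"

        "github.com/minio/minio-go/v7"
        "github.com/minio/minio-go/v7/pkg/credentials"
    )

    func main() {
        creds, err := credentials.NewSTSAssumeRole("https://s3.example.com", credentials.STSAssumeRoleOptions{
            AccessKey:       "ACCESSKEY",
            SecretKey:       "SECRETKEY",
            DurationSeconds: 3600,
        })
        if err != nil {
            log.Fatal(err)
        }

        client, err := minio.New("s3.example.com", &minio.Options{
            Creds:  creds,
            Secure: true,
        })
        if err != nil {
            log.Fatal(err)
        }
        _ = client // requests made with this client use the temporary credentials
    }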

refs commented 3 years ago

@butonic can this be closed? Or should it be moved elsewhere? Is it still relevant?

dragotin commented 3 years ago

I think it can be closed as it is implemented by @butonic and @aduffeck. Please reopen if I am wrong.