Open alexdutton opened 4 years ago
A couple of comments about some of the design choices:
storage_class
was originally intended as a property to be interpreted by the storage factory, and used for e.g. in the same storage system having offline/online files or reduced redundancy. This way, an integrity checker would know that it had to issue a fetch to get the offline file first, before being able to check integrity. Similarly, you might optimize storage costs, but having fewer replicas. That said, it never really got into use and is now always the same. So probably it should either be removed, or we need a real use case.
FileInstance
is storing the full URI to a file. It was a deliberate design choice to not tie it to a specific location, to ensure that files were not bound to a specific location. E.g. you could point a FileInstance URI into an old file hierarchy from another repository.
Location
was by design only meant as a way seed the initial location of a file (because we wanted a FileInstance to point directly to the file).
PyFilesystem is a storage abstraction layer that's able to access multiple backends, which makes it a bit confusing also having Files-REST FileStorage
class and a property on the db models called storage_class
. Thus a bit of clarification:
storage_class
is not actually pointing to e.g. a specific FileStorage class, but it was supposed to be interpreted by an implementation of FileStorage
.FileStorage
class defines the interface by which Invenio can get access to bits and bytes and defines the minimal operations (like open
, send_file
, checksum
, copy
).PyFSFileStorage
. This makes it easy to provide access to multiple backends, without having to implement a new Invenio FileStorage
class for each backend. However, implementing a new PyFilesystem backend is a big task that also involves e.g. implementing listdir
and other tons of other methods we don't need. Also, some PyFilesystem implementations are not very efficient, in that they store a file in temporary space first, before sending it to remote storage (this essentially kills performance and possible makes a server run out of memory for GB/TB sized files).I think the issue you point to is that it's hard today in Invenio to use multiple Invenio FileStorage
classes. You would have to implement your own custom storage factory, that inspects the URI. I'm not sure how an ideal solution looks like, that we can discuss further IRL.
As I understand it, the current design/implementation has the following:
FILES_REST_STORAGE_FACTORY
, the provided implementation beingpyfs_storage_factory
. As far as I can tell, this implementation isn't particularly/actually PyFS-specific.FILES_REST_DEFAULT_STORAGE_CLASS
)Bucket.default_storage_class
,FileInstance.storage_class
(both adb.Column(db.String(1))
, which implies it's never been used? see source)FileInstance.uri
. The location is not used after that, and so file URIs for filesystem-stored files are absolute paths on disk.invenio_files_rest.tasks:merge_multipartobject
callsmp.merge_parts()
, which has a**kwargs
that gets passed all the way through toFileInstance.storage()
, which defaults tostorage_class
being the one specfied byFILES_REST_DEFAULT_STORAGE_CLASS
. In this scenario, thestorage_class
on the FileInstance gets ignored.This seems a bit complicated, and doesn't yet necessarily meet its objectives of having a tidy interface and support for multiple storage backends.
What I would have expected:
Location
defines a storage class and a base path, e.g. (abstractly)(pyfs, /some/path)
,(s3, /bucket-name/some-path)
or(s3bucket, /some/path)
Bucket
provides for a default location for new files, defaulting to the global default locationFileInstance
has aLocation
and a path. The location defaults to the bucket at creation time. The base path for the bucket and the path for the file instance are combined and used by the storage backend defined on the locationFileInstance.storage()
takes no**kwargs
, and creates the storage class instance based on the above. Maybe it's a (cached?) property. Its behaviour is contained in a function that replaces the storage class factory and can be overridden, but generally an implementer would leave this alone unless they want to do something special.This GitHub issue came about because I was trying to work out how one would support hooking up to multiple S3 buckets, and my assumption was that each would be a location for each bucket, but it wasn't immediately clear how to integrate this into the existing framework.
Could we improve on how this all works before we have a public release?