Closed mateusz closed 8 years ago
@dhensby This RFC is targeting branch 4 now (next major) :-)
If a DO stores a tuple of a file and I want to link the same file to another DO.
It's effectively a different file resource, because the tuple is different (specifically, the ParentObject). However see the comment above: it might still be stored as a single data blob in the backend, because some APLs will ignore ParentObject reference. So: it is different from the Framework perspective, but might be the same piece of data from the APL perspective.
Well, I guess that ties back to, how do we compare different ParentObjects for uniqueness. I don't know yet, it's a conversation that will get resolved when we try to do some coding :-)
Versioning is up to the higher-level systems (i.e. Versioned extension). Different tuple means a different file resource, so as long as it differs between versions (i.e. by Hash or Filename or whatever), you can accomplish versioning.
The Files&Images section will (hopefully) work as it currently does - we won't actually be implementing versioning there, or adding draft support. The `File` hierarchy will still be our "central filestore", except that now it will delegate data manipulation to the APL. There will still be no symlinking capability. If you create a `File`, it will refer to the data via the `FileReference`. If you update a `File`, it will need to decide if the `FileReference` changed.
Would it be more clear if the `FileReference` was called `DataReference`?
Cam published the meetup talk about this RFC. Excuse the accent :-)
Maybe `ContentReference`? Extrapolating things out, conceivably you could use a tuple format to reference any kind of content, but `DataReference` feels a bit too pigeon-holed.
Other than versioning and private files: is there anything Flysystem + relative database file paths do not handle? Trying to understand the needs/priorities for the added complexity. :)
@assertchris by definition, any asset persistency system that needs/wants to use hashes :-) But versioning is important - it's a basic premise of this RFC. If that premise is challenged, a new RFC needs to be written :-)
@mateusz Out of my depth, in here. Moving along... :)
So besides arguing about which semantic version this goes into or what the low level API is no-one has been screaming blue murder?
No murder. Even better: the core-committers approved this, so the games may begin. I mean implementation. I hope we can start implementing this soon! Yay! :fireworks:
One query from the implementation side of things... is it envisaged that there will be a set of explicit `interface`s defined for a concrete implementation to be written against, or an implied interface set based on an initial concrete implementation?
No updates for a bit, is this underway or on deck soon? I think it would increase SS adoption significantly since horizontal scaling is quite important and tricky without this.
@clyonsEIS I've clarified the milestones and added the pretty multicolour labels, so anyone coming from UserVoice can quickly see the status.
https://silverstripe.uservoice.com/forums/251266-new-features/suggestions/6572660-asset-abstraction
I've created a PR for initial implementation of this feature (not including updating of File dataobject). Or things like file variants.
https://github.com/silverstripe/silverstripe-framework/issues/4599
Second pull request is at https://github.com/silverstripe/silverstripe-framework/pull/4652
@tractorcow I think this has now been completed?
Yes this is done!
Note to readers: the body of this RFC contains the current proposal, fully updated based on feedback. It's not necessary to read the comments, however they might aid your understanding and contain answers to some of your questions.
RFC-1 Asset abstraction
Purpose and outcome
This RFC proposes a change to the SilverStripe Framework to add flexibility to the asset subsystem.
The community is welcome to give feedback. The proposal will be submitted to the core-committers for approval. We expect it will evolve further during development, and it will eventually form a basis for an architecture document.
Motivation
The driver for this change is solving the following problems we keep encountering in practice:
We currently cannot accomplish this due to the monolithic approach to file handling in the Framework. This capability is currently expressed as a set of specialised `DataObject`s: `File`, `Folder` and `Image`, which implicitly tie the database representation to the local filesystem, preventing any flexibility.
Additionally, we self-impose the following requirements which we see as important:
Proposal
To solve these problems we propose to:
[…] (`DataObjects`) to model properties (`DBFields`).
Specifically, we would do so by introducing a new concept of a generalised file reference - a set of values (i.e. a tuple) uniquely pointing to a file regardless of the storage backend in use.
Then we would change the model representation of the files: instead of using the coarse-grained data objects such as `File` we would move to using the `$db` fields. For this a new `DBField` would be introduced, called `FileReference`.
The `File` and `Folder` would retain their APIs as much as possible, but be retrofitted to use the `FileReference`. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS.
`UploadField` would be updated to be able to work directly with `FileReference`s, in addition to the existing has_one support.
Finally, we would remove all direct filesystem operations from the Framework and place them behind the new Asset Persistence Layer (APL) interface. This RFC does not define how an APL should work under the hood. Neither does it define the architecture nor interfaces of the storage backends. An APL is tied to the data stored by that APL, and changing the APL will cause data loss without a migration script.
This RFC is targeted at the next major release.
File reference tuple
File reference is a set of values (i.e. a tuple) describing a file uniquely, and independently of a storage backend. It must be generalised to allow different kinds of backends. We envisage the following values to be included in this tuple:
[…] `File` reference
The available values are not extensible. Backends may use any or all values of the tuple.
The tuple itself is not represented in the codebase directly as a class or an object, but manifests in the `FileReference` as stored fields and in the discrete APL operation parameters and return values.
FileReference
`FileReference` is a mechanism for expressing the file reference in the code and for storing it in the database. It would most likely be implemented as a packed or composite `DBField`, stored in table columns. The ParentObject does not need to be stored because it can be derived from the containing data object.
Additionally, the `FileReference` acts as a convenient proxy object to the APL interface, since it has the ability to introspect all the necessary values (the APL may still be used directly, albeit more verbosely). By necessity, the `FileReference`'s API would follow the APL interface as laid out in "Appendix A: APL interface".
`FileReference` will deal internally with some of the plumbing required by the APL, such as passing the ParentObject reference, or storing the returned, possibly updated, tuple values (see the "Asset Persistence Layer (APL)" chapter for code samples).
`FileReference` can be attached to any DataObject - not only to `File`s.
If you want specific values in the tuple, you will need to initialise them before storing any data. As an example, you might want to give the APL a hint that you require a specific Filename.
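To make the "initialise before storing" idea concrete, here is a minimal sketch. It is written in Python purely for illustration (the Framework itself is PHP), and the class layout, field names and `as_tuple` helper are assumptions, not the proposed `FileReference` API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FileReference:
    """Hypothetical stand-in for the proposed DBField; not the real class."""
    file_hash: Optional[str] = None
    variant: Optional[str] = None
    filename: Optional[str] = None

    def as_tuple(self, parent_object: object) -> Tuple:
        # ParentObject is not stored in the field; it is supplied by the
        # containing data object whenever the tuple is needed.
        return (self.file_hash, self.variant, self.filename, parent_object)


# Give the APL a Filename hint before any data is stored.
ref = FileReference(filename="reports/annual.pdf")
hinted_tuple = ref.as_tuple("Page#1")
```

The remaining values stay unset, which leaves the APL free to fill them in when the data is written.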
See the "Appendix B: FileReference class mockup" for the proposed layout of that class.
Asset Persistence Layer (APL)
All existing direct file operations in the Framework would be rewritten to use the APL, either indirectly through the `FileReference`, or directly. A concrete APL would be configured once per site using dependency injection.
The APL requires all file reference tuple values to be passed to all its operations.
Expanding on the previous example:
See the "Appendix A: APL interface" for the proposed interface to the APL.
Using parameters is at the APL discretion - it would be perfectly legal for an implementation to ignore some of the values as long as it can ensure uniqueness.
Additionally, APL setters may modify the tuple values passed to it to ensure consistency. Callers have to inspect the return value and make sure to update their understanding of the file accordingly. This would for example be used in the situation of filename conflicts. Also see the "Conflict resolution" chapter below.
A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. The caller can then expect the APL implementation to generate the value. This would be used when creating new files to obtain a hash.
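The setter contract described above (pass the full tuple, accept a null value, keep whatever comes back) could look roughly like this. This is a Python sketch under stated assumptions: `InMemoryAPL`, its method names and the use of SHA-1 are all hypothetical, and a real APL would write to a storage backend rather than a dict.

```python
import hashlib
from typing import Dict, Optional, Tuple

FileTuple = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


class InMemoryAPL:
    """Hypothetical APL that keeps data in a dict; illustration only."""

    def __init__(self) -> None:
        self._store: Dict[FileTuple, bytes] = {}

    def set_from_string(self, data: bytes, file_hash: Optional[str],
                        variant: Optional[str], filename: Optional[str],
                        parent_obj: Optional[str]) -> FileTuple:
        # A null hash asks the APL to generate the value itself.
        if file_hash is None:
            file_hash = hashlib.sha1(data).hexdigest()
        key = (file_hash, variant, filename, parent_obj)
        self._store[key] = data
        # The caller must keep the returned, possibly updated, tuple.
        return key

    def get_as_string(self, file_hash: Optional[str], variant: Optional[str],
                      filename: Optional[str],
                      parent_obj: Optional[str]) -> bytes:
        return self._store[(file_hash, variant, filename, parent_obj)]


apl = InMemoryAPL()
stored = apl.set_from_string(b"hello", None, None, "hello.txt", "Page#1")
```

Reading the data back only works with the returned tuple, which is why callers have to update their understanding of the file after every write.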
Internally, an APL's storage format should be treated as proprietary to that APL. APLs are interchangeable in terms of the API, but not the data already stored - a data migration script would be necessary for that.
See "Appendix C: Trivial APL method" for some contextual pseudocode.
Storage backends
This RFC does not define how a concrete APL would work under the hood. The architecture and the interfaces of the storage backends are left to the implementor.
That said, we expect Flysystem will be useful for this.
Simple APL
A default, mostly backwards-compatible implementation of the APL would be built as part of this RFC. With this APL, the apparent filesystem on disk would remain as it currently is, to allow easier migration of existing environments.
The Hash value in the tuple would be ignored by this implementation, and Flysystem would most likely be used as its backend.
Although the Simple APL would support different backends through Flysystem, because of the problems described in the "Asynchronous APL API" and the "Performance" chapters, we wouldn't recommend using the S3 backend here.
Other considerations
Handling directories
A directory (a "folder") is a concept that is not reflected in the APL interface. APL operations can only handle file references.
Under the hood however, a concrete APL may use the tuple values provided in the method calls to reconstruct the directories (specifically, the ParentObject might be used to recover the structure). It is entirely up to the implementation to handle this.
On the other end of the spectrum, the Files & Images' notion of "folders" will remain expressed as `Folder` data objects, but their `FileReference` would be set to null (again, that's because the APL does not handle folders explicitly).
Conflict resolution
A conflict may occur if an attempt is made to write data to a tuple which already exists. The APL caller can pass their resolution preference through the `$conflictResolution` parameter.
The APL may support the following resolution approaches:
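As a rough illustration of how a caller-supplied preference might be honoured, here is a minimal Python sketch. The "exception" and "rename" modes are the ones named in the "Asynchronous APL API" chapter of this RFC; the function, the constant names and the `-v2` suffix scheme are hypothetical.

```python
from typing import Dict

CONFLICT_EXCEPTION = "exception"  # hypothetical constant names
CONFLICT_RENAME = "rename"


def resolve_filename(existing: Dict[str, bytes], filename: str, mode: str) -> str:
    """Return the filename to write to, honouring the caller's preference."""
    if filename not in existing:
        return filename
    if mode == CONFLICT_EXCEPTION:
        raise FileExistsError(filename)
    if mode == CONFLICT_RENAME:
        stem, dot, ext = filename.rpartition(".")
        if not dot:
            # No extension present: treat the whole name as the stem.
            stem, ext = filename, ""
        counter = 2
        while True:
            candidate = f"{stem}-v{counter}{dot}{ext}"
            if candidate not in existing:
                return candidate
            counter += 1
    raise ValueError(f"unsupported conflict resolution: {mode}")


existing_files = {"logo.png": b"", "logo-v2.png": b""}
renamed = resolve_filename(existing_files, "logo.png", CONFLICT_RENAME)
```

With "rename", the caller receives a different Filename from the one it asked for, which is exactly the case where the returned tuple has to be stored.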
Derived files
The `Image` class provides file processing services. Files derived in this way are not represented in the database and are currently written to disk to an arbitrary location by the `GDBackend`.
To keep this RFC focused on the file abstraction, we propose to do the minimum adjustment needed to remove the direct filesystem operations while avoiding refactoring of the processing subsystem.
The `FileReference` would allow us to obtain a derived file reference (which is not stored in the database).
Note the APL should not change tuple values for derived files because we have no place to store them on the consuming side.
Images in content
These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.
Rationale
How does this solve the mentioned problems
Changing root path for asset storage
With all filesystem operations abstracted away, the default APL can include a configuration parameter for setting the filesystem root path.
Amazon S3 and clustered sites
The access pattern used by an APL can be non-trivial. An example of a more sophisticated approach to persistency is a load-balanced site using S3 as a primary storage, and the filesystem as a secondary storage. Such a site would have the following problems:
Here is a sequence diagram illustrating a cascading operation of such an APL. The backends are internal to the APL and invisible to the userspace code.
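The cascade can also be expressed in code. The following Python sketch assumes a two-tier arrangement in which a node-local store sits in front of a shared primary store; the class name and the plain dicts standing in for S3 and the local filesystem are hypothetical.

```python
from typing import Dict, Optional


class CascadingStore:
    """Hypothetical two-tier store: a local copy in front of a primary store."""

    def __init__(self, primary: Dict[str, bytes], local: Dict[str, bytes]) -> None:
        self.primary = primary  # stands in for S3, shared by all nodes
        self.local = local      # stands in for this node's filesystem

    def write(self, key: str, data: bytes) -> None:
        # Write to the primary first so that other nodes can see the file.
        self.primary[key] = data
        self.local[key] = data

    def read(self, key: str) -> Optional[bytes]:
        if key in self.local:
            return self.local[key]
        data = self.primary.get(key)
        if data is not None:
            # Populate the local copy for subsequent reads on this node.
            self.local[key] = data
        return data


shared = {}
node_a = CascadingStore(shared, {})
node_b = CascadingStore(shared, {})
node_a.write("a1b2/photo.jpg", b"jpeg-bytes")
seen_by_b = node_b.read("a1b2/photo.jpg")
```

Userspace code only ever talks to the APL; which tier actually served the data is an internal detail.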
Versioned files
The APL does not deal with file versions, but it has the potential to store many distinct files with the same name thanks to the presence of the Hash in the file reference tuple.
The `FileReference` serialises to the database as strings, so it may be versioned just as any other database field. This paves the way to using the `Versioned` extension directly on `File`.
Private files
Once we switch to the File References, using `File` objects to manage files would no longer be obligatory. It would now be possible to bypass `File` and "attach" files to arbitrary `DataObject`s using the File Reference `DBField` directly.
This would mean opting out of the usual functionality - the file would no longer appear in the "Files & Images" section. Custom methods to manipulate the assets would need to be built, or the `UploadField` could be used to directly interface with the `FileReference`s.
Files with workflow
Essentially a mix of the file versioning and private files approaches could be used to accomplish the workflow. Since files are now treated as model properties, we have the freedom to create new models and interfaces. We can also easily move files between the models by copying the tuple - e.g. from the "Workflow" model into the "Files & Images".
Alternative proposals
Common objections
Deferred & rejected features
Deferred: Rewriting derived files subsystem into a separate API
Deferred to keep this RFC focused. See "Derived files" chapter.
Rejected: Versioned APL
We have decided the APL is not the right layer to implement file versioning and that it should be implemented on a higher level. This RFC will make it possible to accomplish versioning of both folders and files.
See "How does this solve the problems" chapter.
Rejected: Tuple as an object
This would be too complex and ultimately make the conceptual model too convoluted.
The drawback of not doing this is that in this proposal the `FileReference` must act both as the tuple representation in code (with all tuple manipulation methods), and as a database field representation.
This violates the single-responsibility principle and leads to edge-case problems such as the one presented in the "Derived files" chapter, where a crippled `FileReference` is produced that does not represent any database entity (which is counter-intuitive, because it's a `DBField` after all).
Rejected: Extending the tuple with more values
We think the tuple is such a fundamental representation of the file resource that it should be fixed (just as a `Varchar` representation is fixed). The tuple was designed to contain enough specificity to allow for all use cases listed in the "Motivation" chapter.
Allowing it to be extended with custom values would cause module compatibility problems, and we want the APLs to be interchangeable.
Rejected: Leaving derived files subsystem as it is (with direct local filesystem operations)
This wouldn't work for clustered sites. File derivation isn't always triggered on demand, which means some derived files will only be available on some local filesystems. Passing the derived files into the APL makes sure the files are always treated consistently.
For the same reason this won't work for deployments where there is no local persistent storage available (e.g. AWS deployment utilising only S3).
Rejected: Asynchronous APL API
Writes to a backend might take a long time, and it's unclear what a "rename" or "exception" conflict resolution should do if we execute a subsequent write to the same resource reference without waiting for completion. In this case we could end up losing data.
To solve this we could introduce an asynchronous API for the APL. A complete tuple corresponds to a single write operation, so we could possibly add a "getWriteStatus()" method to the APL. It could then return "in progress", "success" or "failure". A UI could poll that until the result was success; if it returned failure then the rename/replace/retry logic could be put in place.
However this can also be solved by using an APL that supports content hashes (there wouldn't ever be any conflicts in such an APL), so we decided that it's not worth introducing the additional complexity related to the asynchronous API.
It's worth noting the Flysystem developers discarded asynchronous operation in their API.
Impact
Backwards-incompatible changes
If using the default Simple APL, we are aiming at minimising the backwards-incompatible changes when compared to the "3.x" Framework branch. There will certainly be a DB schema change to `File`.
For the Simple APL with non-default configuration, for other APLs, and for future major Framework versions the following changes are expected:
[…] `File`, `Image` and `Folder` (e.g. removal of Filename field).
`ASSETS_DIR` and `ASSETS_PATH` removed.
Security
The significant impact is that until the "secureassets" module is rewritten to support the APL and to work with the tuples there won't be any easy way to support secure files.
From the user perspective, there won't be any changes in how `File` objects handle security. AssetAdmin, GridField and Subsites should not be impacted.
The real potential for improvement lies in custom APLs which will be able to secure files in ways different from the `.htaccess` approach. The `FileReference` won't have any inherent notion of security and it will be up to the implementors of the APL to derive the security properties from the file reference tuple (especially the ParentObject).
For example, an APL, while performing a write operation, could distinguish between secure (proxied) and insecure (directly accessible) assets by inspecting the ParentObject. It'd then store the resource in an appropriate backend. It would be up to the "secure" backend to ensure the authorisation is done upon a request, e.g. by passing the request through a Controller.
Performance
The Simple APL will write to the local filesystem, so we expect the performance not to be different than before the change. The APL method calls will be synchronous.
However a special consideration needs to be given when writing APLs with long-running operations. The APL layer does not inherently support asynchronicity: writing data asynchronously leaves the backend in an unspecified state and the API does not make any provisions for introspecting this state. Instead we recommend using content hashes in your APL (see also the "Asynchronous APL API" chapter) - this allows the APL to avoid tuple clashes and unexpected overwrites of content.
Another thing to keep in mind for custom APLs is that the presence of long-running synchronous operations will impact the UI considerably. One solution here could be to execute some part of the write synchronously (i.e. to the local filesystem), but delegate the long-running writes to a queue.
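The "write locally, queue the slow part" idea could be sketched as below. This is a Python illustration only; the class, the in-memory queue and the dicts standing in for the local filesystem and the remote store are hypothetical, and a real site would hand the pending work to its job queue.

```python
from collections import deque
from typing import Deque, Dict


class QueuedWriter:
    """Hypothetical writer: synchronous local write, deferred remote write."""

    def __init__(self) -> None:
        self.local: Dict[str, bytes] = {}
        self.remote: Dict[str, bytes] = {}
        self.pending: Deque[str] = deque()

    def write(self, key: str, data: bytes) -> None:
        # The fast part happens inside the request.
        self.local[key] = data
        # The slow part is only recorded for later processing.
        self.pending.append(key)

    def drain(self) -> int:
        """Process queued remote writes; would run in a background job."""
        count = 0
        while self.pending:
            key = self.pending.popleft()
            self.remote[key] = self.local[key]
            count += 1
        return count


writer = QueuedWriter()
writer.write("a1b2/report.pdf", b"pdf-bytes")
available_locally = "a1b2/report.pdf" in writer.local
uploaded_before_drain = "a1b2/report.pdf" in writer.remote
drained = writer.drain()
```

Until the queue is drained the file exists on one node only, so this approach trades UI responsiveness against the clustered-site concerns discussed earlier.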
Scalability
This change is a prerequisite for better scalability. Shifting the assets off the local machine is required for any kind of clustered or scalable environment. That said, the work included in this RFC will not affect the scalability directly - the Simple APL will predominantly be targeted towards the local filesystem.
Maintenance & operational
Backups
With Simple APL, the backups will be done as before: a database dump and the filesystem snapshot are enough to restore all data.
However backups may not be trivial when other APLs are used. People performing maintenance will need to know how to find out which APL is in use, and act accordingly.
Migration
With Simple APL the filesystem representation will remain in place. However on the database level the migration from older sites will not be a straightforward copy. A migration script will be provided to convert the old File, Folder and Image data into the format used by the `FileReference`s.
The same situation will occur when an APL is exchanged on a project with existing data - the storage format is proprietary to the APL, so the data needs to be transformed for the new APL to understand it.
Internationalisation & localisation
We expect minimal or no modifications to the UI, so there should be no impact to i18n.
For l10n, the situation will not change either: with Simple APL the filesystem representation will remain as is. The database representation will change, but it will still hold the filenames. Other parts of the tuple do not introduce any l10n considerations.
References
Appendix A: APL interface
Here is our proposition for the initial base APL interface. Note that we expect this interface to undergo some degree of evolution. We might discover other generic methods that can be added, and we also need to decide how to provide for the delete, copy, move and "check if exists" operations.
The tuple needs to be included in all calls and is always `$hash, $variant, $filename, $parentObj`. Any values in the tuple may be set to null to indicate the caller is expecting the APL implementation to generate the values. The possibly modified tuple is also returned from the setters.
See the "Proposal" chapter for more information on the tuple.
Appendix B: FileReference class mockup
Here is our initial proposition for the base `FileReference` interface. As with the APL interface, note that we expect this interface to undergo some degree of evolution.
See the "Proposal" chapter for more information on the `FileReference`.
Appendix C: Trivial APL method
Example pseudocode for a trivial APL method to illustrate several concepts from this RFC. Note this is not code from any actual APL - the Simple APL will use Flysystem and will be more complex.