mateusz commented 9 years ago

Note to readers: the body of this RFC contains the current proposal, fully updated based on feedback. It's not necessary to read the comments; however, they might aid your understanding and contain answers to some of your questions.

RFC-1 Asset abstraction

Authors: @mateusz, @hafriedlander
Status: approved by core-committers
Version: 1.2
Meetup talk: http://vimeo.com/118548773

Purpose and outcome

This RFC proposes a change to the SilverStripe Framework to add flexibility to the asset subsystem.

The community is welcome to give feedback. The proposal will be submitted to the core-committers for approval. We expect it will evolve further during development, and it will eventually form a basis for an architecture document.

Motivation

The driver for this change is solving the following problems we keep encountering in practice:

changing root path for asset storage
support Amazon S3 as a backend
support backends for sites residing on multiple servers (clustered sites)
have files versioned
have files with workflow
be able to model private files that do not appear in the Files & Images section

We currently cannot accomplish this due to the monolithic approach to file handling in the Framework. This capability is currently expressed as a set of specialised DataObjects: File, Folder and Image, which implicitly tie the database representation to the local filesystem, preventing any flexibility.

Additionally, we self-impose the following requirements which we see as important:

ability to preserve the current assets UI
ability to connect into the existing asset storage (i.e. a degree of backwards compatibility)

Proposal

To solve these problems we propose to:

Create a backend-independent way of referring to files in the Framework.
"Degrade" files from entities in their own right (DataObjects) to model properties (DBFields).
Hide all file operations behind an interface.

Specifically, we would do so by introducing a new concept of a generalised file reference - a set of values (i.e. a tuple) uniquely pointing to a file regardless of the storage backend in use.

Then we would change the model representation of the files: instead of using the coarse-grained data objects such as File we would move to using the $db fields. For this a new DBField would be introduced, called FileReference.

The File and Folder would retain their APIs as much as possible, but be retrofitted to use the FileReference. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS.

UploadField would be updated to be able to work directly with the FileReferences, in addition to the existing has_one support.

Finally, we would remove all direct filesystem operations from the Framework and place them behind the new Asset Persistence Layer (APL) interface. This RFC does not define how an APL should work under the hood. Neither does it define the architecture nor interfaces of the storage backends. APL is tied to data stored by that APL, and changing the APL will cause data loss without a migration script.

This RFC is targeted at the next major release.

File reference tuple

File reference is a set of values (i.e. a tuple) describing a file uniquely, and independently of a storage backend. It must be generalised to allow different kinds of backends. We envisage the following values to be included in this tuple:

| | Hash | Variant | Filename | ParentObject |
|---|---|---|---|---|
| Example | "d0be2dc..." | "Resized640x480" | "logo.jpg" | A `File` reference |
| Description | sha1 of the base content | Variant of the base content. "null" for the base file | The name of the file | The specific object that contains the given tuple in its FileReference |
| Rationale | For managing content conflicts | For supporting derived files | For reconstructing the URL given a file reference and for managing naming conflicts | For reconstructing the URL and the directory structure |

The available values are not extensible. Backends may use any or all values of the tuple.

The tuple itself is not represented in the codebase directly as a class or an object, but manifests in the FileReference as stored fields and in the discrete APL operation parameters and return values.

FileReference

FileReference is a mechanism for expressing the file reference in the code and for storing it in the database. It would most likely be implemented as a packed or composite DBField, stored in table columns. The ParentObject does not need to be stored because it can be derived from the containing data object.

Additionally, the FileReference acts as a convenient proxy object to the APL interface, since it has the ability to introspect all the necessary values (APL may still be used directly, albeit more verbosely). By necessity, the FileReference API would follow the APL interface as laid out in "Appendix A: APL interface".

FileReference will deal internally with some of the plumbing required by the APL, such as passing the ParentObject reference, or storing the returned, possibly updated, tuple values (see the "Asset Persistence Layer (APL)" chapter for code samples).

$file = File::get()->byID(1);

// The following becomes deprecated, we no longer are allowed direct access:
// file_get_contents($file->Filename)

// Instead we need to operate through the FileReference:
$fileReference = $file->obj('FileReference');
$fileReference->getAsString();

FileReference can be attached to any DataObject - not only to Files.

class MyDocument extends DataObject {
    private static $db = array(
        'AttachedFile' => 'FileReference'
    );
    ...
}

If you want specific values in the tuple, you will need to initialise them before storing any data. As an example, you might want to give the APL a hint that you require a specific Filename.

$doc = new MyDocument();
// If using CompositeDBField, one could set the sub-fields directly.
$doc->AttachedFile = array(null, null, 'document.txt');
$doc->obj('AttachedFile')->setFromString('James James Morrison Morrison');

See the "Appendix B: FileReference class mockup" for the proposed layout of that class.

Asset Persistence Layer (APL)

All existing direct file operations in the Framework would be rewritten to use the APL, either indirectly through the FileReference, or directly. A concrete APL would be configured once per site using dependency injection.

APL requires all File Reference tuple values to be passed to all its operations.

Expanding on the previous example:

// Under the hood of the FileReference
function getAsString() {
    // Obtain the APL reference, most likely via DI.
    $apl = Injector::inst()->get('AssetPersistenceLayer');
    // Obtain the parent object reference.
    $parentObj = DataList::create($this->getParentClass())->byID($this->getParent());

    // Pass the tuple values as discrete parameters.
    return $apl->getAsString($this->Hash, $this->Variant, $this->Filename, $parentObj);
}

See the "Appendix A: APL interface" for the proposed interface to the APL.

Using parameters is at the APL discretion - it would be perfectly legal for an implementation to ignore some of the values as long as it can ensure uniqueness.

Additionally, APL setters may modify the tuple values passed to it to ensure consistency. Callers have to inspect the return value and make sure to update their understanding of the file accordingly. This would for example be used in the situation of filename conflicts. Also see the "Conflict resolution" chapter below.

A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. The caller can then expect the APL implementation to generate the value. This would be used when creating new files to obtain a hash.
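The calling convention described above can be sketched from the caller's side. This is a minimal illustration only: `InMemoryApl` is a hypothetical stand-in invented for this example (it is not the Simple APL or any proposed class), and it assumes the APL generates a sha1 of the data when the hash is passed as null.

```php
<?php
// Hypothetical in-memory APL used only to illustrate the calling convention.
// It is not part of the proposed Framework code.
class InMemoryApl
{
    private $blobs = array();

    // Mirrors the proposed setFromString() signature. $conflictResolution is
    // accepted but unused: this toy keys blobs by hash, variant and filename.
    public function setFromString($hash, $variant, $filename, $parentObj, $data, $conflictResolution)
    {
        // A null hash asks the APL to generate the value.
        if ($hash === null) {
            $hash = sha1($data);
        }
        $this->blobs[$hash . '/' . $variant . '/' . $filename] = $data;

        // Return the (possibly modified) tuple to the caller.
        return array($hash, $variant, $filename, $parentObj);
    }

    public function getAsString($hash, $variant, $filename, $parentObj)
    {
        return $this->blobs[$hash . '/' . $variant . '/' . $filename];
    }
}

$apl = new InMemoryApl();

// The caller passes null for the hash and must keep whatever comes back.
list($hash, $variant, $filename, $parentObj) =
    $apl->setFromString(null, null, 'document.txt', null, 'hello', 'exception');

// Later reads use the stored tuple, not the values originally passed in.
$content = $apl->getAsString($hash, $variant, $filename, $parentObj);
```

In the proposal this bookkeeping is what the FileReference would do internally, so most userspace code would not need to handle the returned tuple itself.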

Internally, an APL's storage format should be treated as proprietary to that APL. APLs are interchangeable in terms of the API, but not the data already stored - a data migration script would be necessary for that.

See "Appendix C: Trivial APL method" for some contextual pseudocode.

Storage backends

This RFC does not define how a concrete APL would work under the hood. The architecture and the interfaces of the storage backends are left to the implementor.

That said, we expect Flysystem will be useful for this.

Simple APL

A default, mostly-backwards-compatible implementation of the APL would be built as part of this RFC. With this APL, the apparent filesystem on disk would remain as it currently is, to allow easier migration of existing environments.

The Hash value in the tuple would be ignored by this APL, and most likely Flysystem will be used as the backend.

Although the Simple APL would support different backends through the Flysystem, because of the problems described in the "Asynchronous APL API" and the "Performance" chapters, we wouldn't recommend using the S3 backend here.

Other considerations

Handling directories

A directory (a "folder") is a concept that is not reflected in the APL interface. APL operations can only handle file references.

Under the hood however, a concrete APL may use the tuple values provided in the method calls to reconstruct the directories (specifically, the ParentObject might be used to recover the structure). It is entirely up to the implementation to handle this.

On the other end of the spectrum, the Files & Images' notion of "folders" will remain expressed as Folder data objects, but their FileReference would be set to null (again, that's because the APL does not handle folders explicitly).

Conflict resolution

A conflict may occur if an attempt is made to write data to a tuple which already exists. The APL caller can pass their resolution preference through the $conflictResolution parameter.

The APL may support the following resolution approaches:

"exception" - throw an APLConflict exception.
"overwrite" - overwrite the data at the tuple.
"rename" - allow the APL to change the tuple values to non-conflicting and write the data to a new location. APL will return the modified tuple.

Derived files

The Image class provides file processing services. Files derived in this way are not represented in the database and are currently written to disk to an arbitrary location by the GDBackend.

To keep this RFC focused on the file abstraction, we propose to do the minimum adjustment needed to remove the direct filesystem operations while avoiding refactoring of the processing subsystem.

The FileReference would allow us to obtain a derived file reference (which is not stored in the database).

// Obtain the derived FileReference from an existing one.
$derivedReference = $fileReference->derive("Resized640x480");

// Generate if needed.
if (some_way_of_checking_if_we_need_to_generate($derivedReference)) {
    $derivedContent = existing_derivation_function($fileReference->getAsString());
    $derivedReference->setFromString($derivedContent);
}

// Work with the new reference as normal.
$derivedReference->getAsURL();

Note the APL should not change tuple values for derived files because we have no place to store them on the consuming side.

Images in content

These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.

Rationale

How does this solve the mentioned problems

Changing root path for asset storage

With all filesystem operations abstracted away, the default APL can include a configuration parameter for setting the filesystem root path.

Amazon S3 and clustered sites

The access pattern used by an APL can be non-trivial. An example of a more sophisticated approach to persistency is a load-balanced site using S3 as a primary storage, and the filesystem as a secondary storage. Such a site would have the following problems:

S3 is eventually consistent and so the resource might not yet be there when we ask for it, especially if it has just been written.
the asset might have been uploaded to another machine in the cluster, so the file might not yet be synced back to the local filesystem.

Here is a sequence diagram illustrating a cascading operation of such an APL. The backends are internal to the APL and invisible to the userspace code.
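In code, the read side of such a cascade might look like the sketch below. Everything here is an assumption made for illustration: the `BlobBackend` interface, the `ArrayBackend` stand-in and the `cascadingRead()` function are hypothetical names, and the RFC deliberately leaves backend interfaces undefined.

```php
<?php
// Hypothetical backend interface; the RFC leaves backends undefined.
interface BlobBackend
{
    public function has($key);
    public function read($key);
    public function write($key, $data);
}

// Array-backed stand-in for both the local filesystem and S3.
class ArrayBackend implements BlobBackend
{
    private $data = array();
    public function has($key) { return array_key_exists($key, $this->data); }
    public function read($key) { return $this->data[$key]; }
    public function write($key, $data) { $this->data[$key] = $data; }
}

// Read from the fast local copy first, then fall back to primary storage
// and cache the result locally for the next request.
function cascadingRead(BlobBackend $local, BlobBackend $primary, $key)
{
    if ($local->has($key)) {
        return $local->read($key);
    }
    if ($primary->has($key)) {
        $data = $primary->read($key);
        $local->write($key, $data);
        return $data;
    }
    // Neither backend has it (yet): with eventual consistency the caller
    // may want to retry rather than treat this as a hard failure.
    return null;
}

$local = new ArrayBackend();
$primary = new ArrayBackend();
$primary->write('d0be2dc/logo.jpg', 'image-bytes');

$first = cascadingRead($local, $primary, 'd0be2dc/logo.jpg');
$cachedLocally = $local->has('d0be2dc/logo.jpg');
$missing = cascadingRead($local, $primary, 'unknown/key');
```

A real APL would also need a policy for the "not there yet" case, such as retrying or asking another node in the cluster, which this sketch leaves to the caller.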

Versioned files

The APL does not deal with file versions, but it has the potential to store many distinct files with the same name thanks to the presence of the Hash in the file reference tuple.

The FileReference serialises to the database as strings, so it may be versioned just as any other database field. This paves the way to using the Versioned extension directly on File.
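As a rough illustration of why the Hash makes this possible, the sketch below keeps two revisions of a file under the same name. The `storeRevision()` helper and the key layout are assumptions made up for this example, not a proposed storage format.

```php
<?php
// Toy content-addressed store: blobs are keyed by the sha1 of their content
// plus the filename, so two revisions named "logo.jpg" never collide.
$store = array();

function storeRevision(array &$store, $filename, $data)
{
    $hash = sha1($data);
    $store[$hash . '/' . $filename] = $data;
    return $hash;
}

$draftHash = storeRevision($store, 'logo.jpg', 'old pixels');
$liveHash = storeRevision($store, 'logo.jpg', 'new pixels');

// A versioned record only needs to remember which hash each version used.
$versions = array(1 => $draftHash, 2 => $liveHash);
$versionOne = $store[$versions[1] . '/logo.jpg'];
$versionTwo = $store[$versions[2] . '/logo.jpg'];
```

The versioning itself stays outside the store: the higher-level record keeps the hash per version, which is the split of responsibilities this RFC argues for.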

Private files

Once we switch to the File References, using File objects to manage files would no longer be obligatory. It would now be possible to bypass File and "attach" files to arbitrary DataObjects using the File Reference DBField directly.

private static $db = array(
    'AttachedFile' => 'FileReference'
);

This would mean opting out of the usual functionality - the file would no longer appear in the "Files & Images" section. Custom methods to manipulate the assets would need to be built, or the UploadField could be used to directly interface with the FileReferences.

Files with workflow

Essentially a mix of the file versioning and private files approach could be used to accomplish the workflow. Since files are now treated as model properties, we have the freedom to create new models and interfaces. We can also easily move files between the models by copying the tuple - e.g. from the "Workflow" model into the "Files & Images".

Alternative proposals

micmania1's pull request refactors the Framework so that the filesystem operations are abstracted away, but the file references are not generalised (i.e. the backend still utilises the notion of path).
Marcus's proposal is similar to this RFC in that the filesystem operations are abstracted away, and a notion of a file reference is introduced (called a pointer - "ContentId"). What differs is this solution does not involve Framework modifications and is delivered as modules.

Common objections

Targeted version was seen as 4.x. The common argument was that with a change of this size maintaining backwards-compatibility with the 3.x branch according to semver would be distracting and hard. We have changed this RFC to target 4.x.
Versioning of files was not universally seen as an important driver for this refactoring. It was thought that just decoupling the filesystem operations would be a good-enough solution (micmania1's solution would work in this case). This RFC is of the opposite opinion: a potential for file versioning is a strategic feature of the Framework.
File reference composition varied between people. Some saw it as just the hash. Some preferred additional metadata. This RFC's opinion is to settle on a minimum that allows us to deliver the goals, while still allowing for rich URLs and sharding.
File reference representation was variously seen as an object containing all tuple values, or just a string. This RFC's opinion is to use discrete values for a less obscure API.

Deferred & rejected features

Deferred: Rewriting derived files subsystem into a separate API

Deferred to keep this RFC focused. See "Derived files" chapter.

Rejected: Versioned APL

We have decided the APL is not the right layer to implement file versioning and that it should be implemented on a higher level. This RFC will make it possible to accomplish versioning of both folders and files.

See "How does this solve the mentioned problems" chapter.

Rejected: Tuple as an object

This would be too complex and ultimately make the conceptual model too convoluted.

The drawback of not doing this is that in this proposal the FileReference must act both as the tuple representation in code (with all tuple manipulation methods), and as a database field representation.

This violates the single-responsibility principle and leads to edge-case problems such as the one presented in the "Derived files" chapter, where a crippled FileReference is produced that does not represent any database entity (which is counter-intuitive, because it's a DBField after all).

Rejected: Extending the tuple with more values

We think the tuple is such a fundamental representation of the file resource that it should be fixed (just as a Varchar representation is fixed). The tuple was designed such that it should contain enough specificity to allow for all use cases listed in the "Motivation" chapter.

Allowing it to be extended with custom values would cause module compatibility problems, and we want the APLs to be interchangeable.

Rejected: Leaving derived files subsystem as it is (with direct local filesystem operations)

This wouldn't work for clustered sites. File derivation isn't always triggered on demand, which means some derived files will only be available on some local filesystems. Passing the derived files into the APL makes sure the files are always treated consistently.

For the same reason this won't work for deployments where there is no local persistent storage available (e.g. AWS deployment utilising only S3).

Rejected: Asynchronous APL API

Writes to a backend might take a long time, and it's unclear what a "rename" or "exception" conflict resolution should do if we execute a subsequent write to the same resource reference without waiting for completion. In this case we could end up losing data.

To solve this we could introduce an asynchronous API for the APL. A complete tuple corresponds to a single write operation, so we could possibly add a "getWriteStatus()" method to the APL. It could then return "in progress", "success" or "failure". A UI could poll that until the result was success; if it returned failure then the rename/replace/retry logic could be put in place.

However, this can also be solved by using an APL that supports content hashes (there wouldn't ever be any conflicts in such an APL), so we decided that it's not worth introducing the additional complexity related to the asynchronous API.

It's worth noting the Flysystem developers discarded asynchronous operation in their API.

Impact

Backwards-incompatible changes

If using the default Simple APL, we are aiming at minimising the backwards-incompatible changes when compared to the 3.x Framework branch. There will certainly be a DB schema change to File.

For the Simple APL with non-default configuration, for other APLs, and for future major Framework versions the following changes are expected:

API changes for File, Image and Folder (e.g. removal of Filename field).
Direct filesystem access no longer permitted (will break with non-default APLs).
ASSETS_DIR and ASSETS_PATH removed.
Shortcodes for Content imagery introduced - non-shortcoded images will cease to work with non-default APLs.
possibly more: to be seen during the development.

Security

The significant impact is that until the "secureassets" module is rewritten to support the APL and to work with the tuples there won't be any easy way to support secure files.

From the user perspective, there won't be any changes in how File objects handle security. AssetAdmin, GridField and Subsites should not be impacted.

The real potential for improvement lies in custom APLs which will be able to secure files in ways different from the .htaccess approach. The FileReference won't have any inherent notion of security and it will be up to the implementors of the APL to derive the security properties from the file reference tuple (especially the ParentObject).

For example, an APL, while performing a write operation, could distinguish between secure (proxied) and insecure (directly accessible) assets by inspecting the ParentObject. It would then store the resource in an appropriate backend. It would be up to the "secure" backend to ensure the authorisation is done upon a request, e.g. by passing the request through a Controller.
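One way an APL might make that distinction is sketched below. The `RequiresSecureStorage` marker interface, the two example classes and the backend names are all hypothetical, introduced only for illustration; the RFC does not prescribe how an APL derives security properties from the ParentObject.

```php
<?php
// Hypothetical marker interface: parent objects that implement it ask for
// their files to be served through an authorising proxy.
interface RequiresSecureStorage
{
}

class PublicPage
{
}

class MemberDocument implements RequiresSecureStorage
{
}

// Decide which backend a write should go to by inspecting the ParentObject.
// The backend names are placeholders, not real configuration keys.
function chooseBackend($parentObj)
{
    if ($parentObj instanceof RequiresSecureStorage) {
        return 'secure';
    }
    return 'public';
}

$forPage = chooseBackend(new PublicPage());
$forDocument = chooseBackend(new MemberDocument());
$forOrphan = chooseBackend(null);
```

A null ParentObject falls through to the public backend here; a stricter APL could just as reasonably treat unknown parents as secure by default.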

Performance

The Simple APL will write to the local filesystem, so we expect the performance not to be different than before the change. The APL method calls will be synchronous.

However a special consideration needs to be given when writing APLs with long-running operations. The APL layer does not inherently support asynchronicity: writing data asynchronously leaves the backend in an unspecified state and the API does not make any provisions for introspecting this state. Instead we recommend using content hashes in your APL (see also the "Asynchronous APL API" chapter) - this allows the APL to avoid tuple clashes and unexpected overwrites of content.

Another thing to keep in mind for custom APLs is that the presence of long-running synchronous operations will impact the UI considerably. One solution here could be to execute some part of the write synchronously (i.e. to the local filesystem), but delegate the long-running writes to a queue.
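The split between a fast synchronous write and a queued long-running write could look roughly like this. `QueuedWriter` is a hypothetical class written for this example, with arrays standing in for the local filesystem and the remote backend; a real implementation would need a persistent queue and failure handling.

```php
<?php
// Toy two-stage write: persist locally straight away, and queue the slow
// upload to remote storage for a background worker.
class QueuedWriter
{
    public $local = array();
    public $remote = array();
    private $queue;

    public function __construct()
    {
        $this->queue = new SplQueue();
    }

    // Fast path, called during the web request.
    public function write($key, $data)
    {
        $this->local[$key] = $data;
        $this->queue->enqueue($key);
    }

    // Slow path, called by a background worker. Returns the number of uploads.
    public function drain()
    {
        $count = 0;
        while (!$this->queue->isEmpty()) {
            $key = $this->queue->dequeue();
            $this->remote[$key] = $this->local[$key];
            $count++;
        }
        return $count;
    }
}

$writer = new QueuedWriter();
$writer->write('a/logo.jpg', 'bytes');

// Until the worker runs, only the local copy exists.
$remoteBeforeDrain = isset($writer->remote['a/logo.jpg']);
$uploaded = $writer->drain();
```

Between `write()` and `drain()` other servers in a cluster cannot see the file, which is the unspecified-state problem described above; content hashes limit the damage because a late upload can never overwrite different content.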

Scalability

This change is a prerequisite for better scalability. Shifting the assets off the local machine is required for any kind of clustered or scalable environment. That said, the work included in this RFC will not affect the scalability directly - the Simple APL will predominantly be targeted towards the local filesystem.

Maintenance & operational

Backups

With Simple APL, the backups will be done as before: a database dump and the filesystem snapshot are enough to restore all data.

However backups may not be trivial when other APLs are used. People performing maintenance will need to know how to find out which APL is in use, and act accordingly.

Migration

With Simple APL the filesystem representation will remain in place. However, on the database level the migration from older sites will not be a straightforward copy. A migration script will be provided to convert the old File, Folder and Image data into the format used by the FileReferences.

The same situation will occur when an APL is exchanged on a project with existing data - storage format is proprietary to the APL, so the data needs to be transformed for the new APL to understand it.

Internationalisation & localisation

We expect minimal or no modifications to the UI, so there should be no impact to i18n.

For l10n, the situation will not change either: with Simple APL the filesystem representation will remain as is. The database representation will change, but it will still hold the filenames. Other parts of the tuple do not introduce any l10n considerations.

Appendix A: APL interface

Here is our proposition for the initial base APL interface. Note that we expect this interface to undergo some degree of evolution. We might discover other generic methods that can be added, and we also need to decide how to provide for the delete, copy, move and "check if exists" operations.

The tuple needs to be included in all calls and is always $hash, $variant, $filename, $parentObj. Any values in the tuple may be set to null to indicate the caller is expecting the APL implementation to generate the values. The possibly modified tuple is also returned from the setters.

See the "Proposal" chapter for more information on the tuple.

interface AssetPersistenceLayer {

    /**
     * Write the data directly.
     *
     * The tuple may be modified by this method - it's the caller's responsibility
     * to update their data structures based on the return value. However
     * the APL should not change tuple values for derived files - i.e. if the Variant
     * value of the tuple is not null.
     *
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     *
     * @return array($hash, $variant, $filename, $parentObj) The (possibly modified) file reference tuple.
     */
    function setFromString($hash, $variant, $filename, $parentObj, $data, $conflictResolution);

    /**
     * Possible variant #1: it might be faster for the implementation to operate directly from disk.
     */
    function setFromLocalFile(..., $path, ...);

    /**
     * Possible variant #2: for large objects it would be useful to provide for direct stream handling.
     */
    function setFromStream(..., $stream, ...);

    /** 
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string Data from the file.
     */
    function getAsString($hash, $variant, $filename, $parentObj);

    /**
     * Possible variant #1: for large objects it might be useful to provide for direct stream access.
     */
    function getAsStream(...);

    /**
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL($hash, $variant, $filename, $parentObj);
}

Appendix B: FileReference class mockup

Here is our initial proposition for the base FileReference interface. As with the APL interface, note that we expect this interface to undergo some degree of evolution.

See the "Proposal" chapter for more information on the FileReference.

class FileReference implements CompositeDBField {

    /**
     * Write the data directly to the tuple represented by this FileReference.
     *
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromString($data, $conflictResolution);

    /**
     * It might be faster for the implementation to operate directly on the file,
     * so the consuming code may choose to use this instead.
     *
     * @param $path Absolute path to the file
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromLocalFile($path, $conflictResolution);

    /** 
     * Get the data from the file resource represented by this FileReference.
     * @return string Data from the file.
     */
    function getAsString();

    /**
     * Get the URL for the file resource represented by this FileReference. 
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL();

    // ... other methods implemented as required by the CompositeDBField interface.
}

Appendix C: Trivial APL method

Example pseudocode for a trivial APL method to illustrate several concepts from this RFC. Note this is not code from any actual APL - Simple APL will use Flysystem and will be more complex.

function setFromString(
    $hash,
    $variant,
    $filename,
    $parentObj,
    $data,
    $conflictResolution
) {

    // This backend does not support hashes, make sure it's null.
    $hash = null;

    // It is technically legal to pass an empty tuple value and expect the APL
    // to generate it.
    if (!$filename) $filename = generate_random_name();

    // Find out the directory from the context.
    if ($parentObj instanceof File) {
        $path = $parentObj->getParent()->getRelativePath() . '/';
    } else {
        $path = 'homeless/';
    }

    // Configurable root directory for the backend.
    $path = $this->config()->root_directory . '/' . $path;

    // Do not permit tuple changes for derived files.
    if ($variant && $conflictResolution!='overwrite') throw new Exception(...);

    // Variable conflict handling.
    if (file_exists($path . $filename)) {
        switch($conflictResolution) {
            case 'exception': 
                throw new APLConflict(...);
                break;
            case 'rename':
                $filename = $this->resolveFilenameConflict($filename);
                break;
        }
    }

    // Finally - write the data.
    file_put_contents($path . $filename, $data);

    // Return the new - possibly updated - tuple.
    return array($hash, $variant, $filename, $parentObj);
}
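The pseudocode above calls a `resolveFilenameConflict()` helper without defining it. One possible shape for it is sketched below, as a standalone function rather than a method. The counter-suffix naming scheme and the `$exists` callback parameter are assumptions made for this example, not something the RFC specifies.

```php
<?php
// Hypothetical helper: find a free name by appending a counter before the
// extension, e.g. "logo.jpg" -> "logo-2.jpg". $exists is any callable that
// reports whether a candidate name is already taken.
function resolveFilenameConflict($filename, $exists)
{
    $extension = pathinfo($filename, PATHINFO_EXTENSION);
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $suffix = ($extension === '') ? '' : '.' . $extension;

    $candidate = $filename;
    $counter = 2;
    while (call_user_func($exists, $candidate)) {
        $candidate = $base . '-' . $counter . $suffix;
        $counter++;
    }
    return $candidate;
}

// Stand-in for a file_exists() check against the backend.
$taken = array('logo.jpg' => true, 'logo-2.jpg' => true);
$isTaken = function ($name) use ($taken) {
    return isset($taken[$name]);
};

$renamed = resolveFilenameConflict('logo.jpg', $isTaken);
$untouched = resolveFilenameConflict('readme', $isTaken);
```

Passing the existence check as a callback keeps the helper independent of the backend, so the same logic could sit in front of a local directory or a remote store.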

mateusz commented 9 years ago

@dhensby This RFC is targeting branch 4 now (next major) :-)

If a DO stores a tuple of a file and I want to link the same file to another DO.

It's effectively a different file resource, because the tuple is different (specifically, the ParentObject). However see the comment above: it might still be stored as a single data blob in the backend, because some APLs will ignore ParentObject reference. So: it is different from the Framework perspective, but might be the same piece of data from the APL perspective.

Well, I guess that ties back to, how do we compare different ParentObjects for uniqueness. I don't know yet, it's a conversation that will get resolved when we try to do some coding :-)

Versioning is up to the higher-level systems (i.e. Versioned extension). Different tuple means a different file resource, so as long as it differs between versions (i.e. by Hash or Filename or whatever), you can accomplish versioning.

The Files&Images section will (hopefully) work as it currently does - we won't actually be implementing versioning there, or adding draft support. The File hierarchy will still be our "central filestore", except that now it will delegate data manipulation to the APL. There will still be no symlinking capability. If you create a File, it will refer to the data via the FileReference. If you update a File, it will need to decide if the FileReference changed.

Would it be more clear if the FileReference was called DataReference?

mateusz commented 9 years ago

Cam published the meetup talk about this RFC. Excuse the accent :-)

nyeholt commented 9 years ago

Maybe ContentReference ? Extrapolating things out, conceivably you could use a tuple format to reference any kind of content, but DataReference feels a bit too pigeon-holed.

assertchris commented 9 years ago

Other than versioning and private files: is there anything Flysystem + relative database file paths do not handle? Trying to understand the needs/priorities for the added complexity. :)

mateusz commented 9 years ago

@assertchris by definition, any asset persistency system that needs/wants to use hashes :-) But versioning is important - it's a basic premise of this RFC. If that premise is challenged, a new RFC needs to be written :-)

assertchris commented 9 years ago

@mateusz Out of my depth, in here. Moving along... :)

stojg commented 9 years ago

So besides arguing about which semantic version this goes into or what the low level API is no-one has been screaming blue murder?

mateusz commented 9 years ago

No murder. Even better: the core-committers approved this, so the games may begin. I mean implementation. I hope we can start implementing this soon! Yay! :fireworks:

nyeholt commented 9 years ago

One query from the implementation side of things... is it envisaged that there will be a set of explicit interfaces defined for a concrete implementation to be written against, or an implied interface set based on an initial concrete implementation?

clyonsEIS commented 9 years ago

No updates for a bit, is this underway or on deck soon? I think it would increase SS adoption significantly since horizontal scaling is quite important and tricky without this.

willmorgan commented 9 years ago

@clyonsEIS I've clarified the milestones and added the pretty multicolour labels, so anyone coming from UserVoice can quickly see the status.

https://silverstripe.uservoice.com/forums/251266-new-features/suggestions/6572660-asset-abstraction

tractorcow commented 9 years ago

I've created a PR for initial implementation of this feature (not including updating of the File dataobject, or things like file variants).

https://github.com/silverstripe/silverstripe-framework/issues/4599

tractorcow commented 9 years ago

Second pull request is at https://github.com/silverstripe/silverstripe-framework/pull/4652

sminnee commented 8 years ago

@tractorcow I think this has now been completed?

tractorcow commented 8 years ago

Yes this is done!
