silverstripe / silverstripe-framework

Silverstripe Framework, the MVC framework that powers Silverstripe CMS
https://www.silverstripe.org
BSD 3-Clause "New" or "Revised" License

RFC-1 Asset abstraction #3792

Closed mateusz closed 8 years ago

mateusz commented 9 years ago

Note to readers: the body of this RFC contains the current proposal, fully updated based on feedback. It's not necessary to read the comments, however they might aid your understanding and contain answers to some of your questions.

RFC-1 Asset abstraction

This RFC proposes a change to the SilverStripe Framework to add flexibility to the asset subsystem.

The community is welcome to give feedback. The proposal will be submitted to the core-committers for approval. We expect it will evolve further during development, and it will eventually form the basis for an architecture document.

Motivation

The driver for this change is solving the following problems we keep encountering in practice:

We currently cannot accomplish this due to the monolithic approach to file handling in the Framework. This capability is currently expressed as a set of specialised DataObjects: File, Folder and Image, which implicitly tie the database representation to the local filesystem, preventing any flexibility.

Additionally, we self-impose the following requirements, which we see as important:

To solve these problems we propose to:

  1. Create a backend-independent way of referring to files in the Framework.
  2. "Degrade" files from entities in their own right (DataObjects) to model properties (DBFields).
  3. Hide all file operations behind an interface.

Specifically, we would do so by introducing a new concept of a generalised file reference - a set of values (i.e. a tuple) uniquely pointing to a file regardless of the storage backend in use.

Then we would change the model representation of the files: instead of using the coarse-grained data objects such as File we would move to using the $db fields. For this a new DBField would be introduced, called FileReference.

The File and Folder would retain their APIs as much as possible, but be retrofitted to use the FileReference. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS.

UploadField would be updated to work directly with FileReferences, in addition to the existing has_one support.

Finally, we would remove all direct filesystem operations from the Framework and place them behind the new Asset Persistence Layer (APL) interface. This RFC does not define how an APL should work under the hood. Neither does it define the architecture nor interfaces of the storage backends. APL is tied to data stored by that APL, and changing the APL will cause data loss without a migration script.

This RFC is targeted at the next major release.

File reference tuple

File reference is a set of values (i.e. a tuple) describing a file uniquely, and independently of a storage backend. It must be generalised to allow different kinds of backends. We envisage the following values to be included in this tuple:

| | Hash | Variant | Filename | ParentObject |
| --- | --- | --- | --- | --- |
| **Example** | "d0be2dc..." | "Resized640x480" | "logo.jpg" | A File reference |
| **Description** | sha1 of the base content | Variant of the base content; "null" for the base file | The name of the file | The specific object that contains the given tuple in its FileReference |
| **Rationale** | For managing content conflicts | For supporting derived files | For reconstructing the URL given a file reference, and for managing naming conflicts | For reconstructing the URL and the directory structure |

The available values are not extensible. Backends may use any or all values of the tuple.

The tuple itself is not represented in the codebase directly as a class or an object, but manifests in the FileReference as stored fields and in the discrete APL operation parameters and return values.

FileReference

FileReference is a mechanism for expressing the file reference in the code and for storing it in the database. It would most likely be implemented as a packed or composite DBField, stored in table columns. The ParentObject does not need to be stored because it can be derived from the containing data object.
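As an illustration only (sub-field names and column sizes are assumptions, not part of this RFC), a 3.x-style CompositeDBField implementation might expand the tuple into sub-columns on the owning object's table:

```php
// Hypothetical sketch of the stored sub-fields.
class FileReference extends DBField implements CompositeDBField {
    public function compositeDatabaseFields() {
        return array(
            'Hash'     => 'Varchar(40)',  // sha1 of the base content
            'Variant'  => 'Varchar(50)',  // null for the base file
            'Filename' => 'Varchar(255)', // for URLs and conflict handling
        );
    }
    // ... remaining CompositeDBField methods omitted.
}
```

With this layout, a field named "AttachedFile" would produce columns such as "AttachedFileHash" and "AttachedFileFilename", following the existing composite-field convention.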

Additionally, the FileReference acts as a convenient proxy object to the APL interface, since it has the ability to introspect all the necessary values (the APL may still be used directly, albeit more verbosely). By necessity, the FileReference's API would follow the APL interface as laid out in "Appendix A: APL interface".

FileReference will deal internally with some of the plumbing required by the APL, such as passing the ParentObject reference, or storing the returned, possibly updated, tuple values (see the "Asset Persistence Layer (APL)" chapter for code samples).

$file = File::get()->byID(1);

// The following becomes deprecated, we no longer are allowed direct access:
// file_get_contents($file->Filename)

// Instead we need to operate through the FileReference:
$fileReference = $file->obj('FileReference');
$fileReference->getAsString();

FileReference can be attached to any DataObject - not only to Files.

class MyDocument extends DataObject {
    private static $db = array(
        'AttachedFile' => 'FileReference'
    );
    ...
}

If you want specific values in the tuple, you will need to initialise them before storing any data. As an example, you might want to give the APL a hint that you require a specific Filename.

$doc = new MyDocument();
// If using CompositeDBField, one could set the sub-fields directly.
$doc->AttachedFile = array(null, null, 'document.txt');
$doc->obj('AttachedFile')->setFromString('James James Morrison Morrison');

See the "Appendix B: FileReference class mockup" for the proposed layout of that class.

Asset Persistence Layer (APL)

All existing direct file operations in the Framework would be rewritten to use the APL, either indirectly through the FileReference, or directly. A concrete APL would be configured once per site using dependency injection.

APL requires all File Reference tuple values to be passed to all its operations.

Expanding on the previous example:

// Under the hood of the FileReference
function getAsString() {
    // Obtain the APL reference, most likely via DI.
    $apl = Injector::inst()->get('AssetPersistenceLayer');
    // Obtain the parent object reference.
    $parentObj = DataList::create($this->getParentClass())->byID($this->getParent());

    // Pass the tuple values as discrete parameters.
    return $apl->getAsString($this->Hash, $this->Variant, $this->Filename, $parentObj);
}

See the "Appendix A: APL interface" for the proposed interface to the APL.

How the parameters are used is at the APL's discretion - it would be perfectly legal for an implementation to ignore some of the values, as long as it can ensure uniqueness.

Additionally, APL setters may modify the tuple values passed to it to ensure consistency. Callers have to inspect the return value and make sure to update their understanding of the file accordingly. This would for example be used in the situation of filename conflicts. Also see the "Conflict resolution" chapter below.

A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. The caller can then expect the APL implementation to generate the value. This would be used when creating new files, to obtain a hash.
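For instance (an illustrative sketch - variable names are assumptions), a caller creating a brand-new file could leave the Hash for the APL to compute:

```php
// Pass null for the Hash and expect the APL to compute it from $data.
list($hash, $variant, $filename, $parentObj) = $apl->setFromString(
    null,         // Hash: to be generated by the APL
    null,         // Variant: base file
    'report.pdf', // requested Filename (may be rewritten on conflict)
    $owner,       // ParentObject
    $data,
    'rename'
);
// The caller must store the returned tuple: $hash is now filled in,
// and $filename may differ from what was requested.
```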

Internally, an APL's storage format should be treated as proprietary to that APL. APLs are interchangeable in terms of the API, but not the data already stored - a data migration script would be necessary for that.

See "Appendix C: Trivial APL method" for some contextual pseudocode.

Storage backends

This RFC does not define how a concrete APL would work under the hood. The architecture and the interfaces of the storage backends are left to the implementor.

That said, we expect that Flysystem will be useful for this.

Simple APL

A default, mostly-backwards-compatible implementation of the APL would be built as part of this RFC. With this APL, the apparent filesystem layout on disk would remain as it currently is, to allow easier migration of existing environments.

The Hash value in the tuple would be ignored by this backend, and Flysystem would most likely be used underneath.

Although the Simple APL would support different backends through the Flysystem, because of the problems described in the "Asynchronous APL API" and the "Performance" chapters, we wouldn't recommend using the S3 backend here.

Other considerations

Handling directories

A directory (a "folder") is a concept that is not reflected in the APL interface. APL operations can only handle file references.

Under the hood however, a concrete APL may use the tuple values provided in the method calls to reconstruct the directories (specifically, the ParentObject might be used to recover the structure). It is entirely up to the implementation to handle this.

On the other end of the spectrum, the Files & Images' notion of "folders" will remain expressed as Folder data objects, but their FileReference would be set to null (again, that's because the APL does not handle folders explicitly).

Conflict resolution

A conflict may occur if an attempt is made to write data to a tuple which already exists. The APL caller can pass their resolution preference through the $conflictResolution parameter.

The APL may support the following resolution approaches (see the conflict-resolution hints in the appendices):

  1. "exception" - refuse the write and raise an error.
  2. "overwrite" - replace the existing content.
  3. "rename" - adjust the tuple (e.g. the Filename) and return the updated values.

Derived files

The Image class provides file processing services. Files derived in this way are not represented in the database and are currently written to an arbitrary location on disk by the GDBackend.

To keep this RFC focused on the file abstraction, we propose to do the minimum adjustment needed to remove the direct filesystem operations while avoiding refactoring of the processing subsystem.

The FileReference would allow us to obtain a derived file reference (which is not stored in the database).

// Obtain the derived FileReference from an existing one.
$derivedReference = $fileReference->derive("Resized640x480");

// Generate if needed.
if (some_way_of_checking_if_we_need_to_generate($derivedReference)) {
    $derivedContent = existing_derivation_function($fileReference->getAsString());
    $derivedReference->setFromString($derivedContent);
}

// Work with the new reference as normal.
$derivedReference->getAsURL();

Note the APL should not change tuple values for derived files because we have no place to store them on the consuming side.

Images in content

These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.
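For example (the exact shortcode syntax shown is illustrative, not final), a hard-coded path in content would be replaced with an indirect reference that is resolved to a URL through the APL at render time:

```html
<!-- Before: a direct path, which breaks once storage is abstracted -->
<img src="assets/Uploads/logo.jpg" alt="Logo" />

<!-- After: a shortcode, resolved when the content is rendered -->
[image id="42" alt="Logo"]
```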

Rationale

How does this solve the mentioned problems

Changing root path for asset storage

With all filesystem operations abstracted away, the default APL can include a configuration parameter for setting the filesystem root path.
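As a sketch (class and option names are assumptions), such configuration could be expressed through the existing YAML config system and the Injector:

```yml
# Hypothetical configuration - names are illustrative only.
SimpleAPL:
  root_directory: '/srv/shared/assets'
Injector:
  AssetPersistenceLayer:
    class: SimpleAPL
```

This would line up with the `Injector::inst()->get('AssetPersistenceLayer')` lookup and the `$this->config()->root_directory` read shown elsewhere in this RFC.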

Amazon S3 and clustered sites

The access pattern used by an APL can be non-trivial. An example of a more sophisticated approach to persistency is a load-balanced site using S3 as a primary storage, and the filesystem as a secondary storage. Such a site would have the following problems:

Here is a sequence diagram illustrating a cascading operation of such an APL. The backends are internal to the APL and invisible to the userspace code.

Versioned files

The APL does not deal with file versions, but it has the potential to store many distinct files with the same name thanks to the presence of the Hash in the file reference tuple.

The FileReference serialises to the database as strings, so it may be versioned just like any other database field. This paves the way to using the Versioned extension directly on File.

Private files

Once we switch to the File References, using File objects to manage files would no longer be obligatory. It would now be possible to bypass File and "attach" files to arbitrary DataObjects using the File Reference DBField directly.

private static $db = array(
    'AttachedFile' => 'FileReference'
);

This would mean opting out of the usual functionality - the file would no longer appear in the "Files & Images" section. Custom methods to manipulate the assets would need to be built, or the UploadField could be used to directly interface with the FileReferences.
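A minimal sketch of such a custom access path (class, action and permission names are hypothetical):

```php
// Serve a privately attached file only to authorised members.
class MyDocumentController extends Controller {
    private static $allowed_actions = array('download');

    public function download(SS_HTTPRequest $request) {
        $doc = MyDocument::get()->byID($request->param('ID'));
        if (!$doc || !$doc->canView()) {
            return $this->httpError(403);
        }
        // Serve the bytes through the abstraction, never via a direct path.
        return $doc->obj('AttachedFile')->getAsString();
    }
}
```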

Files with workflow

Essentially, a mix of the file versioning and private files approaches could be used to accomplish workflow. Since files are now treated as model properties, we have the freedom to create new models and interfaces. We can also easily move files between models by copying the tuple - e.g. from the "Workflow" model into "Files & Images".
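Moving a file between models then amounts to copying the stored tuple - no bytes move, only the reference (a sketch; assumes a value-copy API on the DBField and a hypothetical "Workflow" model):

```php
// Promote a file from a workflow item into Files & Images.
$file = new File();
$file->FileReference = $workflowItem->obj('PendingFile')->getValue();
$file->write();
// Both records now point at the same stored content via the same tuple.
```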

Alternative proposals

Deferred to keep this RFC focused. See "Derived files" chapter.

Rejected: Versioned APL

We have decided the APL is not the right layer to implement file versioning and that it should be implemented on a higher level. This RFC will make it possible to accomplish both versioning folders and files.

See "How does this solve the problems" chapter.

Rejected: Tuple as an object

This would be too complex and ultimately make the conceptual model too convoluted.

The drawback of not doing this is that in this proposal the FileReference must act both as the tuple representation in code (with all tuple manipulation methods), and as a database field representation.

This violates the single-responsibility principle and leads to edge-case problems such as the one presented in the "Derived files" chapter, where a crippled FileReference is produced that does not represent any database entity (which is counter-intuitive, because it's a DBField after all).

Rejected: Extending the tuple with more values

We think the tuple is such a fundamental representation of the file resource that it should be fixed (just as a Varchar representation is fixed). The tuple was designed to contain enough specificity to allow for all the use cases listed in the "Motivation" chapter.

Allowing it to be extended with custom values would cause module compatibility problems, and we want APLs to be interchangeable.

Rejected: Leaving derived files subsystem as it is (with direct local filesystem operations)

This wouldn't work for clustered sites. File derivation isn't always triggered on demand, which means some derived files would only be available on one local filesystem. Passing the derived files into the APL makes sure the files are always treated consistently.

For the same reason this won't work for deployments where there is no local persistent storage available (e.g. AWS deployment utilising only S3).

Rejected: Asynchronous APL API

Writes to a backend might take a long time, and it's unclear what a "rename" or "exception" conflict resolution should do if we execute a subsequent write to the same resource reference without waiting for completion. In this case we could end up losing data.

To solve this we could introduce an asynchronous API for the APL. A complete tuple corresponds to a single write operation, so we could add a "getWriteStatus()" method to the APL. It could then return "in progress", "success" or "failure". A UI could poll it until the result was "success"; if it returned "failure", the rename/replace/retry logic could be put in place.

However, this can also be solved by using an APL that supports content hashes (there wouldn't ever be any conflicts in such an APL), so we decided it's not worth introducing the additional complexity of an asynchronous API.

It's worth noting the Flysystem developers discarded asynchronous operation in their API.

Impact

Backwards-incompatible changes

If using the default Simple APL, we are aiming at minimising the backwards-incompatible changes when compared to "3.x" Framework branch. There will certainly be a DB schema change to File.

For the Simple APL with non-default configuration, for other APLs, and for future major Framework versions the following changes are expected:

The significant impact is that until the "secureassets" module is rewritten to support the APL and to work with the tuples there won't be any easy way to support secure files.

From the user perspective, there won't be any changes in how File objects handle security. AssetAdmin, GridField and Subsites should not be impacted.

The real potential for improvement lies in custom APLs which will be able to secure files in ways different from the .htaccess approach. The FileReference won't have any inherent notion of security and it will be up to the implementors of the APL to derive the security properties from the file reference tuple (especially the ParentObject).

For example, APL, while performing a write operation, could distinguish between secure (proxied) and insecure (directly accessible) assets by inspecting the ParentObject. It'd then store the resource in an appropriate backend. It would be up to the "secure" backend to ensure the authorisation is done upon a request, e.g. by passing the request through a Controller.

Performance

The Simple APL will write to the local filesystem, so we expect performance to be no different than before the change. The APL method calls will be synchronous.

However, special consideration needs to be given when writing APLs with long-running operations. The APL layer does not inherently support asynchronicity: writing data asynchronously leaves the backend in an unspecified state, and the API makes no provisions for introspecting that state. Instead we recommend using content hashes in your APL (see also the "Asynchronous APL API" chapter) - this allows the APL to avoid tuple clashes and unexpected overwrites of content.
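The reasoning, as a standalone sketch (plain PHP, not framework code): keying storage on a content hash makes identical writes target the same location, so a repeated or concurrent write of the same bytes is a no-op rather than a conflict.

```php
// Derive a storage key from the tuple; the hash comes from the content itself.
function storageKey($data, $variant, $filename) {
    $hash = sha1($data);
    return $hash . '/' . ($variant ? $variant : 'base') . '/' . $filename;
}

$first  = storageKey('same bytes', null, 'logo.jpg');
$second = storageKey('same bytes', null, 'logo.jpg');
// $first === $second: rewriting identical content cannot clash with itself,
// while different content lands under a different hash prefix.
```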

Another thing to keep in mind for custom APLs is that the presence of long-running synchronous operations will impact the UI considerably. One solution here could be to execute some part of the write synchronously (i.e. to the local filesystem), but delegate the long-running writes to a queue.

Scalability

This change is a prerequisite for better scalability. Shifting the assets off the local machine is required for any kind of clustered or scalable environment. That said, the work included in this RFC will not affect the scalability directly - the Simple APL will predominantly be targeted towards the local filesystem.

Maintenance & operational

Backups

With Simple APL, the backups will be done as before: a database dump and the filesystem snapshot are enough to restore all data.

However backups may not be trivial when other APLs are used. People performing maintenance will need to know how to find out which APL is in use, and act accordingly.

Migration

With Simple APL the filesystem representation will remain in place. However, on the database level the migration from older sites will not be a straightforward copy. A migration script will be provided to convert the old File, Folder and Image data into the format used by FileReferences.

The same situation will occur when an APL is exchanged on a project with existing data - storage format is proprietary to the APL, so the data needs to be transformed for the new APL to understand it.

Internationalisation & localisation

We expect minimal or no modifications to the UI, so there should be no impact to i18n.

For l10n, the situation will not change either: with Simple APL the filesystem representation will remain as is. The database representation will change, but it will still hold the filenames. Other parts of the tuple do not introduce any l10n considerations.

References

  1. UserVoice: Files should be stored in DataObject fields
  2. UserVoice: Filesystem Abstraction
  3. GitHub: micmania1's pull request
  4. dev-list: Storing files in dataobject fields thread
  5. dev-list: Historical ASSET_BASE_PATH thread
  6. dev-list: Historical Silverstripe (Assets) vs. Scaling in the Cloud thread
  7. Flysystem

Appendix A: APL interface

Here is our proposition for the initial base APL interface. Note that we expect this interface to undergo some degree of evolution: we might discover other generic methods that can be added, and we also need to decide how to provide for the delete, copy, move and "check if exists" operations.

The tuple needs to be included in all calls and is always $hash, $variant, $filename, $parentObj. Any values in the tuple may be set to null to indicate the caller is expecting the APL implementation to generate the values. The possibly modified tuple is also returned from the setters.

See the "Proposal" chapter for more information on the tuple.

interface AssetPersistenceLayer {

    /**
     * Write the data directly.
     *
     * The tuple may be modified by this method - it's the caller's responsibility
     * to update their data structures based on the return value. However
     * the APL should not change tuple values for derived files - i.e. if the Variant
     * value of the tuple is not null.
     *
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     *
     * @return array($hash, $variant, $filename, $parentObj) The (possibly modified) file reference tuple.
     */
    function setFromString($hash, $variant, $filename, $parentObj, $data, $conflictResolution);

    /**
     * Possible variant #1: it might be faster for the implementation to operate directly from disk.
     */
    function setFromLocalFile(..., $path, ...);

    /**
     * Possible variant #2: for large objects it would be useful to provide for direct stream handling.
     */
    function setFromStream(..., $stream, ...);

    /** 
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string Data from the file.
     */
    function getAsString($hash, $variant, $filename, $parentObj);

    /**
     * Possible variant #1: for large objects it might be useful to provide for direct stream access.
     */
    function getAsStream(...);

    /**
     * @param $hash string|null Tuple part 1
     * @param $variant string|null Tuple part 2  
     * @param $filename string|null Tuple part 3
     * @param $parentObj Object|null Tuple part 4
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL($hash, $variant, $filename, $parentObj);
}

Appendix B: FileReference class mockup

Here is our initial proposition for the base FileReference interface. As with the APL interface, note that we expect this interface to undergo some degree of evolution.

See the "Proposal" chapter for more information on the FileReference.

class FileReference implements CompositeDBField {

    /**
     * Write the data directly to the tuple represented by this FileReference.
     *
     * @param $data Binary data to set on the file reference.
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromString($data, $conflictResolution);

    /**
     * It might be faster for the implementation to operate directly on the file,
     * so the consuming code may choose to use this instead.
     *
     * @param $path Absolute path to the file
     * @param $conflictResolution string Conflict resolution hint (exception|overwrite|rename)
     */
    function setFromLocalFile($path, $conflictResolution);

    /** 
     * Get the data from the file resource represented by this FileReference.
     * @return string Data from the file.
     */
    function getAsString();

    /**
     * Get the URL for the file resource represented by this FileReference. 
     *
     * @return string URL for the data blob that can be fetched directly by the user.
     */
    function getAsURL();

    // ... other methods implemented as required by the CompositeDBField interface.
}

Appendix C: Trivial APL method

Example pseudocode for a trivial APL method, to illustrate several concepts from this RFC. Note this is not code from any actual APL - the Simple APL will use Flysystem and will be more complex.

function setFromString(
    $hash,
    $variant,
    $filename,
    $parentObj,
    $data,
    $conflictResolution
) {

    // This backend does not support hashes, make sure it's null.
    $hash = null;

    // It is technically legal to pass an empty tuple value and expect the APL
    // to generate it.
    if (!$filename) $filename = generate_random_name();

    // Find out the directory from the context.
    if ($parentObj instanceof File) {
        $path = $parentObj->getParent()->getRelativePath() . '/';
    } else {
        $path = 'homeless/';
    }

    // Configurable root directory for the backend.
    $path = $this->config()->root_directory . '/' . $path;

    // Do not permit tuple changes for derived files.
    if ($variant && $conflictResolution!='overwrite') throw new Exception(...);

    // Variable conflict handling.
    if (file_exists($path . $filename)) {
        switch($conflictResolution) {
            case 'exception': 
                throw new APLConflict(...);
                break;
            case 'rename':
                $filename = $this->resolveFilenameConflict($filename);
                break;
        }
    }

    // Finally - write the data.
    file_put_contents($path . $filename, $data);

    // Return the new - possibly updated - tuple.
    return array($hash, $variant, $filename, $parentObj);
}
marijnkampf commented 9 years ago

Housekeeping note, should Flysystem https://github.com/thephpleague/flysystem be included in References?

willmorgan commented 9 years ago

Looks like a nice RFC. Some of my thoughts:

If accepted and developed then this will make a lot of lives easier. :+1:

dhensby commented 9 years ago
  1. I believe we need to be able to run multiple APLs on one site. I foresee that developers may want to store some files on a local filesystem and some remotely
  2. 3.x seems an unrealistic release milestone if we are identifying API breakages at this early stage
  3. I think the parameters defined in the APL interface are too rigid and don't leave us future flexibility without irritating API changes. The obvious way to solve it is using a tuple object (which also makes this solution more customisable) but I see this option was rejected. I think this should be reconsidered.

@willmorgan we can't drop 5.3 support on a 3.x release

oddnoc commented 9 years ago

Although I am all for asset abstraction, I am raising 2 issues pertaining to backward compatibility.

  1. This RFC violates semantic versioning by proposing backward-incompatible changes on the 3.x branch. There needs to be a 100% compatibility layer, or this RFC must be deferred to 4.x.
  2. A migration script will need to be provided to convert old filesystem paths in content to shortcodes.

I'm being a stickler about this because it is going to be well-nigh impossible to explain to clients that we have to burn a lot of hours rewriting a lot of code and massaging their content because we want them to upgrade to SilverStripe 3.2 from 3.1. Keep in mind that these same clients will have recently had substantial expenses upgrading from 2.4. Also, the project has publicly committed to semantic versioning, so it is perplexing that the first RFC isn't in compliance.

stevie-mayhew commented 9 years ago

I have to agree with @oddnoc - I don't see any way in which this can be in a 3.x release - it's a major change to an API. I thought that we were adopting semantic versioning once the PDO release has come out, which I assumed would be a 3.2 release under the current schema. That would make this a 4.x release at best.

I'm also worried about the continued persistence in maintaining PHP 5.3 support and this is a perfect example - having to resolve a third party's non-support of 5.3 isn't a trivial matter (who knows what they think about it!)

hafriedlander commented 9 years ago

My expectation is that as long as you use the simple APL with the local filesystem, the system would be backwards compatible with one exception: there will be a schema change. Upgrading to the new schema would be automate-able. I'm not sure if that schema change is enough to trigger requiring a new major version. If you use anything except the simple APL, or anything except the local filesystem then you would definitely need to change code too, but I don't see this as breaking backwards compatibility. If it turned out we couldn't maintain backwards compatibility with the simple APL and local filesystem, I agree this would need to go into 4.0, but I don't see we need to pre-emptively decide that.

Regarding minimum PHP version, semver is why PHP needs to stay at 5.3. We can't change the minimum version for 3.2, because that would be a backwards-incompatible change. If this goes into 4.0 instead, and we decide that the minimum PHP version should go up for 4.0 (which I think it should, and it's likely it will, but that should be a deliberate decision, not be decided indirectly by an RFC), then we would adjust the RFC's expectations.

Regarding the APL interface's flexibility: the interface itself is a public API. Having an old APL implementation need upgrading when the API changes is a feature, not a bug. If we make it flexible, then how do we manage compatibility across multiple APLs? How do we flag when a new version of the framework's expectations of an APL have changed?

Regarding syncing files from a file source (remote or local): you could write a script to do that if you liked, but it's outside the scope of this RFC & the framework's responsibilities. Depending on the APL, it might not be possible to re-construct the database from the filestore. For instance, in the hypothetical versioned store, where everything is a hash, there are no directories. After this change, the reference is only guaranteed one way (database -> filestore) and is lossy.

I've seen no use case for sharding filestores, and allowing it significantly complicates the API, but if there's specific use cases you can think of that would make it desirable I'd be keen to hear them. Note that as designed it does allow files to be stored both locally and remotely, just not for some files to be one way and some to be another.

Hamish Friedlander | CTO SilverStripe

mateusz commented 9 years ago

Thanks for feedback everyone :-) The targeted Framework version seems to be the most controversial issue.

I have the following proposition: let's delay that decision. It's hard to say at this point what backwards-incompatible changes would need to be introduced exactly anyway. I think we should simply leave our options open: if the dev team manages to solve all BC issues during implementation (which seems technically possible), then we'll submit it against 3.x in accordance with semver. If not -> 4.x. I'll rewrite the BC section to this effect.

@oddnoc, @stevie-mayhew, @dhensby is the above approach acceptable?

Also, people have provided a couple of opinions, e.g. whether the tuple should be an object, or whether we should support multiple APLs. There is a fine line to tread there: it's like the tabs vs spaces discussion. My belief is that the best approach is to let the RFC writer's opinion prevail, because it's easiest - as long as that solution works. It doesn't need to be perfect, it just needs to work for enough people.

@dhensby, clarification regarding the "multiple APLs": you don't need to support parallel APLs to support multiple backends - see the "Amazon S3 and clustered sites" chapter. APL is designed to be more like a singleton - it's "the" abstraction layer. You can put whatever you want underneath :-)

@willmorgan, re syncing files from a remote source, we will only get "Sync files & assets" to work in a backwards-compatible way with the Simple APL. Not sure if reflection should be built into the APL interface. What I'm saying is there is a limit to what we can architect up front - and I think we are near that limit.

@oddnoc re migration script, if you use Simple APL, you won't need to migrate the data in the Content - both representations will happily live side by side. We don't rule out writing a migration script, but it's not necessary for BC.

@marijnkampf added, thanks!

edit: So the "proposal" now says:

We don't commit to any target Framework versions for this feature. However we would like to submit this against "3.x" if we can resolve all breaking changes by introducing a Simple APL.

The "backwards-incompatible changes" chapter has also been tweaked accordingly.

edit2: Even better wording for the "proposal":

We would endeavour to write a backwards-compatible solution that could be included in a minor release of 3.x. Once the solution is implemented we would open it up for review and get consensus that it was backwards compatible. If not, then the change would be incorporated into 4.x instead.

stevie-mayhew commented 9 years ago

Yes, that makes complete sense. I think that having a "target version" for an RFC is a good idea, but as you say it shouldn't preclude a version bump if the end result becomes overly complex or obtuse due to attempts to keep within that constraint.

sminnee commented 9 years ago

I think that schema changes are okay for minor releases, as long as running dev/build takes care of everything for you, as that is understood as a standard silverstripe post-update action.

I'd also expect that you don't need to introduce any new configuration, which would mean that, if no asset backend configuration is included, the system would assume you're using a simple asset backend, pointing to the assets directory, and using a PHP 5.3-compatible non-Flysystem filestore (I'd probably recommend a class that deals only with local files and conforms to the Flysystem API). Whether or not it's worth going to that effort to get this into 3.x is up to the opinion of the implementor.

It's still "dangerous" and I think the implementation would need broad, rigorous scrutiny to see if it was actually viable for 3.x, and unless we got consensus that it was, it would go into 4.0.

I'd like to see it in 3.x, but it's much more important that we honour our commitment to semantic versioning.

sminnee commented 9 years ago

I think the parameters defined in the APL interface are too rigid and don't leave us future flexibility without irritating API changes. The obvious way to solve it is using a tuple object (which also makes this solution more customisable) but I see this option was rejected. I think this should be reconsidered.

The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide. Changing the parameters that get passed to uniquely identify an asset will require that all callers and backends get updated. Under the current model, this will be obvious: method signatures will change for the callers, and the interface that backends implement will change, causing parse-time errors. If we have a tuple, things might still work on the surface, but bugs may appear - and since it's to do with asset persistence, these could be data-loss-causing bugs.

A tuple object would make the method signatures a bit more DRY, so maybe there are ways of addressing my concern while still using tuples. Essentially, you need to ensure that the code chokes quickly until it's updated to support the new tuple format, which is a little counter-intuitive, as I'm essentially asking that we make upgrading more difficult. ;-)
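One way to get that "choke quickly" behaviour while still using a tuple object would be to version the tuple's schema and have each backend assert the version it understands. A minimal sketch of the idea - all names here (FileTuple, assertSupportedBy, the schema constant) are invented for illustration, not part of the RFC:

```php
<?php
// Hypothetical sketch: a tuple value object that fails fast when the
// schema it carries no longer matches what a backend expects.
final class FileTuple
{
    const SCHEMA_VERSION = 2;

    public $filename;
    public $hash;
    public $variant;
    private $schemaVersion;

    public function __construct($filename, $hash, $variant = null)
    {
        $this->filename = $filename;
        $this->hash = $hash;
        $this->variant = $variant;
        $this->schemaVersion = self::SCHEMA_VERSION;
    }

    // A backend declares which schema it understands; mismatches throw
    // immediately instead of silently persisting misinterpreted data.
    public function assertSupportedBy($backendSchemaVersion)
    {
        if ($backendSchemaVersion !== $this->schemaVersion) {
            throw new InvalidArgumentException(sprintf(
                'Backend supports tuple schema %d, got %d',
                $backendSchemaVersion,
                $this->schemaVersion
            ));
        }
    }
}

$tuple = new FileTuple('photo.jpg', 'g4af3204', 'resized');
$tuple->assertSupportedBy(2); // current backend: fine

try {
    $tuple->assertSupportedBy(1); // an outdated backend chokes quickly
    echo "no error\n";
} catch (InvalidArgumentException $e) {
    echo "rejected\n";
}
```

This keeps the DRY signatures of a tuple while making an out-of-date backend fail at the first call rather than on the surface appearing to work.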

nyeholt commented 9 years ago

So I had a draft written up on the mailing list thread that I never actually hit 'send' on, so will paste below. As mentioned in the RFC there's been some discussion on this previously, but we've actually built a working set of modules (and the s3 layer for it) that meets a significant portion of the requirements. It was built as a module with an eye on replacing the File/Folder implementation completely, but the scope of re-writing all that too would have pushed past the deadlines we had at the time.

The RFC in general has quite a lot of similarities to work already done (file reference identifier, an APL-type service, underlying file operation abstraction via Flysystem - which unfortunately wasn't around when I first started on my stuff!), so there's not much I can fault with the high-level design (except of course the version requirement - PHP 5.4 & SS 4.0 seem much more realistic!). The APL interface does seem a little awkward to me, but I'll need to digest things a bit more before commenting further on that.

Anyway, the following are some points based on the work already done

S3Bucket:||g4a/g4af3204etc/filename.jpg

But if I've also got something uploaded separately to a path in that same S3 location (eg other/files/generated/by/other/application/file.jpg) I can simply reference it like

S3Bucket:||other/files/generated/by/other/application/file.jpg

This kind of referencing is very handy for the following point

// in this case, the $service figures out what S3Bucket means
$reader = $service->findReaderFor('S3Bucket:||other/files/generated/by/other/application/file.jpg');

$writer = new FilesystemWriter('some/path/locally.jpg');
$writer->write($reader);
  S3Service:
    constructor:
      key: {my_key}
      secret: {secret}
      region: ap-southeast-2
  S3ContentReader:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: public-content
      baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
  S3ContentWriter:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: public-content
      baseUrl: https://public-content.s3-ap-southeast-2.amazonaws.com
  PrivateS3ContentReader:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: private-content
      baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
  PrivateS3ContentWriter:
    type: prototype
    properties:
      s3service: %$S3Service
      bucket: private-content
      baseUrl: https://private-content.s3-ap-southeast-2.amazonaws.com
  ContentService:
    properties:
      stores:
        File:
          ContentReader: FileContentReader
          ContentWriter: FileContentWriter
        S3DevBucket:
          ContentReader: S3ContentReader
          ContentWriter: S3ContentWriter
        PrivateBucket:
          ContentReader: PrivateS3ContentReader
          ContentWriter: PrivateS3ContentWriter
oddnoc commented 9 years ago

@mateusz Thanks for the followup. The approach of targeting 3.x if and only if backward compatibility is retained is completely acceptable.

chillu commented 9 years ago

Wow, massive effort guys - and very well written! I largely agree with the proposal, just some curve balls to consider below.

Would you consider a third party DMS to be a viable APL backend? It would complicate the APL in terms of inferring a folder hierarchy based on externally stored data (rather than Folder), and require the notion of a read-only filesystem. I assume that it's out of scope and should rather be supported by a more generic file picker for the CMS, but want to make sure we've got our bases covered.

Related to this: We're assuming that SilverStripe is the only consumer of a specific file storage service, right? Given that organisational data (hierarchies, tags) is only optionally stored in the local database (File, Folder), external data essentially becomes anonymous hashed blobs. That makes it harder for other systems to access the same external data in a meaningful way; they'll have to go through some SilverStripe web service.

Sharded stores: One use case could be different backup policies in an S3 store. One bucket might be used for small assets hooked up to CloudFront, another bucket for more important documents which are backed up to Glacier. While this could be solved through cascading backends, one of these will always carry the performance penalty of the failover check. It'll add significant API surface, so not sure it's worth the effort.

Do you see a use for ParentObject outside of the built-in Folder structure? Nothing in core could rely on it since the object would be implementation specific (e.g. an S3 folder reference).

I'm aware that the APL interface is an MVP at the moment, so I'll hold off asking about stream support and such.

mateusz commented 9 years ago

@nyeholt Thanks for posting, happy to hear you find the high-level design watertight :-)

I must admit I'm struggling to confidently build a mental picture of your architecture just from the code so I can't appreciate your design in full. Maybe you would like to write up a short architecture document, i.e. a mini-RFC, for your proposition?

Can you confirm my understanding though: your solution seems less abstract than this RFC in that it works only with the storage backends, and does not define any intermediate layer to decouple file manipulation and representation from the core (i.e. bottom-up approach). So we'd need to come up with an appropriate abstraction for the Framework to encapsulate all the backend operations anyway (and we would end up with something like the APL).

This RFC comes from the opposite perspective: it works out the intermediate layer's interface, but does not define anything about the backends (i.e. top-down approach). It seems to me that you would be able to implement your proposition as a concrete APL, would you not?

Now to the specific points raised.

The Model level representation (Folder / File objects) shouldn't dictate the file system representation

Correct, it doesn't. In this RFC the storage is decoupled completely. You can build new data object structure based on the APL interface without using Folder & Files. APL doesn't care about the Folders either - it uses the ParentObject as a generalised reference to a container.

File operations (eg resampling, doc conversions etc) need to either a) have a simple way to copy the item locally, then store back in the CDN, in a way that's simple to look up again

Yes, you use the "Variant" field of the tuple to store variants of a file. In other words, the APL is aware of the variants and may optimise for it.

It's very handy to be able to make use of existing remote objects via the same API

This comes at a cost of having to know the storage details of files, and hence at a cost of coupling the storage representation to your code. Nevertheless no-one will stop people from bypassing the APL and interacting directly with the backends, perhaps with some nominal support from the APL (i.e. getBackendIdentifier() to fetch the internal identifier).

It should also be simple to copy between items, regardless of file backend

This can be done within the concrete APL, we don't limit what functionality the APL provides, but it needs to be encapsulated within.

It would be really good if 'secure' file management was a native part of things

As above, this is up to the APL implementation. It seems better not to force the solution to natively use "secureBackend" and "insecureBackend", because some APLs might be able to support security within a single backend.

I encode the specific backend store that an item is located in the filepointer

Again, this couples the storage representation to the Framework.

Thanks again for raising concerns :-)

mateusz commented 9 years ago

@chillu Welcome to the RFC and thanks for helping to bullet-proof the solution ;-)

Would you consider a third party DMS to be a viable APL backend?

Maybe I'm missing something, but I don't see any problem with writing a DMS-supporting APL. The APL interface is "directory-agnostic" (see "Handling directories" chapter). As long as you can describe your file objects via the tuple, you can build the APL however you want. For example the Simple APL will infer the directory tree from the ParentObjects, but an S3 APL would perhaps ignore the directory tree and work only with Hashes.

For readonly, how about throwing Exceptions from the write operations?

We're assuming that SilverStripe is the only consumer of a specific file storage service

Kind of. You are right in assuming the DB is the canonical store for the file metadata. However, since the APL specifics are undefined, you are free to store this metadata in your backend and support other clients in this way. This means you will need to handle the synchronisation issues (which would land you similar problems as with the current asset system, but it would at least be encapsulated within the APL this time).

On the practical side, you should be able to derive all the information you need from the tuple (specifically, the ParentObject).

Sharded stores

I think this can still be implemented within the APL. You can for example derive the shard ID from the Filename. The only question that remains in my mind is whether we need some hinting mechanism to tell the APL which backend we prefer (i.e. is there enough data in the tuple for the APL to derive the appropriate backend, or do we need some more information?).
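To illustrate, deriving the shard from data already in the tuple could be as simple as a stable hash over the Filename. This is only a sketch; the function name and the two-bucket split are invented:

```php
<?php
// Sketch: pick a backend shard deterministically from the Filename,
// so no extra hinting field is needed in the tuple.
function shardForFilename($filename, array $shards)
{
    // A stable hash of the filename maps it to the same shard every time
    $index = hexdec(substr(sha1($filename), 0, 8)) % count($shards);
    return $shards[$index];
}

$shards = array('cloudfront-bucket', 'glacier-bucket');
$a = shardForFilename('report.pdf', $shards);
$b = shardForFilename('report.pdf', $shards);
var_dump($a === $b); // sha1 is deterministic, so always the same shard
```

The limitation of this approach is visible too: the mapping is a pure function of the tuple, so policy-based sharding (e.g. "important documents go to Glacier") would indeed need extra information, such as the ParentObject or a hint.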

Do you see a use for ParentObject outside of the built-in Folder structure?

Yes, my assumption is ParentObject can be any Object representing the context (or can be empty). The abstract APL doesn't care about the Folders. The Folder is a construct used by File & Images to structure the files in the CMS via the database. A concrete APL might store files in a similar tree on the filesystem (read up on Simple APL), but it does not guarantee to do so.

The corollary is you can build your own Files & Images section where there is no folders, or folders are in reverse order :-) APL doesn't care.

Stream support

My knowledge is vague here. I assume this is for serving video streams? If yes, then that should work fine because, again, APL is abstract, and the URLs don't need to point back to SilverStripe at all.

Say you link a video resource into your CMS Content. A shortcode will be injected. Then when a user requests the page, APL::getAsURL gets called based on the tuple from the shortcode. APL verifies that this tuple is a streamable resource (i.e. by looking at the extension in Filename, or inspecting ParentObject), and is free to return a direct URL to a box with streaming capability.
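A toy sketch of that getAsURL decision follows. The extension list, hostnames and function shape are all invented for illustration; a real APL would likely inspect the ParentObject as well:

```php
<?php
// Sketch: an APL returning a direct streaming-server URL for streamable
// resources, and a normal asset-store URL for everything else.
function getAsURL($filename, $hash)
{
    $streamable = array('mp4', 'webm', 'm3u8');
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    if (in_array($ext, $streamable, true)) {
        // Hand off to a box with streaming capability
        return "https://stream.example.com/$hash/$filename";
    }

    // Otherwise serve from the regular asset store
    return "https://assets.example.com/$hash/$filename";
}

echo getAsURL('intro.mp4', 'g4af3204'), "\n";
echo getAsURL('logo.png', '11aa22bb'), "\n";
```

The point is that because the URL construction is encapsulated in the APL, the returned URL never needs to point back at SilverStripe.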

Uff. I hope that answers your questions? :-) Keep asking if you have anything more.

hafriedlander commented 9 years ago

@chillu One thing this solution doesn't do at all is deal with the case where some other file store wants to be the truth about what files exist and have SilverStripe be a client of that truth. So a DMS could be used as the backend store, but there's no support for having SilverStripe interrogate that store to find out what files exist. In fact, it makes that slightly harder, because it removes the single "all files live here" property that Files & Images used to have.

nyeholt commented 9 years ago

Maybe you would like to write up a short architecture document, i.e. a mini-RFC, for your proposition?

Yep, it's much simpler than you're possibly guessing - it'll be worth putting together even if just to highlight the kinds of things I'd like to be able to do... some of which I'll put inline below :). I think the main bit that makes it confusing is that the cdn-content module is working within the framework's extension points, rather than being cleanly defined APIs - and that there's a lot of stuff in there to translate between raw file i/o operations and the API abstractions.

works only with the storage backends, and does not define any intermediate layer to decouple file manipulation and representation from the core

The Framework still makes use of File objects for assets, but effectively the only link between the framework and the backend is the file reference (eg Label:||identifier/string/name.jpg) and the ContentService; the content service chooses wrapper objects representing 'read' and 'write' actions around the specific backend based on the Label part of the reference; these wrappers then defer to the specific backends for actually performing operations. So to get the URL for an actual file object:


$file = File::get()->byID(2);
$reader = $file->getReader();
$url = $reader->getURL();

or, based on a FileContent db field on a data object


class MyObject extends DataObject {
    private static $db = array(
        'MyFile'    => 'FileContent',
    );

    public function FileLink() {
        $reader = $this->MyFile->getReader();
        return $reader->getURL();
    }
}

or, just from an identifier (which is what's under the cover of the above two)

$identifier = 'NamedSource:||awef/seg35w/file.jpg';
$reader = $contentService->getReader($identifier);
$url = $reader->getURL();

So from a framework level, the separation is abstracted at the ContentService point; at that level, you're dealing just with string identifiers going in, and receiving read/write wrappers back (implementers of underlying file I/O functionality). To my understanding, where the RFC has the APL as the abstraction point, in the content-services module that is explicitly split across 3 separate abstractions; the ContentService deals with finding an appropriate backend for a given reference, and is responsible for handling things like sharding, (and implementation could be swapped to handle cascading lookups), whereas the Reader/Writer pair are responsible for exposing the actual operations that can be performed on a file.

From my read of things, the equivalent to the ContentService type of structure is a hierarchical APL type of structure, where the AssetPersistenceLayer retrieved from DI would be an instance of something like SelectivePersistenceLayer, where its getAsString and setAsString methods would then load up the specific instances of a ContentReader/Writer and use those to hook into the relevant backend layers to retrieve bits and pieces.

Part of my motivation for having the explicitly separate interfaces is so that user-land code can operate on content items with a consistent API (see further below).

hence at a cost of coupling the storage representation to your code ... perhaps with some nominal support from the APL (i.e. getBackendIdentifier() to fetch the internal identifier

Yep, that would be as far as I'd expect things to be; the main thing I'd be hoping for is that I could load into a File asset using a consistent API without having to translate between raw operations. eg


// passed in via config/form or something
$remoteItemId = 'RemoteStore:||known/path.jpg';

$remoteReader = $contentService->getReader($remoteItemId);

$file = File::get()->byID(2);
$file->getWriter()->write($remoteReader);

But it's not just paths, the same API should hold for raw content


$file = File::get()->byID(2);
$file->getWriter()->write(new RawContentReader("some kind of string here"));
$file->getWriter()->write(new StreamContentReader(fopen('http://hax.ru/totally_legit.php', 'r')));

And to cover Hamish's point about remote content stores making content available, it'd be really nice if the API would provide this capability, if for no other reason than importing from that remote store. I've got the following in place in the content-services module, though it's not completely thought through just yet.


// the remote folder ID is a known structure in the remote store, passed in via config/form or something
$remoteFolderId = 'RemoteStore:||known/path';

$remoteFolder = $contentService->getReader($remoteFolderId);
foreach ($remoteFolder->getList() as $remoteReader) {
    $file = File::create();
    $file->Title = basename($remoteReader->getId());
    $file->getWriter()->write($remoteReader);
    $file->write();
}

This can be done within the concrete APL, we don't limit what functionality the APL provides, but it needs to be encapsulated within.

True, my point is that I'd like to see interfaces defined in such a way that this is mandated so that user-land code has a consistent understanding of how to copy a file from one location to another (as in the examples above).

As above, this is up to the APL implementation. It seems better not to force the solution to natively use "secureBackend" and "insecureBackend", because some APLs might be able to support security within a single backend.

Agreed, I guess my point is more that the 'problem' of secure files should be a framework handled thing, rather than a module necessarily.

Again, this couples the storage representation to the Framework.

Not really - by specific backend store, what I mean is a label that points at something configured via DI, which could actually be an instance of Anything, and interpreted in any way that Anything chooses. So while S3Bucket:||aw3/aw3eshe34/file.jpg might initially be configured to dump into an S3 store somewhere using the S3ContentReader/Writer implementation, I might at some point copy those files to a local directory and choose to instead configure S3Bucket to use the FileContentReader/Writer implementation. Or I might configure S3Bucket to use a CascadeReader/Writer pair that first looks in the local filesystem, falling back to S3 if it can't find it.
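That cascade idea could be sketched like this. The interfaces are minimal stand-ins rather than the actual content-services API, and InMemoryReader exists purely to make the example self-contained:

```php
<?php
// Sketch: a reader that tries each configured backend in order,
// e.g. local filesystem first, then S3.
interface ContentReaderSketch
{
    public function exists();
    public function read();
}

class CascadeReader implements ContentReaderSketch
{
    private $readers;

    public function __construct(array $readers)
    {
        $this->readers = $readers;
    }

    public function exists()
    {
        foreach ($this->readers as $reader) {
            if ($reader->exists()) {
                return true;
            }
        }
        return false;
    }

    public function read()
    {
        foreach ($this->readers as $reader) {
            if ($reader->exists()) {
                return $reader->read();
            }
        }
        throw new RuntimeException('Content not found in any backend');
    }
}

// Stand-in backend so the sketch runs without a filesystem or S3
class InMemoryReader implements ContentReaderSketch
{
    private $content;
    public function __construct($content) { $this->content = $content; }
    public function exists() { return $this->content !== null; }
    public function read() { return $this->content; }
}

// The "local" store is empty, so the cascade falls through to the "remote" copy
$reader = new CascadeReader(array(
    new InMemoryReader(null),
    new InMemoryReader('remote bytes'),
));
echo $reader->read(), "\n"; // remote bytes
```

Because the cascade lives behind the same reader interface, callers and the label configuration don't change when the lookup strategy does.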

Your comment in response to Ingo's Sharded stores question is essentially this implementation, wherein the shard info is encoded in the identifier. The ContentService interprets this shard information to find a backend that will handle the Label part of the identifier.

Anyway - will try and put together a diagram and the specific interfaces separately and link them a bit later today.

spronkey commented 9 years ago

Please take my comments with a grain of salt, as I think the proposal as written would suit most use cases; however, I do have a few things to note.

First, this is a very complex proposal, that ends up actually not solving much more than, for example, micmania1's pull request. Complexity is added to support versioning and workflow, but then they are out of the scope of this RFC. I also wonder whether these would be better solved in other ways? Versioned files, for example, are possible to retrofit (in a way) within the current framework - e.g. by introducing an intermediary data object that represents the versioned file, and just storing a new file for every version.

I don't really see why swapping to a strange mashed up "File Reference" field is a positive step - it's just screaming out to be an object in its own right, stored in its own table. Solving the versioning and workflow on DataObjects in general would then be extendable to solve versioning of files. Unless I'm just not getting it?

On the 5.4 issue with Flysystem, why even bother looking at 5.3 compatibility? It's just going to be a waste of time, and it seems like the fact that Flysystem requires 5.4 should make the decision easy to go with SS4.x and bump version requirements.

nyeholt commented 9 years ago

And the separate architecture diagram / interface description at https://github.com/silverstripe-australia/silverstripe-content-services/wiki/Architecture

You'll notice that the diagram is very familiar... :)

mateusz commented 9 years ago

@nyeholt nice diagram, I've seen it somewhere ;-)

I think the biggest issue is that you are ignoring the versioning. Native versioning in the Framework is an important design goal for us, a strategic choice if you will. To make files first-class data types, they must support this natively as well. Otherwise files might as well remain DataObjects, and we could use something akin to @micmania1's solution.

Your file identifier assumes that the store+path+filename describes the file uniquely - but that's only true in an operating system context. The Framework also needs to handle the time component - a more appropriate mental model here is a git repo, not a filesystem. Effectively, you are making the Framework bow to the requirements of a classical unix file system, instead of trying to make the FS serve the actual needs of the Framework.

So I'm pretty much arguing that the tuple stored in the DBField needs to contain enough information to identify a file uniquely, and that "your" file identifier hasn't got enough granularity - it lacks support for versioning and variants. @hafriedlander was initially arguing that the hash by itself is a good enough tuple, but that was struck down because of the lack of context for constructing the URLs.

So to summarise the proposed solutions so far:

| | micmania1 | nyeholt | this RFC |
| --- | --- | --- | --- |
| path | X | X | |
| filename | X | X | X |
| context* | | X | X |
| variant | | | X |
| hash | | | X |

*) you are right in saying that having a ParentObject reference is a hidden way of specifying the details of the store, which you accomplish by introducing a textual label.

If we can agree that the tuple presented in this RFC is necessary for our goals, then the only major difference that remains between your solution and this RFC (which can be easily seen by comparing the two diagrams) is how context is handled. You prefer to use injection via a string identifier (i.e. your label) and to have well-defined interface for readers and writers. We are more vague and just point back at the ParentObject, saying - figure it out on your own. Both approaches require people to decorate DataObjects, so it feels like this difference is just a matter of opinion :-)

So what remains is: do we agree on the tuple contents, and if we do not, then how do you propose the versioning is handled? If we can't reconcile our views, then the only remaining option is arbitrage of the competing proposals by the core-committers.

P.S. I don't know about everyone else out there, but this discussion really helps me to understand what problems are out there and how they can or cannot be solved. I hope it's as useful to you too!

nyeholt commented 9 years ago

Not ignoring versioning at all; the point of having the FileContent field stored as a string is to allow for this.

I think you're assuming that the file identifier string is being constructed based on a file system representation, which is not the case at all. A ContentWriter is responsible for creating a unique identifier for a given file (or make use of the default abstract implementation). The only requirement being that the identifier can be used to look the asset up later. I think the examples having / chars have given a false impression :) I'll add a couple examples later tonight that show how a file actually gets mapped from a few use cases; uploading a file, creating a variation, and storing a theme in a CDN. Which is another use case I had forgotten about until now 😂

mateusz commented 9 years ago

@spronkey thanks for commenting - I'm pretty much going to link other comments in here, because we already spoke about these things.

Like I wrote in the response to @nyeholt, versioning support is a critical feature of the Framework. Retrofitting versioning into the current model is what we do at the moment (see the silverstripe-versionedfiles module), and it is rather ugly, because we end up hacking the API by packing the version number into the Filename. What we are trying to do is to solve the root issue here, by making files first-class data types.

Regarding how the tuple is represented, see Sam's argument (search upwards for "The issue isn't so much the tuple object itself as the illusion of flexibility that it would provide.").

Aiming at 3.x just makes the implementation harder - and it's us doing the work, so why are you worried? :-) Community will have the final say in judging if the solution is backwards compatible before merging.

Regarding the Flysystem, I don't think we should get stuck at seeing it as the one and only solution - in all proposed solutions there is an added abstraction layer on top. We should definitely be thinking in broader terms :-)

mateusz commented 9 years ago

@nyeholt I see, yes I did assume that "your" file id was well defined. Would definitely be great to see how you solve all requirements listed in the motivation at the top.

mateusz commented 9 years ago

Oh yeah, sorry @nyeholt - one more thing :-) With the RFC process, the author drives the development. If you make a proposition, it's assumed you have the resources to implement it.

In other words, we don't need to all agree on the specifics of the solution. If this RFC is approved by the core-committers as "a solution", then we are free to implement it that way - because it's us doing the work. This is especially true if we only differ in opinions, not the substance :-)

spronkey commented 9 years ago

@mateusz My point re: the DBField thing - I'm trying to understand why FileReferenceField is a glorified string, and not a DataObject in its own right. You say this proposed "first class data type" is solving the root issue - but I see it as a dirty hack and a step backwards that's bound to cause headaches in the future.

DBField instances should represent single values - FileReferenceField is storing multiple values. Some RDBMSs have denormalised type support (e.g. Postgres's JSON type), but SS's ORM supports databases that don't (e.g. MySQL), which tells me that FileReferenceField really should be a table of its own, and thus in SS represented as an object in its own right, with hash, variant, filename and parent object as its fields?

Someone above mentioned the case where systems other than SilverStripe will be accessing the files - I think this is worth further consideration. With the FileReferenceField, there's a big impediment to using raw SQL to access the files, and hierarchy information stored in the database - one has to pull apart a string. Using a DataObject for this removes this impediment completely.

Re: versioned files - if FileReferenceField was an object in its own right, versioning could be taken care of in the same way that SiteTree versioning is. The only catch is that some additional handling would have to take place when files are stored, to ensure that versions remain distinct.

Sam's comments around the tuple object were referring to using a tuple as the params to the APL API, not the storage of file information in a denormalised DBField tuple. I agree with Sam's comments re: not using a tuple for APL API, and using distinct params as proposed. If the API changes there should be a clear notice to the API consumer.

hafriedlander commented 9 years ago

@spronkey http://api.silverstripe.org/3.1/class-CompositeField.html - each element of the tuple would be its own column in the table, but there is no reason we need to, or should, have a single field represent a single column.

hafriedlander commented 9 years ago

Sorry, I meant http://api.silverstripe.org/3.0/class-CompositeDBField.html
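To illustrate the composite-field approach: each tuple component becomes its own prefixed column, while the developer still deals with one logical field. This is plain PHP mimicking the shape of CompositeDBField's column mapping, not actual SilverStripe code, and the column types are assumptions:

```php
<?php
// Sketch: one logical FileReference field expanding to several real
// columns, so raw SQL can still query filename/hash/variant directly.
class FileReferenceFieldSketch
{
    private $name;

    public function __construct($name)
    {
        $this->name = $name;
    }

    // One column per tuple component, prefixed with the field name -
    // analogous to CompositeDBField::compositeDatabaseFields().
    public function compositeDatabaseFields()
    {
        return array(
            $this->name . 'Filename' => 'Varchar(255)',
            $this->name . 'Hash'     => 'Varchar(64)',
            $this->name . 'Variant'  => 'Varchar(64)',
        );
    }
}

$field = new FileReferenceFieldSketch('MyFile');
print_r(array_keys($field->compositeDatabaseFields()));
// keys: MyFileFilename, MyFileHash, MyFileVariant
```

This also addresses the raw-SQL concern raised above: other systems can read the components without pulling apart a serialised string.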

nyeholt commented 9 years ago

Ended up getting tied up all weekend, so will respond in a bit more detail from work. Regarding

In other words, we don't need to all agree on the specifics of the solution. If this RFC is approved by the core-committers as "a solution", then we are free to implement it that way - because it's us doing the work. This is especially true if we only differ in opinions, not the substance :-)

My main reason for going through this thoroughly is because I have an existing implementation that has been built with the same goals in mind, and want to make sure I understand how the proposed API will work under the use cases I've implemented or have in mind - I'm quite happy to dump a bunch of code for a better and more widely maintained alternative :).

A couple of thoughts on the APL interface

nyeholt commented 9 years ago

To address the main motivation points, things are handled much as decided upon in the RFC - the key thing being the file reference / tuple to uniquely identify a file without being tied to filesystem / hierarchical restrictions.

Currently implemented as an extension on Folder - a StoreInCDN field allows a user to select which storage engine should be used for storing all its contained files (property is inherited down the folder tree). As SS doesn't define a 'root' folder object, the absolute root doesn't have a setting for this at present - live implementations using the module have the top level folders set appropriately.

The s3cdn module is an example of this.

An extension to File defines the CDNFile field, of type FileContent, which is stored as a varchar in the DB against the file record. I haven't actually gone through updating the Files & Assets UI to make it usable, but the theory is that, as the identifier is a property of the DB record, the standard Versioned extension should be able to handle determining which file gets looked up.

It's actually not much different from now where

I actually did a bit of work recently on the advancedworkflow module to at least allow a workflow to be applied to a Folder :D It doesn't do much without 'proper' versioning, but functionality-wise, once Versioned is working, workflow will behave as it does for page content.

By using the FileContent field, the file reference can be stored directly as a field against the data object (as per the RFC). However, the existing implementation is layered over the existing UploadField functionality, so uploads go to a file that's then accessible via Files & Images.

So, going back to the point of the RFC, I think the crux of the matter is that assets are re-mapped to another form of identifier (ie the Tuple), and that there is an interface through which content can be looked up or set making use of that, which the APL provides. Given this is the basis of what I identified with the work I've done, I'll try and avoid any further distraction from the point of this thread - the RFC identifies the core point of solving the overall problem, and defines an interface to be able to start working towards that solution :+1:

mateusz commented 9 years ago

Cool, then that sounds to me like we have resolved, or at least addressed, all of the concerns that were raised :-) Thank you all for contributing your valuable time!

Time to submit to core-committers.

nyeholt commented 9 years ago

There were still a few questions interspersed above; I've condensed them below.

One last thing, relating to the order of arguments in the API; I would rather write code such as


$tuple = $apl->setAsString('Some text to be stored');

without providing any other bits of information, and receive back a file identifier. I don't care where or how it is stored, just that it is.

The other thing I'd like to see is for the setAsString method to handle different inputs; it may not make sense to call it setAsString in that sense, maybe setContent / setAssetContent?


// file.txt exists, so its content is used
$tuple = $apl->setAsString('/path/to/file.txt');

// passing in a resource, which may be a large file somewhere
$tuple = $apl->setAsString(fopen('http://some.where/video.mp4', 'r'));

// if the file doesn't exist, we're assuming a string literal
$tuple = $apl->setAsString('/path/to/non_existent_file.txt');

// this could actually handle copy operations - internally a call would be
// made to $apl->getAsString($someOtherTuple['hash'])
$tuple = $apl->setAsString($someOtherTuple);

tractorcow commented 9 years ago

I would also suggest simply naming the class FileReference over FileReferenceField, since it follows the naming convention of other DB fields, and is shorter. FormFields, on the other hand, have the Field suffix. :)

kinglozzer commented 9 years ago

Arrives fashionably late to the party.

Firstly, I think this RFC is awesome: it’s really well structured and written, and already answered a few questions I had from reading the Google Group discussion. Great work!

I share the same concern/query as @dhensby about supporting multiple backends on a single site, and just wanted to clarify. Example scenario: a site uses the “old” SimpleBackend for the CMS and DataObjects, except for one model which integrates with some imaginary third-party API and uses S3 for asset storage. To solve this, are you suggesting that you’d write a single APL which would then derive which backend to use from the ParentObject?

The other queries I had were around derived files. I’m trying to frame things step by step: once the user requests a derived file via something like $fileReference->derive("Resized640x480")->getAsURL(), a clone of the FileReferenceField is created and returned with the Variation parameter set. getAsURL() is called on this new object - we then have to check for the file’s existence (possibly across multiple cascading backends?); if it doesn’t exist, we have to realise that the file is a derivative, fetch the original, resample/derive it, and store the result in the backend. Would this all be bundled into the APL implementation, or would parts of it be handled in FileReferenceField? Might it make sense to have a generic exists() method in the APL, or would we just do something like if($apl->getAsURL() !== null)?

Note the APL should not change tuple values for derived files

Derived files may have different filenames, should the tuple values not reflect this? Or is this to always be calculated “on the fly” by the APL?

On target SilverStripe version: is there any convincing reason not to put this in 4? I think it’s going to be very hard to guarantee 100% backward-compatibility - there’s probably someone, somewhere out there relying on the File db schema that we’re proposing to change. Why bother targeting 3 and making it fully backward compatible when user code will need a huge overhaul to actually take advantage of this anyway?

mateusz commented 9 years ago

@nyeholt Regarding additional operations - I've added a note in Appendix A to make it clearer that this is a preliminary interface. I agree we might need some additional operations, but not all of them will be available on all APLs. Maybe there could be a set of interfaces that work together? Say: APLBase (for base commands), APLDeletions, APLCopying.

last thing, relating to the order of arguments in the API

Hmm not sure if I follow you. In my mind this is not possible: the tuple defines the "slot" where you are expecting the file should live, and the APL doesn't have a way to work these parameters out by itself. For example how would it find the filename or a variant just from the content?

The other thing I'd like to see is for the setAsString method to handle different inputs

Making the setAsString parameter of unspecified type is a similar argument as making the tuple a completely flexible object: the interface becomes fuzzy. We can probably add more operations though if you want? Such as setFromStream, and setFromTuple?

I see there being merit as being able to reference a folder for certain operations

The source of truth for files is in the DB, so normally you wouldn't ask for available tuples. It's not in the interface, but you can add reflection methods if you need. You can even define the APLReflecting interface if you want :-)

mateusz commented 9 years ago

@tractorcow changed - I'm all for consistency.

mateusz commented 9 years ago

Hi @kinglozzer , welcome :-)

supporting multiple backends on a single site

We started calling this "shards". Yes, you are correct, this is derived from the tuple by the APL. All file operations in the framework will use a single APL, so you need to choose/write an APL that supports sharding. I was imagining such an APL would come with a decorator for DataObjects to implement ->getShard() operation, and then from your perspective you would apply that to File and to your model and implement the getShard().

If you don't like that, then you could also instantiate and manage your own APL, as long as you are staying outside of the File hierarchy.
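As a purely illustrative sketch of that sharding idea (the class names, the getShard() fallback behaviour, and the backend map are my assumptions, not part of the RFC):

```php
// Illustrative only: pick a concrete storage backend per "shard", derived
// from the tuple's parent object. getShard() and the backend map are
// assumptions for the sketch, not part of the proposed APL interface.
class ShardingAPLSketch
{
    private $backends; // shard name => backend handle

    public function __construct(array $backends)
    {
        $this->backends = $backends;
    }

    public function backendFor($parentObject)
    {
        // Objects that don't opt in fall back to the default shard.
        $shard = ($parentObject && method_exists($parentObject, 'getShard'))
            ? $parentObject->getShard()
            : 'default';
        return $this->backends[$shard];
    }
}

// A model decorated with getShard(), as imagined above.
class S3ShardedModelSketch
{
    public function getShard()
    {
        return 's3';
    }
}
```

The point being: the framework still talks to a single APL, and the shard decision is an internal detail of that APL.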

derived files

This responsibility will stay in the Image - neither FileReference nor the APL will handle processing. FileReference will only do the generation of the Variant string, and APL provides additional slots for variants - that's it.

Yeah, we will need some way of checking if the generated file already exists - have updated the "derived files" section with new pseudocode, and added a note to the Appendix A about "exists". I don't know how exactly that will be done yet, but I agree we must solve this :-)

Derived files may have different filenames

Incorrect - derived files have a different tuple: the Filename part stays as it is, but the Variant part changes. In effect, the APL only provides additional "slots" to store variants without messing around with the Filenames. The Variant will be included in the URL somehow, but it's not dictated how :-)

The APL does not know how to produce derived tuples - this will be FileReference's responsibility and it will be done "on the fly". Neither does the APL know how to process the content - that stays Image's responsibility.
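A minimal sketch of that "on the fly" derivation, assuming an associative-array tuple (the RFC doesn't fix the representation, so the array shape here is illustrative):

```php
// Minimal sketch of an "on the fly" derive: the Filename and Hash parts
// of the tuple are untouched, only the Variant changes. The array shape
// is an assumption - the RFC doesn't fix the tuple representation.
function deriveTupleSketch(array $tuple, $variant)
{
    $derived = $tuple;
    $derived['Variant'] = $variant;
    return $derived;
}
```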

target SilverStripe version

I stated my view already, but just realised it's missing the rationale! For me, the driver behind targeting 3.x is that this is what I need for the project I'm working on. I think it's fair to allow this if it's not causing harm to people - because this is part of the driver for us actually doing this work in the first place.

If you step into my shoes - imagine you have a project where you need features X and Y. You make a proposal for that feature and put up your hand to develop that. But the community thinks only X is a good idea, and Y is not required, and therefore it shouldn't be included. From your perspective - you lose all interest in doing this work because you need both X and Y. So the feature X never gets done either, unless someone else picks it up :-)

Ok, with all honesty, it's not THAT dramatic - I am simply more willing to do the work if we target 3.1. Our alternative is to make and maintain a Framework fork, but I'd personally much prefer to put the effort into making this feature backwards-compatible rather than doing the mindless labour of maintaining a fork.

I think this is something that the core-committers should resolve - would be great to hear from everyone else :-)

nyeholt commented 9 years ago

Hmm not sure if I follow you. In my mind this is not possible: the tuple defines the "slot" where you are expecting the file should live, and the APL doesn't have a way to work these parameters out by itself. For example how would it find the filename or a variant just from the content?

In many cases you do: a method such as setFromLocalFile has some of the information - or the interface could be expanded to support multiple input types, e.g. setFromStream / setFromTuple. Even in the case of calling "store this content for me, I don't care about a name, just give me a pointer for future reference", a unique name could be generated to handle that case.

In any case you're already mentioning it above

A special case of a tuple rewrite is when a caller passes "null" as one of the tuple values. Caller can then expect the APL implementation to generate the value. This would be used when creating new files to obtain a hash.

I'm just making the argument that you should be able to pass "null" for all the tuple values and the APL implementation should generate it. In which case it makes sense to have the content as the first parameter.
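A minimal sketch of what that could look like, assuming an associative-array tuple and a hash-derived fallback filename (key names and the ".bin" suffix are illustrative, not proposed):

```php
// Sketch under assumption: the APL fills in any tuple values passed as
// null. The hash comes from the content, and a missing filename is
// generated from the hash. Key names and the ".bin" suffix are illustrative.
function completeTupleSketch($content, array $tuple)
{
    if ($tuple['Hash'] === null) {
        $tuple['Hash'] = sha1($content);
    }
    if ($tuple['Filename'] === null) {
        $tuple['Filename'] = $tuple['Hash'] . '.bin';
    }
    return $tuple;
}
```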

Making the setAsString parameter of unspecified type is a similar argument as making the tuple a completely flexible object: the interface becomes fuzzy. We can probably add more operations though if you want? Such as setFromStream, and setFromTuple?

It's an area where PHP's lack of polymorphism bites; to my mind, having a single method to call at the service layer for the operation simplifies things greatly, as it means 'my' code, regardless of what I'm working with, doesn't have to worry about which method to call on the service. To me, having a service interface with setFromXXX seems cluttered and annoying to use as an end consumer, but it's more opinion than a design fault :).
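For illustration, the kind of single-entry-point dispatch described above might look like this sketch (the function name and the tuple-as-array assumption are mine; the RFC instead proposes distinct, explicitly-typed methods):

```php
// Sketch of the single-entry-point style described above. The function
// name and the tuple-as-array assumption are illustrative; the RFC
// instead proposes distinct, explicitly-typed methods.
function normaliseContentSketch($input)
{
    if (is_resource($input)) {
        // A stream: read it out (a real APL would stream, not buffer).
        return stream_get_contents($input);
    }
    if (is_array($input)) {
        // A tuple: a real APL would look up the content by its hash;
        // here we just record the intent.
        return 'copy-of:' . $input['Hash'];
    }
    if (is_string($input) && is_file($input)) {
        // A path to an existing file: use its content.
        return file_get_contents($input);
    }
    // Anything else is treated as a literal string.
    return $input;
}
```

The trade-off is exactly the "fuzzy interface" concern raised earlier: the caller's convenience comes at the cost of the method's contract being implicit.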

The source of truth for files is in the DB, so normally you wouldn't ask for available tuples. It's not in the interface, but you can add reflection methods if you need. You can even define the APLReflecting interface if you want :-)

The source of truth for SilverStripe's File objects is the DB. I guess my thinking here is that if all of SilverStripe's File interactions are going to move to using a tuple representation, it'd be really good if I could make use of the same/similar APIs to have a tuple representation of files from other sources so that they can then be passed through SilverStripe in a structure that is consistent regardless of the actual source of the file. But as I mentioned, folder iteration is an extension to the API that's not necessary for what the RFC is describing.

willmorgan commented 9 years ago

I am simply more willing to do the work if we target 3.1

I think 4.x makes more sense, tbh. If we are doing semver properly, anyway (excuse my brain fart in suggesting we drop 5.3 in the next 3.x). Sadly, maintaining a separate branch, especially with an RFC that's going to be full of API changes, is a pain... but if you're willing to do the BC changes so it'll go into at least 3.x anyway, then I can't see why it shouldn't go into 4.0 in the first instance.

Generally in agreement with @nyeholt regarding setting content via other methods, as limiting to setting content via strings sounds like it would be a problem for memory consumption, etc.

I've had a think about a FileTuple vs the proposed, argument-based solution. I think that using an object and ensuring it returns a valid tuple-formatted string on __toString would be a nice halfway house. But on the other hand, I'm having trouble finding any sort of operation that would need to be extended from the original tuple object. It seems it's purely syntax sugar, so perhaps we should look at ensuring the FileReference object is easily extensible via Injector instead.

If we code to interfaces, future versions would trigger warnings on API changes, so I think that covers @sminnee's concerns.

willmorgan commented 9 years ago

More blue sky thinking :trollface:

I think it would be nice if this piece of work (and future features) was packaged in a separate component like Symfony's components are... maybe as a module. Perhaps that would make backporting a bit easier?

If it's a brand new DBField, then I don't think it would be too painful. From my experience using namespaces in isolated circumstances in 3.x, you could keep File/Folder/Image pretty separate from the soon-to-be legacy versions.

tractorcow commented 9 years ago

I think 4.x makes more sense, tbh.

I also agree, since it adds the extra level of complexity required to maintain BC, and that additional level of doubt as to whether we've adequately achieved this. In my mind this can be effectively minimised, but never fully eliminated. :)

If we code to interfaces

I think interfaces and developing to the API will definitely become a bigger part of future development of the framework, as we better develop our commitment to semver. :cake: It'll help keep us honest, so to speak.

I think it would be nice if this piece of work (and future features) was packaged in a separate component like Symfony's components are... maybe as a module. Perhaps that would make backporting a bit easier?

Like we did with the siteconfig stuff, I think it might be an idea to look at moving some of the default (but not mandatory) functionality to a separate module. With Composer's provide mechanism, it would better support alternate backends meeting the necessary dependencies with alternate implementations. What do you think?

nyeholt commented 9 years ago

I think it would be nice if this piece of work (and future features) was packaged in a separate component like Symfony's components are... maybe as a module. Perhaps that would make backporting a bit easier?

Not necessarily make backporting easier, but it would definitely make the process of providing alternative implementations better. I'm quite keen to write a concrete APL implementation to wrap around the content-services module to give a proper testbed as the core changes are being made. Once the core changes are made, more specific APL implementations can then be written.

I think 4.x makes more sense, tbh.

Realistically there's no way a change of this scale can be made within a minor release. Yes, it's possible a 'current stable' compatibility layer could be made, and that, keeping within the bounds of framework and cms, it would not cause any b/c problems - but with the number of modules and custom code out there that do file-related things, it's almost guaranteed things will break. Part of the problem is that lots of code relies on the lack of restriction or structure around how files are currently managed.

Trying to stay compatible with 3.2 will cause compromises to be made as to how the APL is implemented, instead of time being spent on making it as good as it could be - the content-services module started down the path of being implemented as a full File/Folder replacement, but it quickly became apparent that it'd be impossible to do so prior to a major release unless we made some compromises, hence why it's a module that hooks in to the side instead.

mateusz commented 9 years ago

I'll sit back and try to facilitate this somehow - it's time for other core committers to speak up. Calling @hafriedlander @sminnee @tractorcow @dhensby @chillu @kinglozzer @halkyon @stojg @willmorgan @wilr :telephone_receiver:

So, dear core-committers, I'd like to ask you to treat this RFC as a unit. As a reminder, try to focus on raising critical issues on a constructive basis first - i.e. if you think the solution will completely fail for some reason (the targeted-version question looks like this kind of thing). Try to specify exactly what change you are pushing for - including a code example. If you are simply happy with the solution, say so :-)

If you want to talk something over or get more information, I'm happy to Skype or meet you in person. There will be a meetup talk next Wed, so come to that. There is also our hangout coming up on 3rd Feb.

My position

Thanks all :-)

wilr commented 9 years ago

@mateusz Nice work leading off the RFC process.

The File and Folder would retain their APIs as much as possible, but be retrofitted to use the FileReference. They would now conceptually become an inseparable part of the Files & Images subsystem of the CMS.

More of a reason for 4.0 target release. File / Folder is currently Framework only. Moving to the CMS is a major API change. Should be done properly.

We will most likely use the Flysystem as a backend for this APL - although we will need to resolve the Flysystem's PHP 5.3 incompatibility somehow (we want this change to be submitted against the 3.x branch).

Agree with others, not worth the effort. It would likely mean forking Flysystem, which I'd like to avoid - make this 4.0 and PHP 5.4 only.

These will need to be rewritten to use shortcodes. Direct file references by path are no longer valid.

It would be interesting to know what the shortcode within the TinyMCE interface would reference, since you don't have that File object to refer to (HtmlEditorField is part of framework / File will be CMS-only, right?), and how that would look.

mateusz commented 9 years ago

@nyeholt, for clarity, I have updated the RFC with some code snippets in "FileReference" chapter, and added a new Appendix to cover off the FileReference API. APL interface already permits nulls on all tuple fields, and if you look at Appendix C, you will see that this implementation handles all-null tuples :-)

@wilr I'm not sure if we would be moving Files & Images to the CMS as part of this RFC - it would fit there better, but I think it could be too much work to do in one go. If Files & Images moves to the cms module, then the HtmlEditorField will also either need to move there, or lose the ability to embed files and images. So the shortcode could still actually reference the File.

We can maybe be more generic though, how about:

[file,parent_class=Image,parent_id=7,field_name=AttachedFile,variant=Resized640x480]

Then the shortcode subsystem would parse it like this:

$parent = DataObject::get($parent_class)->byID($parent_id);
$derivedReference = $parent->obj($field_name)->derive($variant);
return $derivedReference->getAsURL();

Would that work?

mateusz commented 9 years ago

I figured out I will attempt to keep track of the current state of the consensus.

willmorgan commented 9 years ago

With composer's provides it would better support alternate backends meeting the necessary dependencies with alternate implementation. What do you think?

@tractorcow I'm for this, though we would need to see through another RFC that makes Composer mandatory. Without deviating too much, while we're making sweeping changes to functionality, dropping proprietary autoloading and using Composer's classmap / PSR-4 functionality would shrink the codebase ;)

stojg commented 9 years ago

No comments on the solution. I will most likely have opinions about naming of things, but those comments will be added to the PR.

Preferred version is 4.x.

chillu commented 9 years ago

Regarding short codes for [file] references, do you assume the APL backends cache the getAsURL() data? Or would this be the responsibility of the consumer, in this case HTMLText?

I'm still a bit confused about how non-local filesystems like S3 would actually store the file uniquely. If the Hash tuple value is an actual hash of the base content, it would change with every file modification. The file name, on the other hand, isn't unique enough: multiple files with the same name but different content can be sent to the APL from different Folder containers. Or you'll try to attach a different logo.png for different company DataObjects. From a user's perspective, neither of these should cause a conflict, right? So would the S3 filename be composed of information from all four tuple values? Or do you use the (globally unique) content hash and somehow rename the S3 key through the APL on modification?

The "Handling directories" section and the APL Appendix describe a "ParentObject" as a generic object. Does this always refer to a DataObject? Assuming we want any APL implementation to consume details from this object, we need to either create a class interface for it, or standardise on an existing "interface" like DataObject. In case it's always a DataObject, ClassName + ID would provide you a unique identifier. Or, in case it's a File, you could check the Folder parent and try to replicate the hierarchy in the APL storage (with the syncing caveats you mentioned).

Regarding my earlier questions about streams: I wasn't referring to video streams, but rather PHP streams - i.e. content sent to the client without loading it completely into PHP memory (which getAsString() and setAsString() seem to preclude). See https://github.com/thephpleague/flysystem#using-streams-for-reads-and-writes. Streams are particularly important for secure files where no public URL exists. I guess those are APL interface details which are yet to be determined.

dhensby commented 9 years ago

I think lots of my concerns have worked themselves out here.

I do think that targeting 3.x is a bad idea; the reason people are concerned about the effort of supporting 3.x is that we are all invested in SilverStripe and want to see the most efficient use of each other's time to improve it. Plus, we don't want to be stuck in a discussion about whether it's "BC enough". The SemVer spec is very clear: if a public API breaks, it is a new major version. And if the DB fields that store a file have changed, that's a very public API break...

Can we get some clarification on versioning - or rather, on what exactly is versioned?

If a DO stores a tuple of a file and I want to link the same file to another DO. They'd have identical file tuples, correct? Then if I edited the file in the "files and images" interface, would it then edit the files on the objects that have that tuple? What if those objects are versioned, will it make a new draft version? What if I edit it via the DO directly, would it change on the other object too, would it create a new "version" of the file or just a new file?

How would a central filestore work if the tuples represent the files and have no central store?

mateusz commented 9 years ago

@chillu I have added some variants into the interface appendix to clarify this is not solidified yet (i.e. added setFromStream, getAsStream).

Regarding the ParentObject, you might be right, but I was thinking that we could just leave it be for now and see what people come up with, and only then standardise on an interface (or interfaces). For shortcodes, you need an underlying DataObject to get the FileReference, and you can find it by looking at the ID + ClassName in the shortcode, but that doesn't mean the APL needs to use the DataObject interface. For all we know, it might identify pieces of data uniquely by Hash only. But yeah, DataObject or a subset of its methods seems a natural choice.

I'm still a bit confused on how non-local filesystems like S3 would actually store the file uniquely.

Different tuple = different file from the Framework perspective. But from the APL perspective, as long as it can guarantee that it will provide the right data for the right tuple, we shouldn't need to care how it's done.

To ensure proper data matching, it might just use the hash+variant, or filename+ParentObject+variant, or it might return a modified tuple.

The bottom line is: two different files might actually be stored only once on S3, assuming that APL can find the right data piece, and there is more than one way to skin this particular cat.
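As a sketch of one way to skin that cat (the key scheme here is hypothetical, not part of the RFC): keying storage purely by content hash plus variant means two tuples that differ only in Filename or ParentObject map to the same stored object, so identical content is stored once.

```php
// Hypothetical key scheme, not part of the RFC: store by content hash
// plus variant, so tuples that differ only in Filename or ParentObject
// share one stored object on the backend.
function s3KeySketch(array $tuple)
{
    $key = $tuple['Hash'];
    if (!empty($tuple['Variant'])) {
        $key .= '-' . $tuple['Variant'];
    }
    return $key;
}
```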