Minimise frequency of webserver access to asset store

tractorcow commented 7 years ago

Acceptance critieria

Testing reveals that S3 access, while it works well, has severe performance issues when run as-is. We need to eliminate the distance that heavy content (file streams) has to travel in order to be useful, since these operations can be expensive.

In most execution caches (e.g. with a warm cache), the asset-admin UI needs to operate without webserver-mediated requests to the asset store. (But direct requests from the browser to the asset store, e.g. for getting the thumbnails themselves)
Image resampling on upload need to happen without an unnecessary fetch from the asset store.

PRs:

Proposed solution

We should address this by implementing an asset adapter cache wrapper, which can work around any other backend, and cache common calls such as:

has($path)
read($path)
readStream($path)
getTimestamp($path)
getSize($path)
getMetadata($path)
getMimetype($path)
getVisibility($path)

In order to ensure consistency in multi-server environment (which multiple caches may be present) I suggest to use the File dataobject as the source of truth where necessary, and allowing for some temporary inconsistency between the File dataobject and the stored asset (see https://github.com/silverstripe/silverstripe-framework/issues/5544)

In order to optimise further for performance, we should take advantage of the fact that files $paths are segmented by SHA and assume that, for instance, filesize of a given SHA won't ever change.

Details

A cache adaptor can be applied to either a public, private, or both adapters attached to a filesystem.
Cache adaptors can be nested (e.g. memory cache, file cache, failover).
Cache configuration is entirely optimiseable via yml config
A site using s3 as a backend does not experience unreasonable performance degradation over a site using direct file assets

Tasks

Implement abstract cache backend with pluggable cache + failover backend (flysystem interface)
Implement file based cache backend
Implement memory-based cache backend
Solution for caching of stream content (may be non-psr 6/16)
Build good default cache config
Document customisation of caching
Unit testing

Excluded

listContents($path) (directory listing, not utilised by AssetStore)
AWS specific implementation for cache backend

AWS Specific

See AWSMP-686 for backend cache provided on AWS.

Following up from:

sminnee commented 7 years ago

This makes sense for meta-data. For read() and readStream() I'd suggest looking more holistically at the best server architecture, and then ensure that API has the right elements to support that architecture.

Saying that assets are cache from S3 -> www boxes and then delivered to the user is one possible approach, but I'm sure that The Server People have other thoughts too.

tractorcow commented 7 years ago

This makes sense for meta-data. For read() and readStream() I'd suggest looking more holistically at the best server architecture, and then ensure that API has the right elements to support that architecture.

I think that having a multi-tiered cache makes sense here; At the top level you would have a memory cache for basic "small response" methods, but for physically reading file contents, it probably should fail to the secondary filesystem cache, which fails over itself to s3. Each server should be able to build the cache from these blocks based on what works best for its configuration.

chillu commented 7 years ago

Another way to approach this would be a transparent HTTP cache in the S3 module. It'd be per-server, but that can be solved by having a fast shared cache (e.g. Memcached) in multi-server setups - I wouldn't optimise our framework design on the current restrictions of SilverStripe Platform. A shared cache implementation is coming up soon in their backlog by the way.

Even with that cache, I assume the first access will be slow - do we need to populate the cache on upload?

How much of this delay comes from synchronous/sequential HTTP requests to S3 in the same SilverStripe HTTP request. To retrieve a "page" of assets in asset admin, it calls ReadFileQueryCreator and File::get(). To create those inlined thumbnails, it'll do a lot of HTTP calls, right? Maybe we can batch this one particular use case better? That could decrease the need for caching, and ensure that the first uncached response is fast as well.

sminnee commented 7 years ago

I think it'd be best to get a clear picture of the whole site architecture that you'd be going for, and then work out what bits in framework/assets-module you need (and what you'd need in the AWS implementation).

These ideas imply that assets should all be cloned back to the WWW boxes and served from there, which seems like a massive assumption, especially given that the total size of assets typically dwarfs everything else.

Do you need a huge chunk of EBS storage in addition to a huge chunk of S3 storage? What are the cost implications of this? What if that gets too small — do you need to leave lots of headroom in the EBS size, thereby throwing away one of the benefits of S3 storage (that you have less of a hard limit on asset volume).

tractorcow commented 7 years ago

These ideas imply that assets should all be cloned back to the WWW boxes and served from there,

Public assets should still be served from the s3 urls. we only clone back to the server if serving protected content. It would also need to be a limited lifetime cache with cron to flush.

sminnee commented 7 years ago

Where are these performance issues experienced? I think we need to reframe the acceptance criteria here to elaborate on the problems that we're trying to solve, as well as the technical solution proposed.

Some possibilities:

viewing of folders in asset admin
front-end access of private assets
front-end access of public assets

Is it all of those 3 or just some? Are there other key areas of pain?

tractorcow commented 7 years ago

Where are these performance issues experienced? I think we need to reframe the acceptance criteria here to elaborate on the problems that we're trying to solve, as well as the technical solution proposed.

Loading a list of images in gallery; Simply getting the list of files, and checking they exist in the file store, has a seemingly huge slowdown. This isn't limited to protected assets, but is worse for those files.

All of the items you listed. :)

sminnee commented 7 years ago

Solution for caching of stream content

Are you sure that this is necessary? As I understand the problem-space, if stream content has been sent to the webserver, we've done something wrong — all of the other S3 assets work (such as cloudfront-secured-URLs for protected assets) seems to be focused on avoiding having the assets get to the www servers in the first place.

The one place I can see it as being potentially necessary is for the resampling of assets, but there's probably a number of possible solutions to that.

One particular worry that I have is that the WWW server could very likely not have enough space on it to cache all the assets.

tractorcow commented 7 years ago

Major and important use case is resize-on-upload. For eventually-consistent assets, we can't access any newly added files, so the feature is broken. The only way to capture these assets is to cache anything written to s3 to a temp local file, so that the immediate request to resize thumbnails doesn't fail.

We COULD possibly use the actual upload tmp-file for that (accessed via $_FILES), but that may be problematic unless we persist it somewhere controlled by the cache.

I think this cache could be cleaned up at the end of request, i.e. any files created are then deleted in register_shutdown_function(). It may be acceptable to also only populate it on write.

tractorcow commented 7 years ago

I'll see if there is another way we could cache these manipulations. :)

tractorcow commented 7 years ago

Something like:

public function setFromLocalFile($path, $filename = null, $hash = null, $variant = null, $config = array())
{
    $result = $this->store()->setFromLocalFile($path, $filename, $hash, $variant, $config);
    // Cache path
    $key = $result['Hash'].'-'.$result['Variant'];
    $this->fileCache[$key] = $path;
    return $result;
}

public function getAsStream($filename, $hash, $variant = null)
{
    $key = $hash.'-'.$variant;
    if (isset($this->fileCache[$key]) && file_exists($this->fileCache[$hash])) {
        return fopen($this->fileCache[$key], 'rb');
    }
    return $this->store()->getAsStream($filename, $hash, $variant);
}

sminnee commented 7 years ago

I'm not so worried about whether or not it sits within the same caching subsystem, but I thought it would be useful to clarify exactly what our needs are, which seem to be:

In most execution caches (e.g. with a warm cache), the asset-admin UI needs to operate without webserver-mediated requests to the asset store. (But direct requests from the browser to the asset store, e.g. for getting the thumbnails themselves)
Image resampling on upload need to happen without an unnecessary fetch from the asset store.

Have I got that right?

tractorcow commented 7 years ago

Yes that's right. We need to eliminate the distance that heavy content (file streams) has to travel in order to be useful, since these operations can be expensive.

sminnee commented 7 years ago

Cool — I'm going to amend this card to list those the ACs and the existing content as the "proposed solution"

tractorcow commented 7 years ago

Thanks @sminnee I appreciate your feedback. :) Good job on the new A/Cs.

tractorcow commented 7 years ago

Updated with tests @sminnee

silverstripe / silverstripe-assets