Optimise Gallery Thumbnail transfer

chillu commented 7 years ago

Acceptance Criteria

When an image has a missing thumbnail, a fallback handler is available that can lazy-generate a thumbnail for an image based on thumbnail and return it.
When uploading a file, we do not re-request a server thumbnail if one has been generated locally in the browser that can be used.
Thumbnails for no more than one image are ever regenerated per request.
We never return thumbnail information from the server to the client that is already available on the client.

PRs

Notes

Upload requests are individual requests
Dropzone limits number of files in one upload operation
HTTP POST size and Dropzone limit size of an individual upload file

Background

WAS: Browsing unpublished assets can cause dozens of dynamic requests

In the new SS4 asset implementation, unpublished assets aren't directly accessible to the browser via URL. Instead, they're hidden away in an assets/.protected/ folder, and routed through a SilverStripe request to ProtectedFileController. This controller checks if the file access has been granted based on previous calls to FlysystemAssetStore->getAsURL(). These grants are stored in the PHP session. Files might come from the local filesystem, or through network access (e.g. from AWS S3).

So far, so good. But in reality, this means that anybody uploading a few dozen files will view a lot of protected files right after the upload. Our current pagination limit for admin/assets is 50, so that's at least 50 expensive dynamic requests in very quick succession, to show a single screen page of thumbnails. Browsers have up to 13 concurrent network connections per hostname. Not only is this significantly affecting the perceived loading performance of admin/assets, but it's also likely to cause rate limitations or server process crashes. For uncached network requests (e.g. to AWS S3), every request will also stall for 100s of milliseconds - which will further deplete the pool of web workers available for the website. So if left as-is, a single CMS author will regularly bring down websites just by opening admin/assets.

While we could lower the page size in admin/assets, this isn't addressing the root of the problem (20 expensive dynamic requests caused by a single user interaction are still too much).

Option 1: Publish files immediately after upload (and make "protected files" opt-in) Option 2: Create a more lightweight version of ProtectedFileController outside of the SilverStripe bootstrap (hard because it relies on YAML config and lots of other fun things in SilverStripe core) Option 3: Intercept image requests and batch them into a single call Option 4: Speed up the SilverStripe bootstrap #troll

I've purposely filed this ticket in framework as opposed to asset-admin, since asset-admin just surfaces one variation of the problem. Viewing an unpublished gallery page or a page listing of bunch of documents in custom code has the same effect.

Somewhat related: It'll make it lot more painful to run a SilverStripe site through https://github.com/silverstripe/silverstripe-serve, since php -S is single threaded, single connection - so you're working off those 50 image requests sequentially. With ~300ms response times, that's 15s of total load time, plus the initial CMS response times. And consequently, it'll make it veeeery slow to run Behat tests this way, assuming Selenium waits for window.onload rather than document.onload

There's a separate task for improving performance (memory usage) on streaming files: https://github.com/silverstripe/silverstripe-framework/issues/5462

Once this is solved, increase pagination size from 15 to 50 items in AssetAdmin (originally tracked under [https://github.com/silverstripe/silverstripe-asset-admin/issues/358]).

/cc @sminnee @tractorcow

tractorcow commented 7 years ago

Could we simply not re-generate the thumbnail on upload, and re-use the in-memory canvas in the interim? I realise this doesn't solve the issue completely, but it does ensure that you don't end up with a massive spike of dynamic requests immediately on upload (but rather delays it until the next time the image is requested from the server).

I would look at this plus one of the above suggestions as a complete solution.

chillu commented 7 years ago

I don't think only solving "load spike on upload" warrants the additional complexity - most of the user impact will be in browsing through assets, rather than uploading large amounts of them (less frequent user action). Thumbnail generation makes the problem worse (longer load times on the GraphQL response), but if we avoid it on upload, it'll hit the user and server the next time the folder is loaded - just as bad IMHO.

tractorcow commented 7 years ago

With my suggestion in place, thumbnail generation would still be performed on upload, but the react code would not re-request the url on successful upload. I.e. In the presence of both a local canvas and a remote url, it would err in favour of retaining the local canvas cache.

clarkepaul commented 7 years ago

Although I like opt 1, @chillu is this dangerous at all if the file contains sensitive information if it is left for say 10 sec published before becoming protected?

Side note: I have mentioned this before that for the files area it might be better to have the "save as draft" inside of the more options pop-up rather than the main action so that there is just the main action of publishing changes (this approach fits with auto save if we were to ever do it). Opt 1 would fit this approach to the UI nicely.

tractorcow commented 7 years ago

@chillu is this dangerous at all if the file contains sensitive information if it is left for say 10 sec published before becoming protected?

I would also have this concern; It would mean that there is no way to safely upload a sensitive file that is NEVER public.

I suggest still having the initial upload request generate the thumbnail, as this could be a memory intensive operation for certain files. Trying to resample multiple images in a single request would be the highest performance bottleneck.

Once all uploads are complete, you could assume that all thumbnails have been generated in the backend, and at that point you could potentially trigger thumbnails to be requested (option 3), or not until those assets are requested at a later time (my suggestion).

clarkepaul commented 7 years ago

The "save as draft" button within the assets area is actually pretty useless in most cases as you are not normally viewing the file in its live context as presented on page(s) directly. The "Save" button does give you the option to go and check those pages before publishing but I doubt people would bother.

sminnee commented 7 years ago

Option 1 defeats most of the value of asset staging/versioning. -100 points from me.

EDIT: However, it would be okay to add something to the upload UI that could let users optionally choose to "immediately publish these files so that they are publicly accessible"

chillu commented 7 years ago

We're conflating two different problems here:

Image thumbnail generation after upload causes long-running and memory-intensive GraphQL requests. If we defer this until later (not at upload), it's still the same issue.
Actual <img> HTTP requests overload the webserver with 50 somewhat concurrent dynamic requests. None of which will generate thumbnails, that's always done in GraphQL requests. So the 50 dynamic requests is the main issue I'm raising here.

The problem with allowing users to opt-in or opt-out from "publish immediately" is that a user action might still make the website unavailable. If a browser fires up to 13 concurrent dynamic requests to a webserver, actual website page views outside of the CMS will get queued until after they're completed. Best case: Only 13 concurrent requests returning in 150ms each = additional 0.6s load time ((50/4)*150). Worst case: 4 concurrent requests returning in 500ms each = additional 6.2s load time ((50/4)*500).

sminnee commented 7 years ago

The problem with allowing users to opt-in or opt-out from "publish immediately" is that a user action might still make the website unavailable. If a browser fires up to 13 concurrent dynamic requests to a webserver, actual website page views outside of the CMS will get queued until after they're completed. Best case: Only 13 concurrent requests returning in 150ms each = additional 0.6s load time ((50/4)150). Worst case: 4 concurrent requests returning in 500ms each = additional 6.2s load time ((50/4)500).

OK, best to ignore Option 1 and focus on Option 2-4 then.

sminnee commented 7 years ago

Option 2 is really a special case of Option 4 and is probably the best road to go down; Option 3 feels like a bit of a band aid that still leaves other uses of unpublished assets exposed.

dhensby commented 7 years ago

Option 4? I mean, it is needed one way or the other; was it not part of the 4.0 roadmap?

chillu commented 7 years ago

Option 4 (Speeding up bootstrap) will be a lot of work, with major areas like the Config API needing an overhaul, replacing extensions with traits, breaking a lot of APIs by more cleanly decoupling them (e.g. a standalone ORM). It's an ongoing effort, but from my understanding not a major focus of 4.x.

And frankly I wouldn't know how much we can chop out of the ProtectedFileController logic, it uses pretty much the whole gamut of core functionality (YAML config, routing, controller stack, ORM, flysystem lib). Best case scenario we get it down from 200ms to 100ms. That still means you tie up close to a dozen web workers for a few seconds, each with ~50MB of PHP memory minimum. Depending on max workers and memory available in your environment, this might be anywhere from 25% to 100% of your web server capacity. So this needs to be a LOT faster/lightweight to be useful.

sminnee commented 7 years ago

It probably needs to be decoupled from the ORM in some way. The system already sets marker values in the session to indicate which protected files the current session has access to; this should be done in a way that lets the fetching of the protected asset happen with a minimum of extraneous calls.

sminnee commented 7 years ago

@tractorcow do you have any recommendations for how we could approach this?

sminnee commented 7 years ago

A few thoughts:

We could limit SilverStripe-mediate access to assets to only those that are explicitly granted. Public assets have to bypass SilverStripe. This would simplify the number of checks available
/assets/* could be wired to route to asset-handler.php instead of main.php with a shorter execution flow.
For the sale of getting a PoC together hard-code the protected asset handler. With this approach, developers would need to replace asset-handler.php in its entirety to reconfigure assets, which isn't ideal. If this is demonstrated as being a big time-saver we might want to investigate some kind of generate-code-from-Injector approach that essentially compiles a given service definition into a few lines of PHP that doesn't need to have Injector or config loaded in order to access it.
We shouldn't need to call both has() and isGranted(). We should only perform an isGranted() check and leave it at that.
We should buffer and pass-through responses rather than using flysystem->read() (FlysystemAssetStore line 857)

I haven't done profiling of a protected asset request so that should be done first.

sminnee commented 7 years ago

FYI Ingo it doesn't look like ProtectedAssetController makes use of the database.

tractorcow commented 7 years ago

The only thing this controller requires access to is session and config. That does mean we can avoid making a DB connection. We could setup a separate main.php which evades db::connect() and route protected assets through this.

As per your last point @sminnee , yes I want to do buffered responses for these files. :) I started working on this a few months ago, but we dropped the priority a bit.

chillu commented 7 years ago

OK, had a good session with Sam for the last hour - we managed to get his script working with protected files: https://gist.github.com/chillu/e8811357a2cd290d88408de3e34b18fb. It replaces the Config layer with a simpler version that hardcodes the required config keys for this particular execution path. It’s a purpose-built script, so those savings won’t translate to other SilverStripe executions, but it’s encouraging!

Looking at XHProf requesting a protected file (on PHP 5.6): Standard routing: 110ms, 12MB RAM test-file.php: 25ms, 3MB RAM

Next steps:

Get url in the same way as main.php
Avoid autoloading File class (and all related gunk), by bypassing or replacing the File::join_paths() use in ProtectedFileController
Document how to replace a filesystem adapter and modify it's configuration (change config in two places)

Longer term steps (out of scope here):

Document how to generate files like these (identify which configuration keys are used)
Create a script to auto-generate these files (by running through one or more execution paths). Alternatively, cache individual config scopes so they can be pulled in on demand
Refactor bootstrap so these "alternative main.php" scripts become easier to maintain

sminnee commented 7 years ago

Ideally Michael Strong's new config implementation would provide some classes that would implement the stub implementation I did in that gist.

It may be worth pushing that forward for an alpha5 goal

tractorcow commented 7 years ago

I've left some ideas on reducing reliance on ProtectedFileController within asset-admin at https://github.com/silverstripe/silverstripe-asset-admin/issues/358#issuecomment-286293854.

chillu commented 7 years ago

Comment from Damian on https://github.com/silverstripe/silverstripe-asset-admin/issues/358

I was thinking about thumbnails / image optimisation

I think we need an API that allows us to get existing manipulations (e.g. padded, resized) which ONLY get existing generated images.

At the moment, for example, $Image.Pad(100, 100) gets a padded image, and if not regenerates it

We probably need $Image.ExistingOnly.Pad(100, 100) which only gets the padded version if found. This is important, because when we are retrieving a list of 30 images per request, we cannot risk loading 30 images into memory and lazy-generating thumbnails. Of those 30 images, a set number of those images may not have thumbnails, and we have some other process of re-generating those thumbnails (outside of the initial pagination query) e.g. just some other endpoint

Secondly, for the request to 50 images, we should (for thumbnails that exist) ensure that protected image thumbnails are sent via base64 encoded blobs in the parent request, to avoid unnecessary requests to the protectedfilecontroller.

so there are four requests that change:

initial upload request (triggers regeneration of the thumbnail). URL/blob isn't returned; Retain in-memory html5 generated thumbnail for these images (until refresh).
pagination request (graphql)
- protected images WITH thumbnails get blob returned
- public images WITH thumbnails get URL returned
- images without thumbnails get url to generate-url endpoint instead
request to protectedfilecontroller is no longer invoked by asset-admin (front-end only)
new endpoint (AssetAdmin) takes a file ID, regenerates the thumbnail, and returns either a url (public assets) or blob (protected assets). This endpoint acts only as a fallback, in case 1. fails to produce the thumbnail, or if it was deleted.

The goals of this are to:

Avoid hitting protectedfile controller many times
Avoid loading multiple images into memory in a single request
Make the UX slightly more responsive

Rough pseudocode for my ExistingOnly method.

public function ExistingOnly()
{
    $dbfile = DBFile::create();
    $dbfile->Filename = $this->Filename;
    $dbFile->Hash = $this->Hash;
    $dbFile->Variant = $this->Variant
    $dbFile->canGenerate = false;
    return $dbFile;
}

With a check in ImageManipulation::manipulate() on the canGenerate property.

chillu commented 7 years ago

Comment from Paul on https://github.com/silverstripe/silverstripe-asset-admin/issues/358

An observation more than an issue is that pagination only kicks in after the files area is reloaded. You can upload 1000 files and they will show all on one page, if you navigate away and back again they are limited on each page through pagination.

tractorcow commented 7 years ago

Copied and clarified from my linked comment at https://github.com/silverstripe/silverstripe-asset-admin/issues/358#issuecomment-286293854

We need an API that allows us to get existing manipulations (e.g. padded, resized) which ONLY get existing generated images. At the moment, for example, $Image.Pad(100, 100) gets a padded image, and if not regenerates it

We need $Image.ExistingOnly.Pad(100, 100) which only gets the padded version if found. This is important, because when we are retrieving a list of 30 images per request, we cannot risk loading 30 images into memory and lazy-generating thumbnails. Of those 30 images, a set number of those images may not have thumbnails, and we have some other process of re-generating those thumbnails (outside of the initial pagination query) e.g. just some other endpoint

Secondly, for the request to 50 images, we should (for thumbnails that exist) ensure that protected image thumbnails are sent via base64 encoded blobs in the parent request, to avoid unnecessary requests to the protectedfilecontroller.

so there are four requests that change:

initial upload request (triggers regeneration of the thumbnail). URL/blob isn't returned; Retain in-memory html5 generated thumbnail for these images (until refresh).
pagination request (graphql)
- protected images WITH thumbnails get blob returned via graphql. The blob size must be tightly controlled to limit graphql package response - Some NFR is required to be specified here. This could be base64 encoded, or otherwise compressed to limit size. This could be specified as a compatible data:image/jpeg;base64, XXXX string.
- public images WITH thumbnails get URL returned in the response instead.
- images without thumbnails get url to generate-url endpoint instead. See below on 4. for details.
request to protectedfilecontroller is no longer invoked by asset-admin (front-end only)
new endpoint (AssetAdmin) takes a file ID, regenerates the thumbnail, and returns either a url (public assets) or blob (protected assets). This endpoint acts only as a fallback, in case 1. fails to produce the thumbnail, or if it was deleted.

The goals of this are to:

Avoid hitting protectedfile controller many times
Avoid loading multiple images into memory in a single request
Make the UX slightly more responsive

Acceptance Criterea:
- When uploading a file, we do not re-request a server thumbnail if one has been generated locally that can be used.
- When an image is loaded into asset admin by graphql, thumbnail urls are only returned for public assets that have a thumbnail.
- When an image is loaded into asset admin by graphql, protected images that have thumbnails will have their content transmitted in the same request via a field.
- When an image has a missing thumbnail, a fallback handler is available that can lazy-generate a thumbnail for an image based on thumbnail and return it.
- Thumbnails for no more than one image are ever regenerated per request.
- We never return thumbnail information from the server to the client that is already available on the client.
- The client is never instructed to request thumbnail information from the server that is already available on the client.
- URLS to protected image handler are never utilised by asset admin.

chillu commented 7 years ago

Looking over the ACs in the description, and the additional/alternative ACs proposed by Damian in the comment above, I think we've covered this now.

We never return thumbnail information from the server to the client that is already available on the client.

Technically we're still returning thumbnail data from the server on reloads, it's just the first upload that keeps the data on the client (via the Upload JS lib). But I don't think that's a big deal, I don't want to get into custom cache management for data-urls and inline images (for protected assets)

Thumbnails for no more than one image are ever regenerated per request.

That's true in AssetAdmin now, but you can still overload a webserver by resizing a large amount of images in your own templates. That's no different from SilverStripe 3 through.

When an image has a missing thumbnail, a fallback handler is available that can lazy-generate a thumbnail for an image based on thumbnail and return it.

I've deleted thumbnails after they were auto-generated, and confirmed this is the case. It's not "lazy", but rather inlined in the GraphQL request. So that's an edge case which could still cause large PHP memory use and exec time within a GraphQL request. Asset folders upgraded from 3.x through MigrateFileTask won't have those thumbnails, but will auto-publish all files. So the GraphQL call will generate a max of ~20 image thumbnails by default (based on pagination). I've written this up as a smaller, focused task: https://github.com/silverstripe/silverstripe-asset-admin/issues/626

tractorcow commented 7 years ago

@chillu I've re-opened this because we found it necessary to fix thumbnail performance. @flamerohr and I have implemented this.

silverstripe / silverstripe-framework