Hi @coreykn,
Just tested this last PR. The new approach to generating cache files is really brilliant and more concise.
However, it has a major downside when used with advanced rewrite rules (#176). Take the following scenario:
/page/
-> /page/https-index.html file is created in the cache directory

Thank you! That's great you've been able to test this PR, @nlemoine. That is always helpful and appreciated, as it allows further testing and feedback before a new version is released.
I agree that is a current downside. However, the advanced configuration that allows the server to deliver the cached files, bypassing PHP entirely, can simply be updated to account for this change. I plan on covering this in the docs and eventually automatically generating the snippet based on the settings, as requested in #176. The idea is that the new server configuration would be quite similar to how the plugin itself generates the cache file path, where certain environment variables are set as cache keys depending on what the request accepts. If the server doesn't find the file, it would continue to PHP, allowing WordPress to start, which would then allow Cache Enabler to generate the requested page.
Thanks for your feedback, @coreykn.
I'm also using a cache preloader that crawls URLs to generate the cache files for all versions (html, gz). This is now impossible in a single run; I have to run it 3 times with different Accept-Encoding headers.
Would you be open to setting up some filters/actions to give developers the option to generate all files if desired? I think that's still something that might be needed.
You're welcome! That was a consideration when making this change:
This will impact anyone using cache warming techniques because a request for each variant to cache will now have to be made.
Always happy to consider any request. Did you have something specific in mind for that type of support? Off the top of my head I'm unsure how it'd be implemented by just adding one or more hooks. My preference is page caching based on the request as the traffic to a website handles this process quite efficiently. If you want to simulate that traffic to have pages cached, then the simulated traffic would have to be adjusted to account for whichever variants you want cached.
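For example (a minimal sketch, not part of the plugin; the URL and the variant list are hypothetical, and it assumes a WordPress context where wp_remote_get() is available), a warm-up run would now need one request per accepted encoding:

```php
// Hypothetical warm-up sketch: one request per encoding variant, so each
// request lets the plugin create only the cache file for that variant.
$url       = 'https://example.com/page/';   // hypothetical URL
$encodings = array( '', 'gzip' );           // uncompressed and Gzip variants

foreach ( $encodings as $encoding ) {
    $args = array( 'headers' => array() );
    if ( '' !== $encoding ) {
        $args['headers']['Accept-Encoding'] = $encoding;
    }
    wp_remote_get( $url, $args );
}
```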
My preference is page caching based on the request as the traffic to a website handles this process quite efficiently.
I also think it's a smart approach, but it does not fit all use cases, and I believe it would be great to leave developers the freedom to handle cache file generation another way.
On sites with a large number of pages, some pages might not be hit often enough to keep the cache warm. As you probably know, performance is more and more important for SEO. Having an always warmed-up cache is sometimes needed.
I'm also generating a Brotli version, which leads to 3 potential variants. I wonder if this approach still brings a benefit. Take the following comparison.
On-demand scenario:

/page/
-> /page/https-index.html
(cache miss, "slow" because served from PHP/MySQL)

/page/
-> /page/https-index.html.gz
(cache miss, "slow" because served from PHP/MySQL)

/page/
-> /page/https-index.html.br
(cache miss, "slow" because served from PHP/MySQL)

To cover every variant, we hit the server 3 times, and 3 different visitors had a "slow" experience (it's of course a relative "slow"; I mean not as fast as a cache hit, especially in the WordPress world).
All-variants scenario:

/page/
-> /page/https-index.html{.gz,.br}
(cache miss, "slow" because served from WordPress, maybe a bit slower than in the "on-demand" scenario but probably not much slower)

In this case, only one visitor hit the server and got a "slow" experience. Subsequent visitors, whatever variant they support, will be served a cached page.
Which is more performant: hitting the server 2/3 times to generate one cache file per request, or hitting the server once to generate 2/3 cache files? That's something that would be interesting to benchmark.
Did you have something specific in mind for that type of support? Off the top of my head I'm unsure how it'd be implemented by just adding one or more hooks
It's not that easy indeed. I tried to tackle it with only one hook and without changing too many lines of code. Here's my first shot at this: https://github.com/keycdn/cache-enabler/compare/master...nlemoine:create-cache-pages
I had to refactor a bit. I split the create_cache_file method to extract a "pure" file creation method.
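Roughly, that extracted step just writes a map of file paths to contents; a hypothetical sketch (names are illustrative, not taken from the branch):

```php
// Illustrative only: a "pure" creation step writes whatever path => contents
// pairs it is given, leaving the decision of which files to create to the caller.
function create_files( array $files ) {
    foreach ( $files as $file_path => $file_contents ) {
        file_put_contents( $file_path, $file_contents );
    }
}
```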
A filter example:
add_filter('cache_enabler_cache_files', function($files, $page_contents_raw, $cache_path) {
    $files[$cache_path . '/https-index.html'] = $page_contents_raw;
    return $files;
}, 10, 3);
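For instance (assuming this proposed cache_enabler_cache_files filter and the PECL brotli extension), a Brotli variant could be added the same way:

```php
// Hypothetical: also write a Brotli-compressed variant when cache files are created.
// brotli_compress() comes from the PECL brotli extension.
add_filter('cache_enabler_cache_files', function($files, $page_contents_raw, $cache_path) {
    if (function_exists('brotli_compress')) {
        $files[$cache_path . '/https-index.html.br'] = brotli_compress($page_contents_raw);
    }
    return $files;
}, 10, 3);
```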
Let me know what you think.
Thank you for your detailed feedback, @nlemoine. That is very helpful. You're definitely right, the default handling will not fit all use cases. Let me review your approach in more detail and consider how it could be introduced. If we don't allow all variants to be created at once, I believe we'll likely at least add a way for both the compressed and uncompressed versions of a variant to be created, as some server configurations expect this. Updates to follow.
Just realized while setting some options in the admin that what you call a "variant" is webp, gzip, or mobile.
In my mind, a "variant" is only a different file encoding: uncompressed, gzip, brotli.
Update cache handling to get the cache file based on the request, mostly looking at the Cache Enabler settings and request headers. This updates the cache handling for both new and potentially cached pages by generating the required cache file through cache keys. This will be much better going forward as it simplifies handling current cache variants and adding additional cache variants in the future. The previous way of handling this would have meant many methods and convoluted logic being added for each additional cache variant, which would have only led to more unnecessary complexity.
The cache keys are obtained in the following ways:

scheme: Scheme-based caching was introduced in PR #94 and released in version 1.4.0. It was then later updated in PR #98, PR #109, and PR #141. Change this to use what the WordPress is_ssl() function uses, with the added ability to also check the X-Forwarded-Proto or X-Forwarded-Scheme request headers.

device: Device-based caching will be introduced in this PR and released in version 1.7.0. This will allow a separate mobile cache to be made based on the User-Agent request header. The values checked have been taken from the WordPress wp_is_mobile() function (#160).

webp: WebP-based caching was released in version 1.0.2. This still checks the Accept request header, but will now look for image/webp instead of just webp.

compression: Compression-based caching for Gzip has been supported since the initial release. This still checks the Accept-Encoding request header for gzip. This setup will make it easy for other compression algorithms to be introduced in the future.

This change means that only the requested cache file is created if it does not exist yet, instead of all variants at once. This will reduce unnecessary memory usage, especially if many cache variants are used, and will reduce storing unnecessary cache variants that may never even be requested. (This will impact anyone using cache warming techniques because a request for each variant to cache will now have to be made.) This also ensures the client gets what it has requested instead of falling back to the default HTML file when the requested variant does not exist yet but the default HTML file does. (That was a pretty rare edge case, as it would only previously occur if a cache variant setting was enabled after cached pages were already generated and the cache was not cleared.)
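To make the cache keys above concrete, here is a simplified sketch (it reads $_SERVER directly and skips the plugin's settings checks, so treat it as an approximation rather than the actual implementation):

```php
// Simplified approximation of how the cache keys map a request to a file name.
// The real handling also honors the plugin settings, uses is_ssl()/wp_is_mobile()
// style checks, and reads headers with the fallback described below.
$scheme      = ( isset( $_SERVER['HTTPS'] ) && 'off' !== $_SERVER['HTTPS'] ) ? 'https' : 'http';
$webp        = ( isset( $_SERVER['HTTP_ACCEPT'] ) && false !== strpos( $_SERVER['HTTP_ACCEPT'], 'image/webp' ) ) ? '-webp' : '';
$compression = ( isset( $_SERVER['HTTP_ACCEPT_ENCODING'] ) && false !== strpos( $_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip' ) ) ? '.gz' : '';

// e.g. https-index-webp.html.gz for an HTTPS request that accepts WebP and Gzip
// (the device key is omitted here since its position in the file name is not shown above).
$cache_file_name = $scheme . '-index' . $webp . '.html' . $compression;
```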
Getting the request headers has been updated, as in some cases apache_request_headers() was a function that existed but returned an empty array or false. This led to some cache variants not being delivered even when accepted by the client. This new way will try to get the request header from apache_request_headers() and, if it is not set, fall back to the $_SERVER variable. If neither is set, an empty string is used to prevent having to always check whether it is set. Only checking $_SERVER variables was not used because some environments do not provide all of these. This way should get the specific request header values we need across many different configurations, if they are available.
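A minimal sketch of that fallback order (the helper name is hypothetical, not the actual method name):

```php
// Hypothetical helper illustrating the fallback: apache_request_headers() first,
// then the matching $_SERVER variable, then an empty string.
function get_request_header( $name ) {
    if ( function_exists( 'apache_request_headers' ) ) {
        $headers = apache_request_headers();
        if ( is_array( $headers ) && ! empty( $headers[ $name ] ) ) {
            return $headers[ $name ];
        }
    }

    $server_key = 'HTTP_' . str_replace( '-', '_', strtoupper( $name ) );
    if ( ! empty( $_SERVER[ $server_key ] ) ) {
        return $_SERVER[ $server_key ];
    }

    return '';
}
```

For example, get_request_header( 'Accept-Encoding' ) would fall back to $_SERVER['HTTP_ACCEPT_ENCODING'].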
The cache signature will now include the file name, such as https-index-webp.html.gz, instead of https webp gzip, because of this new handling. We can filter this and update it to the old format if preferred. (I do not have a preference; this way just reduces the overhead.) The cache signature will now also include the generated date in HTTP-date format (e.g. D, d M Y H:i:s GMT) instead of d.m.Y H:i:s.
Update Cache_Enabler_Disk::get_settings_file_name() to parse the $_SERVER['REQUEST_URI'] value and get the PHP_URL_PATH when in the subdirectory network fallback, to prevent a warning from occurring in the following preg_grep() function when the path contained a query string. The ? from the query string would cause a regular expression error if the path had a trailing slash (preceding token is not quantifiable, like in \.(path|to|page|?param=value)).

For a future note, error handling when creating directories and compressing page contents needs to be improved. wp_die() for failed cache file directory creation has been removed as it did not work and would have caused an output buffer error instead. As far as I can tell, this would have occurred in all past versions. It would not occur when trying to create the settings file, but it has been removed there as well since it is not a graceful way to handle this type of error. If Gzip compression fails, the page will not be created. These all rarely occur, if they ever do, as no issues have been opened around these scenarios, but in the event they do, the user should receive additional insight into what is occurring so it can be resolved.

Closes #160