keycdn / cache-enabler

A lightweight caching plugin for WordPress that makes your website faster by generating static HTML files.
https://wordpress.org/plugins/cache-enabler/

add cache iterator #237

Closed coreykn closed 3 years ago

coreykn commented 3 years ago

With the cache file creation changes introduced in version 1.7.0, the same type of individual file control was needed for existing cache files. This PR introduces a new way of handling the cache by adding the Cache_Enabler_Disk::cache_iterator() method.

Why Cache_Enabler_Disk::cache_iterator() is being introduced

The idea behind this change is to provide complete control when iterating over the cache to perform tasks and gather data, such as clearing the cache or obtaining the current cache size. This avoids having to create a new method that pulls cache objects and loops through them each time the cache needs to be handled in a specific way, for example when clearing any expired cached pages, which was a fantastic idea suggested in PR #223 by @stevegrunwell. Instead, all that is required is a URL to a potentially cached page and arguments. The iterator will then pull the specified cache, perform actions, and return the iterated data. A custom solution was chosen over an iterator class in PHP because of the custom requirements. In addition, the custom solution is faster, although that would only make a noticeable difference with very large cache directories.

One of the most important benefits of this approach, in my opinion, is consistency. Having one primary way to iterate over the cache ensures the handling is consistent across all present and future methods that use it. It also means the edge cases only have to be considered once, in the iterator, rather than each time the cache needs to be handled.

How Cache_Enabler_Disk::cache_iterator() works

Cache_Enabler_Disk::cache_iterator() currently has two parameters, $url and $args. The $url parameter is a string containing a URL (now with or without a scheme) to a potentially cached page. The $args parameter is an array of arguments that tell the iterator what to do. The $url parameter can also contain a query string with the arguments set. This is converted to an array with wp_parse_str() and overrides any matching arguments set in $args. The arguments currently available to use are the following:

All inclusions and exclusions can be given as either a comma (,) or pipe (|) separated list. Alternatively, an array of inclusions/exclusions can be used instead of a string. The associative array key names can be abbreviated to i for include and e for exclude, or 0 for include and 1 for exclude if an associative array is not used. This is helpful when passing the parameters in a URL, like from the command line with WP-CLI, for example:

# clear www.example.com, www.example.com/path/to/page, and www.example.com/path/to/another/page but not www.example.com/path/to/page/but/not/this/page

$ wp cache-enabler clear --urls='www.example.com?subpages[include]=path/to/page|path/to/another/page&subpages[exclude]=path/to/page/but/not/this/page'
$ wp cache-enabler clear --urls='www.example.com?subpages[i]=path/to/page|path/to/another/page&subpages[e]=path/to/page/but/not/this/page'
$ wp cache-enabler clear --urls='www.example.com?subpages[]=path/to/page|path/to/another/page&subpages[]=path/to/page/but/not/this/page'

# clear all `https` cached versions that are not `mobile` for www.example.com/page

$ wp cache-enabler clear --urls='www.example.com/page?keys[i]=https&keys[e]=mobile'

A single delimiter-separated string wasn't used for this because, when the value is passed through a query string, we wouldn't know whether 1 is a page path or a request to clear all subpages.
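To make the accepted formats concrete, here is a minimal sketch of how a URL with arguments in its query string could be split up and how an include/exclude value could be normalized. Both functions, example_split_url_args() and example_normalize_filter(), are hypothetical names made up for illustration and are not the plugin's code. The sketch uses PHP's native parse_str() in place of wp_parse_str() so it runs without WordPress, and it assumes a bare string means inclusions only.

```php
<?php
/**
 * Hypothetical helper: separate a URL from the arguments in its query string.
 * Query string values take precedence over same-named values in $args.
 */
function example_split_url_args( string $url, array $args = array() ): array {
    $query_args = array();
    $query_pos  = strpos( $url, '?' );

    if ( $query_pos !== false ) {
        // parse_str() expands bracket syntax (subpages[i]=...) into nested arrays.
        parse_str( substr( $url, $query_pos + 1 ), $query_args );
        $url = substr( $url, 0, $query_pos );
    }

    return array( 'url' => $url, 'args' => array_merge( $args, $query_args ) );
}

/**
 * Hypothetical helper: accept a delimited string or an array keyed by
 * include/exclude, i/e, or 0/1 and return both lists as arrays.
 */
function example_normalize_filter( $filter ): array {
    $normalized = array( 'include' => array(), 'exclude' => array() );
    $aliases    = array(
        'include' => 'include', 'i' => 'include', 0 => 'include',
        'exclude' => 'exclude', 'e' => 'exclude', 1 => 'exclude',
    );

    // Assumption for this sketch: a bare string lists inclusions only.
    if ( is_string( $filter ) ) {
        $filter = array( 'include' => $filter );
    }

    foreach ( (array) $filter as $key => $value ) {
        if ( ! isset( $aliases[ $key ] ) ) {
            continue; // Ignore keys that are not a known alias.
        }
        // Split strings on commas or pipes; arrays are used as given.
        $items = is_array( $value ) ? $value : preg_split( '/[,|]/', (string) $value );
        $items = array_values( array_filter( array_map( 'trim', $items ), 'strlen' ) );

        $type                = $aliases[ $key ];
        $normalized[ $type ] = array_merge( $normalized[ $type ], $items );
    }

    return $normalized;
}

$parsed   = example_split_url_args( 'www.example.com?subpages[i]=path/to/page|path/to/another/page&subpages[e]=path/to/page/but/not/this/page' );
$subpages = example_normalize_filter( $parsed['args']['subpages'] );
print_r( $subpages );
```

With this sketch, the include/exclude, i/e, and subpages[] forms shown in the commands above all normalize to the same two lists.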

After the cache has been iterated over it will return an associative array with the related cache data:

array(
    'index' => array(
        '/full/path/to/cached/page' => array(
            'url'      => 'https://www.example.com/path/to/cached/page',
            'id'       => (int) $post_id,
            'versions' => array( 'https-index.html' => (int) $file_size ),
        ),
    ),
    'size' => (int) $index_size,
);

If the cache was cleared, the cache size integers will be negative.
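As a rough illustration of how the returned data could be consumed, the sketch below walks an array shaped like the one above and reports how many bytes were cleared per URL. The helper name example_summarize_cleared() is hypothetical, and the sample ID, file names, and sizes are made up.

```php
<?php
// Sample data in the shape described above; every value here is invented.
$cache = array(
    'index' => array(
        '/full/path/to/cached/page' => array(
            'url'      => 'https://www.example.com/path/to/cached/page',
            'id'       => 42,
            'versions' => array(
                'https-index.html'    => -2048,
                'https-index.html.gz' => -512,
            ),
        ),
    ),
    'size'  => -2560,
);

/**
 * Hypothetical helper: map each URL to the number of bytes cleared for it.
 * Negative sizes are read as "this file was cleared", per the note above.
 */
function example_summarize_cleared( array $cache ): array {
    $summary = array();

    foreach ( $cache['index'] as $entry ) {
        $bytes = array_sum( $entry['versions'] );

        if ( $bytes < 0 ) {
            $summary[ $entry['url'] ] = abs( $bytes );
        }
    }

    return $summary;
}

print_r( example_summarize_cleared( $cache ) );
```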

Other notable changes

While the primary change was introducing Cache_Enabler_Disk::cache_iterator(), this had a cascading effect across the code base. It allowed for new features to be added and current methods to be simplified. Furthermore, a cache cleared hook bug was found and fixed.

Feedback

I understand this is a big change, which is something I try to avoid but at times it's needed. I'm very excited about this addition as I believe it makes handling the cache a breeze in comparison to the old way. If anyone has any feedback, or has the time to test these changes, that is always appreciated.

erikdemarco commented 3 years ago

@coreykn Using the cache iterator to clean expired cache via cron is a very, very nice idea. But does it scale?

For example, take a site with a hundred thousand cached posts, expiry_time set to 1 hour, and the PHP time limit set to 30 seconds (the PHP default setting). Will it finish deleting all of those within that time frame?

Deleting all those pages at the same time doesn't feel like a very friendly solution for server resources. Does this PR take this scale into account? Does it have batch processing capabilities? Is there any check on whether the cleaning process was successful?

The old way of clearing the cache (check and clear only when the URL is requested) is much friendlier to server resources and much more stable. This is also how nginx deletes expired content, so this way of clearing cache files no longer needs to be doubted. It has been battle tested.

coreykn commented 3 years ago

@erikdemarco, in terms of scale, there are eventual bottlenecks in the cache iterator where issues could arise. From what I found, that primarily comes down to the limitations of the server Cache Enabler is installed on, like the time limit you've mentioned, and/or PHP itself for really, really deeply nested pages. The cache iterator still uses scandir(), same as in version 1.7.2.
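For reference, a scandir()-based recursive walk looks roughly like the sketch below. This is only an illustration of the general technique, not the iterator added in this PR, and example_dir_size() is a hypothetical name. Each nested directory adds one level of recursion, which is where very deeply nested pages could become a limit.

```php
<?php
/**
 * Hypothetical helper: total size in bytes of all files below $dir,
 * found by recursing with scandir().
 */
function example_dir_size( string $dir ): int {
    $entries = @scandir( $dir );

    if ( $entries === false ) {
        return 0; // Directory is missing or unreadable.
    }

    $size = 0;

    foreach ( $entries as $entry ) {
        if ( $entry === '.' || $entry === '..' ) {
            continue;
        }

        $path = $dir . DIRECTORY_SEPARATOR . $entry;

        // One extra call on the stack per directory level.
        $size += is_dir( $path ) ? example_dir_size( $path ) : (int) filesize( $path );
    }

    return $size;
}
```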

The expiry time is per file as it checks the time that the individual cache file was last modified (each cache version is a different file). That means for the hypothetical scenario that you're describing it would require that all 100,000 pages be cached at nearly the same time. While I performed many tests while creating this, I have not tested the cache iterator in the way that you've described. We do however welcome and encourage any tests of your own that you'd like to perform. The more ways that we can get it tested the better. 🙂
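A minimal sketch of a per-file expiry check based on the last modified time, as described above. The function name example_is_expired() and its parameters are hypothetical and only illustrate the idea. They are not the plugin's actual code.

```php
<?php
/**
 * Hypothetical helper: a cache file counts as expired when its last
 * modification time is more than $expiry_seconds in the past.
 */
function example_is_expired( string $file, int $expiry_seconds, ?int $now = null ): bool {
    clearstatcache( true, $file ); // Avoid a stale cached modification time.
    $modified = @filemtime( $file );

    if ( $modified === false ) {
        return false; // Missing or unreadable file, so nothing to expire.
    }

    $now = $now ?? time();

    return ( $now - $modified ) > $expiry_seconds;
}

// Usage: a temporary file backdated by two hours, checked against a one hour expiry.
$file = tempnam( sys_get_temp_dir(), 'ce-example-' );
touch( $file, time() - 2 * 3600 );
var_dump( example_is_expired( $file, 3600 ) );
unlink( $file );
```

Because the check is per file, each cache version of a page would expire on its own schedule, depending on when that file was last written.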

The "old" way of overwriting an expired cache file still exists. If a page is requested and its cached file has expired, Cache Enabler will not deliver the cached file. Instead, Cache Enabler will generate a new cache file and overwrite the existing file. This PR just added a scheduled WordPress cron that will remove expired cached files on an hourly basis. In terms of server resources, this behavior is no different from what already exists in version 1.7.2, as the cache is regularly scanned for the cache size. Controlling this cron can be done with a third-party plugin. Adding a Cache Enabler filter hook to disable this type of behavior is an option as well.

If you have any feedback or suggestions to improve this new cache iterator please let us know.

coreykn commented 3 years ago

It also paves the way to allow the cache to either be fully or partially cleared if the size reaches a certain level.

For example:

add_action( 'set_transient_cache_enabler_cache_size', 'on_set_transient_cache_enabler_cache_size' );

function on_set_transient_cache_enabler_cache_size( $cache_size ) {

    $max_cache_size = 500000000; // 500 MB

    if ( $cache_size > $max_cache_size ) {
        // do something magical
    }
}