add cache iterator

coreykn commented 3 years ago

With the cache file creation changes introduced in version 1.7.0, the same type of individual file control was needed for existing cache files. This PR introduces a new way of handling the cache by adding the Cache_Enabler_Disk::cache_iterator() method.

Why `Cache_Enabler_Disk::cache_iterator()` is being introduced

The idea behind this change is to bring complete control when iterating over the cache to perform tasks and gather data, such as clearing the cache or obtaining the current cache size. This avoids having to create a new method that pulls cache objects and loops through them each time the cache needs to be handled in a specific way, for example, clearing any expired cached pages, which was a fantastic idea suggested in PR #223 by @stevegrunwell. Instead, all that is required is a URL to a potentially cached page and arguments. The iterator will then pull the specified cache, perform actions, and return the iterated data. A custom solution was chosen over a PHP iterator class because of the custom requirements. In addition, the custom solution is faster (although this would only make a real difference with very large cache directories).

One of the most important benefits of this approach, in my opinion, is consistency. Having one primary way to iterate over the cache ensures the handling is consistent across all present and future methods that rely on it. It avoids having to reconsider every case each time the cache needs to be handled.

How `Cache_Enabler_Disk::cache_iterator()` works

Cache_Enabler_Disk::cache_iterator() currently has two parameters, $url and $args. The $url parameter is a string containing a URL (now with or without a scheme) to a potentially cached page. The $args parameter is an array of the arguments to tell the iterator what to do. The $url parameter can contain a query string with the arguments set. This is converted to an array with wp_parse_str() and will overwrite any parameters set in $args. The arguments currently available to use are the following:

clear: Tells the iterator whether or not to clear the cache. This is an integer and accepts 0 or 1. Setting this to 1 will clear the cache being iterated over. Default is 0.
subpages: Tells the iterator what subpages to include and/or exclude. This is either an integer or an array. If an integer, it accepts 0 or 1. Setting this to 0 will not include any subpages. Setting this to 1 will include all subpages. If an array, it accepts specific subpages to include and/or exclude:
```
'subpages' => array(
    'include' => 'path/to/page,path/to/another/page',
    'exclude' => 'path/to/page/but/not/this/page',
)
```
The example above says to include both path/to/page and path/to/another/page but exclude path/to/page/but/not/this/page. This is very important when the cache needs to be retrieved in a specific way, such as the site cache for the root blog in a subdirectory network, where all subsites' blog paths need to be excluded, or when an archives page is cleared and all of its pagination pages need to be included. If only inclusions are provided, all subpages are ignored except those included. If only exclusions are provided, all subpages are included except those excluded. If both are provided, the inclusion behavior takes place while allowing what would otherwise be included to be excluded (Cache Enabler itself doesn’t use both inclusions and exclusions at this time). Default is 0.
keys: Tells the iterator what cache versions to include and/or exclude. This is either an integer or an array. If any integer is set, all cache versions will be included, meaning cache keys aren’t considered. If an array, it accepts specific cache keys to include and/or exclude:
```
'keys' => array(
    'include' => 'https',
    'exclude' => 'mobile',
)
```
The example above says to include cache versions that are https but not mobile. This brings control to the actual cache files that are iterated over. The current cache keys are https, http, mobile, webp, and gz. Default is 0.
expired: Tells the iterator whether or not to only include expired cache files. This is an integer and accepts 0 or 1. Setting this to 1 will only include expired cache files. Default is 0.
hooks: Tells the iterator what cache hooks to include and/or exclude. This is either an integer or an array. If an integer, it accepts 0 or 1. Setting this to 0 will not include any hooks. Setting this to 1 will include all hooks. If an array, it accepts specific hooks to include and/or exclude:
```
'hooks' => array(
    'include' => 'cache_enabler_page_cache_cleared',
)
```
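
The include/exclude rules described for subpages in the list above can be sketched as a small predicate. This is only an illustration, not Cache Enabler's actual code: the function name `subpage_is_selected()` is hypothetical, and matching on whole path segments by prefix (so that including a page also covers the pages beneath it) is an assumption drawn from the example, where the excluded page sits beneath an included one.

```php
<?php
/**
 * Hypothetical helper (not part of Cache Enabler): decide whether a subpage
 * path is selected by a 'subpages' argument.
 *
 * @param string    $path     Subpage path relative to the iterated URL.
 * @param int|array $subpages 0, 1, or array( 'include' => string[], 'exclude' => string[] ).
 */
function subpage_is_selected( string $path, $subpages ): bool {
    if ( ! is_array( $subpages ) ) {
        return (int) $subpages === 1; // 0 = no subpages, 1 = all subpages
    }

    $path = trim( $path, '/' );

    // Assumption: a rule matches the page itself and everything beneath it.
    $matches = static function ( array $rules ) use ( $path ): bool {
        foreach ( $rules as $rule ) {
            $rule = trim( $rule, '/' );
            if ( $path === $rule || strpos( $path, $rule . '/' ) === 0 ) {
                return true;
            }
        }
        return false;
    };

    $include = $subpages['include'] ?? array();
    $exclude = $subpages['exclude'] ?? array();

    // With inclusions, only matching paths pass; without them, every path passes.
    $selected = empty( $include ) ? true : $matches( $include );

    // Exclusions then remove paths that would otherwise be selected.
    return $selected && ! $matches( $exclude );
}
```

With the subpages example above, path/to/page and path/to/another/page would be selected, while path/to/page/but/not/this/page would not.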

All inclusions and exclusions can be given as either a comma (,) or pipe (|) separated list. Alternatively, an array of inclusions/exclusions can be used instead of a string. The associative array key names can be abbreviated to i for include and e for exclude, or, if an associative array is not used, index 0 means include and index 1 means exclude. This is helpful when passing the parameters in a URL, such as from the command line with WP-CLI. For example:

```
# clear www.example.com, www.example.com/path/to/page, and www.example.com/path/to/another/page but not www.example.com/path/to/page/but/not/this/page
# (the URL is quoted so the shell does not interpret ?, &, |, or the brackets)
$ wp cache-enabler clear --urls='www.example.com?subpages[include]=path/to/page|path/to/another/page&subpages[exclude]=path/to/page/but/not/this/page'
$ wp cache-enabler clear --urls='www.example.com?subpages[i]=path/to/page|path/to/another/page&subpages[e]=path/to/page/but/not/this/page'
$ wp cache-enabler clear --urls='www.example.com?subpages[]=path/to/page|path/to/another/page&subpages[]=path/to/page/but/not/this/page'

# clear all https cached versions that are not mobile for www.example.com/page
$ wp cache-enabler clear --urls='www.example.com/page?keys[i]=https&keys[e]=mobile'
```

A single delimiter-separated string was not used for this because, when the value is passed through a query string, we wouldn’t be able to tell whether 1 is a page path or the instruction to clear all subpages.
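
To make the accepted shorthands concrete, here is a minimal sketch of how the query string and the include/exclude spellings described above could be normalized. It is not the code from this PR: `normalize_inclusions()` and `split_url_and_args()` are hypothetical names, and PHP's plain `parse_str()` stands in for `wp_parse_str()` so the sketch can run outside WordPress.

```php
<?php
/**
 * Hypothetical helper: normalize one argument (e.g. 'subpages' or 'keys') into
 * array( 'include' => string[], 'exclude' => string[] ). Non-array values are
 * returned as an integer.
 */
function normalize_inclusions( $value ) {
    if ( ! is_array( $value ) ) {
        return (int) $value;
    }

    // Accepted key spellings, as listed in the text above.
    $aliases = array(
        'include' => 'include', 'i' => 'include', 0 => 'include',
        'exclude' => 'exclude', 'e' => 'exclude', 1 => 'exclude',
    );

    $normalized = array( 'include' => array(), 'exclude' => array() );

    foreach ( $value as $key => $list ) {
        if ( ! isset( $aliases[ $key ] ) ) {
            continue; // ignore unknown keys
        }

        // A list is either an array or a comma/pipe separated string.
        $items = is_array( $list ) ? $list : preg_split( '/[,|]/', (string) $list );
        $items = array_values( array_filter( array_map( 'trim', $items ), 'strlen' ) );

        $normalized[ $aliases[ $key ] ] = $items;
    }

    return $normalized;
}

/**
 * Hypothetical helper: split a URL into its base and its arguments.
 * Arguments in the query string overwrite those passed in $args.
 */
function split_url_and_args( string $url, array $args = array() ): array {
    $parts = explode( '?', $url, 2 );

    if ( isset( $parts[1] ) ) {
        parse_str( $parts[1], $query_args );
        $args = array_merge( $args, $query_args );
    }

    foreach ( array( 'subpages', 'keys', 'hooks' ) as $name ) {
        if ( isset( $args[ $name ] ) ) {
            $args[ $name ] = normalize_inclusions( $args[ $name ] );
        }
    }

    return array( $parts[0], $args );
}
```

Under these assumptions, `subpages[include]=…`, `subpages[i]=…`, and `subpages[]=…` all end up under the same `include` key, which is why the three WP-CLI commands above are equivalent.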

After the cache has been iterated over it will return an associative array with the related cache data:

```
array(
    'index' => array(
        '/full/path/to/cached/page' => array(
            'url'      => 'https://www.example.com/path/to/cached/page',
            'id'       => (int) $post_id,
            'versions' => array( 'https-index.html' => (int) $file_size ),
        ),
    ),
    'size' => (int) $index_size,
);
```

If the cache was cleared the cache size integers will be negative.
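
As a rough illustration of working with that return value, the sketch below summarizes an array of the shape shown above. The helper name `summarize_cache()` and the sample values are made up for this example. The only behavior it relies on is what is stated here: each page lists its cache versions with a file size, and the size is negative when the cache was cleared.

```php
<?php
/**
 * Hypothetical helper: summarize the array returned by the cache iterator.
 *
 * @param array $cache Array with 'index' and 'size' keys, as shown above.
 */
function summarize_cache( array $cache ): array {
    $files = 0;

    // Each page in the index lists its cache versions (one file per version).
    foreach ( $cache['index'] as $page ) {
        $files += count( $page['versions'] );
    }

    return array(
        'pages'   => count( $cache['index'] ),
        'files'   => $files,
        'bytes'   => abs( $cache['size'] ), // sizes are negative after a clear
        'cleared' => $cache['size'] < 0,
    );
}

// Sample data only; the paths, ID, and sizes are invented for illustration.
$cache = array(
    'index' => array(
        '/full/path/to/cached/page' => array(
            'url'      => 'https://www.example.com/path/to/cached/page',
            'id'       => 42,
            'versions' => array( 'https-index.html' => -2048 ),
        ),
    ),
    'size'  => -2048,
);

$summary = summarize_cache( $cache );
```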

Other notable changes

While the primary change was introducing Cache_Enabler_Disk::cache_iterator(), this had a cascading effect across the code base. It allowed for new features to be added and current methods to be simplified. Furthermore, a cache cleared hook bug was found and fixed.

No longer need system class properties for cache clearing. This means Cache_Enabler::$fire_page_cache_cleared_hook and Cache_Enabler_Disk::$dir_cleared have been removed.
Add the Cache_Enabler::schedule_events() method to manage scheduling cron events. The only event currently being scheduled is clearing the expired cache every hour if it’s set to expire. Credit for this behavior goes to @stevegrunwell.
Add the Cache_Enabler::on_cache_created_cleared() method to handle actions after the cache has been either created or cleared. Currently this keeps the cache size up to date, in contrast to the old way. That means the cache size is now updated dynamically as a page is created or cleared, allowing a “live” cache size preview. (I wasn’t able to find any negative performance effects from this change as the database is already queried when a page is generated.) This should reduce how often Cache Enabler scans the cache directory, which should help a lot for large cache directories on a server with low resources. It also paves the way to allow the cache to either be fully or partially cleared if the size reaches a certain level.
Remove the Cache_Enabler::get_cache_size_transient_name() method. This was originally used because it had a dynamic value from the current blog ID, but this was unnecessary and removed in version 1.6.0. There isn’t really a need for this anymore.
Add the cache_enabler_page_cache_created action hook after a cache file has been created.
Add the $cache_cleared_index parameter to both the cache_enabler_page_cache_cleared and cache_enabler_site_cache_cleared hooks. This is the index array returned from Cache_Enabler_Disk::cache_iterator().
Fix the cache_enabler_site_cache_cleared action hook not firing if the root blog was cleared but a subsite’s cache still existed. To avoid this, the cache index for the current site is now checked to ensure that it’s empty, instead of checking whether the site cache directory exists.

Feedback

I understand this is a big change, which is something I try to avoid but at times it's needed. I'm very excited about this addition as I believe it makes handling the cache a breeze in comparison to the old way. If anyone has any feedback, or has the time to test these changes, that is always appreciated.

erikdemarco commented 3 years ago

@coreykn This is a very nice idea, using the cache iterator to clean the expired cache via cron. But does it scale?

For example, take a site with a hundred thousand posts cached, expiry_time set to 1 hour, and the PHP time_limit set to 30 seconds (the PHP default). Will it finish deleting all of those within that timeframe?

Deleting all of those pages at the same time doesn't feel like a very friendly solution for server resources. Does this PR consider this scale? Does it have batch processing capabilities? Is there any check for whether the cleaning process was successful?

The old way of clearing the cache (check and clear only when the URL is requested) is much friendlier to server resources and much more stable. This is also how nginx deletes expired content, so this way of clearing cache files doesn't need to be doubted anymore. It has been battle tested.

coreykn commented 3 years ago

@erikdemarco, in terms of scale, there are eventual bottlenecks in the cache iterator where issues could arise. From what I found that primarily comes down to the limitations of the server Cache Enabler is installed on, like for the time limit you've mentioned, and/or PHP itself for really, really deeply nested pages. The cache iterator still uses scandir(), same as in version 1.7.2.

The expiry time is per file as it checks the time that the individual cache file was last modified (each cache version is a different file). That means for the hypothetical scenario that you're describing it would require that all 100,000 pages be cached at nearly the same time. While I performed many tests while creating this, I have not tested the cache iterator in the way that you've described. We do however welcome and encourage any tests of your own that you'd like to perform. The more ways that we can get it tested the better. 🙂
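
To illustrate the per-file check described above, here is a minimal sketch, not the code from this PR. The function name `cache_file_is_expired()`, the expiry window expressed in seconds, and the `>=` comparison are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper: a cache file counts as expired once its last-modified
 * time is at least $expiry_seconds in the past.
 *
 * @param string   $file           Full path to a single cache file (one cache version).
 * @param int      $expiry_seconds Expiry window in seconds (assumed unit).
 * @param int|null $now            Current Unix time; defaults to time().
 */
function cache_file_is_expired( string $file, int $expiry_seconds, ?int $now = null ): bool {
    clearstatcache( true, $file ); // avoid a stale cached modification time

    if ( ! is_file( $file ) ) {
        return false; // nothing on disk, so nothing to expire
    }

    $now = $now ?? time();

    return ( $now - filemtime( $file ) ) >= $expiry_seconds;
}
```

Because the check only looks at each file's own modification time, pages cached at different times also expire at different times.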

The "old" way of overwriting an expired cache file still exists. If a page is requested and its cache file has expired, Cache Enabler will not deliver the cached file. Instead, Cache Enabler will generate a new cache file and overwrite the existing file. This PR just added a scheduled WordPress cron that will remove expired cached files on an hourly basis. In terms of server resources, this behavior is no different than what already exists in version 1.7.2 as the cache is regularly scanned for the cache size. Controlling this cron can be done with a third party plugin. Adding a Cache Enabler filter hook to disable this type of behavior is an option as well.

If you have any feedback or suggestions to improve this new cache iterator please let us know.

coreykn commented 3 years ago

It also paves the way to allow the cache to either be fully or partially cleared if the size reaches a certain level.

For example:

```
add_action( 'set_transient_cache_enabler_cache_size', 'on_set_transient_cache_enabler_cache_size' );

function on_set_transient_cache_enabler_cache_size( $cache_size ) {

    $max_cache_size = 500000000; // 500 MB

    if ( $cache_size > $max_cache_size ) {
        // do something magical
    }
}
```

keycdn / cache-enabler

add cache iterator #237
