humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/

Use persistent cache. #93

Open peterwilsoncc opened 4 years ago

peterwilsoncc commented 4 years ago

Follow-up to the discussion in PR #91.

Cavalcade uses a non-persistent cache for the cavalcade-jobs group. Since the runner triggers WP-CLI tasks within the plugin, invalidating on UPDATE, INSERT, and DELETE should be a relatively simple task.
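For illustration, one way such invalidation could work is WordPress's conventional last_changed pattern: bump a token on every write, and bake the token into cached query keys so a bump orphans every previously cached result. A minimal runnable sketch, with stub cache functions standing in for the real WP object cache (the helper names here are assumptions, not Cavalcade code):

```php
<?php
// Minimal in-memory stand-ins for the WP object cache API, so this sketch
// runs outside WordPress. In a real install these functions already exist.
$GLOBALS['cache'] = [];

function wp_cache_set( $key, $value, $group ) {
    $GLOBALS['cache'][ $group ][ $key ] = $value;
    return true;
}

function wp_cache_get( $key, $group ) {
    return $GLOBALS['cache'][ $group ][ $key ] ?? false;
}

// Hypothetical helper: call on every job UPDATE / INSERT / DELETE.
// Bumping the token invalidates all cached job queries at once.
function bump_jobs_cache() {
    wp_cache_set( 'last_changed', microtime(), 'cavalcade-jobs' );
}

// Hypothetical helper: cached query keys embed the current token, so keys
// built before a bump simply never match again.
function jobs_query_cache_key( $sql ) {
    $last_changed = wp_cache_get( 'last_changed', 'cavalcade-jobs' );
    return 'jobs:' . md5( $sql ) . ':' . $last_changed;
}
```

A stale entry is never deleted, only orphaned; the object cache evicts it naturally.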

Proposal:

roborourke commented 4 years ago

Re. making it a global group: given that the site ID is part of the SQL query, I think it makes sense. In the rare event anyone needs to query crons for a different site, it won't have any issues then.

peterwilsoncc commented 4 years ago

The reason I am in two minds about global or single site groups has to do with group invalidation.

If the group is global, any time a site on multisite updates a cron job, the queries and jobs for the other sites would be cleared too.

If the multisite runs a lot of sites, for example a large company with a presence in multiple countries, then the cache could be lost frequently. Despite the edge case of cross-site searching, I think it's potentially better as a single-site cache.

roborourke commented 4 years ago

Makes sense to me, thanks Peter 👍

roborourke commented 4 years ago

Per @rmccue's comment on #91

When it's used and updated in Cavalcade Runner, it doesn't have proper access to the object cache; in particular, we don't know (IIRC) the cache key generation algorithm. Because of this, the cache can't be authoritative.

So, we should keep the cache non-persistent by default; however, we could add a constant that, if defined, registers it as a persistent global group. This would benefit users who aren't using the runner but just want improved cron storage.
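A constant-based opt-in might look like the sketch below. The constant name is an assumption, not part of Cavalcade, and the two wp_cache_* functions are stubbed so the sketch runs outside WordPress:

```php
<?php
// Minimal stand-ins for the WP cache-group API so the sketch runs outside
// WordPress; in a real install these functions already exist.
$GLOBALS['global_groups'] = [];
$GLOBALS['non_persistent_groups'] = [];

function wp_cache_add_global_groups( $groups ) {
    $GLOBALS['global_groups'] = array_merge( $GLOBALS['global_groups'], (array) $groups );
}

function wp_cache_add_non_persistent_groups( $groups ) {
    $GLOBALS['non_persistent_groups'] = array_merge( $GLOBALS['non_persistent_groups'], (array) $groups );
}

// Hypothetical opt-in constant (name assumed); a site would define this in
// wp-config.php only when it is not running the Cavalcade Runner.
define( 'CAVALCADE_PERSISTENT_CACHE', true );

// Keep the jobs group non-persistent unless the site has explicitly opted in.
if ( defined( 'CAVALCADE_PERSISTENT_CACHE' ) && CAVALCADE_PERSISTENT_CACHE ) {
    wp_cache_add_global_groups( [ 'cavalcade-jobs' ] );
} else {
    wp_cache_add_non_persistent_groups( [ 'cavalcade-jobs' ] );
}
```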

The database updates in #95 should give a good enough performance boost for those using the runner.

peterwilsoncc commented 4 years ago

In order to use the cache in the runner, is it possible to do any of these:

And do this in a backward-compatible way, to avoid breaking everything if the plugin is updated without the runner, or vice versa?

rmccue commented 4 years ago

We'd essentially have to load all of WordPress to be able to use the object cache, which is highly likely to cause severe memory leaks and undefined behaviour, since WordPress isn't designed for long-running processes.

What we could potentially do is move these operations from the runner into the run wp-cli command, but we'll likely run into problems detecting fatal errors etc. with that, as the job would then be monitoring itself rather than being overwatched.

peterwilsoncc commented 4 years ago

Could Cavalcade the plugin use the runner's Runner.check_workers.job_completed hook to clear a persistent cache, similar to the method Altis Cloud uses for logging jobs?

https://github.com/humanmade/altis-cloud/blob/baf9587948b77040af58ff6d823fcc932150bd32/inc/cavalcade_runner_to_cloudwatch/namespace.php#L55-L69

rmccue commented 4 years ago

Altis Cloud is loaded differently; it's loaded as part of Altis during the very early stages of the bootstrap setup, and crucially, as part of the wp-config loading. Regular plugins (like Cavalcade) are loaded by WordPress, so aren't loaded into the Runner at all.

We could provide a special mode for Cavalcade that says "yes, it's loaded in early and the cache should be enabled", but that seems like a very messy solution that doesn't gain us much.

Curious to know if you're seeing actual perf problems from the lack of cache. I'm yet to see it take any significant amount of time, even on sites with a lot of jobs.

peterwilsoncc commented 4 years ago

Curious to know if you're seeing actual perf problems from the lack of cache. I'm yet to see it take any significant amount of time, even on sites with a lot of jobs.

We're not seeing any at the moment, but I'm concerned about the effect of the increased number of DB calls from using the new filters. Work has a few hundred journalists in the admin all day, so we don't get the benefit of full-page caching.

How about introducing a filter in register_cache_groups() to allow for setups similar to Altis's?

wp-config.php:

use HM\Cavalcade\Runner\Runner;

require_once __DIR__ . '/wordpress/wp-includes/plugin.php';

Runner::instance()->hooks->register( 'Runner.check_workers.job_completed', function() {
    wp_cache_set( 'last_changed', microtime(), 'cavalcade-jobs' );
} );

// Safe to opt in here: the runner hook above invalidates the cache on job completion.
add_filter( 'hm.cavalcade.use-persistent-cache', function() {
    return true;
} );

cavalcade/inc/namespace.php:

/**
 * Register the cache groups.
 */
function register_cache_groups() {
    wp_cache_add_global_groups( [ 'cavalcade' ] );

    /**
     * Docblock with clear warning about how to use this.
     *
     * Only return true if something outside WordPress (e.g. a runner hook)
     * handles invalidation; otherwise the cache can go stale.
     */
    $persistent_cache = apply_filters( 'hm.cavalcade.use-persistent-cache', false );
    if ( ! $persistent_cache ) {
        wp_cache_add_non_persistent_groups( [ 'cavalcade-jobs' ] );
    }
}

peterwilsoncc commented 3 years ago

WordPress.org was having a particularly high number of database hits recently.

While rubber-ducking a fix/workaround with Dion, I wondered whether a persistent cache could be cleared on the shutdown hook if DOING_CRON === true. That would allow the runner to continue with rescheduling, fatal monitoring, etc.

At this stage, this is a pretty random brain expulsion: do y'all reckon it's worth exploring?
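As a sketch of the shutdown idea above, assuming the jobs group is persistent, and with stub hook/cache functions so it runs outside WordPress (all stub names are assumptions):

```php
<?php
// Minimal hook/cache stand-ins so the sketch runs outside WordPress.
$GLOBALS['actions'] = [];
$GLOBALS['cache'] = [];

function add_action( $hook, $callback ) {
    $GLOBALS['actions'][ $hook ][] = $callback;
}

function do_action( $hook ) {
    foreach ( $GLOBALS['actions'][ $hook ] ?? [] as $callback ) {
        $callback();
    }
}

function wp_cache_set( $key, $value, $group ) {
    $GLOBALS['cache'][ $group ][ $key ] = $value;
}

// Simulate a cron request; WordPress defines this in wp-cron.php.
define( 'DOING_CRON', true );

// Bump the jobs cache's last_changed token at shutdown, but only for cron
// requests, so the runner's rescheduling and fatal monitoring are untouched.
add_action( 'shutdown', function () {
    if ( defined( 'DOING_CRON' ) && DOING_CRON ) {
        wp_cache_set( 'last_changed', microtime(), 'cavalcade-jobs' );
    }
} );

do_action( 'shutdown' ); // WordPress fires this at the end of every request.
```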

dd32 commented 3 years ago

In the case of WordPress.org, it was a surge of uncached hits that caused issues: it resulted in WordPress loading, and many cron jobs being checked for existence, on each request.

I've recently added some super simple caching to get_scheduled_event() so that wp_next_scheduled() wouldn't require a DB hit for every call. It's not significant by itself, but it certainly adds to issues when things snowball.

In the case of WordPress.org, caching the wp_next_scheduled() calls between pageloads can be the difference between 10 DB queries and zero (and as a result, no DB connection required), as the majority of page data is available entirely cached via the object cache (for the posts query, we use https://github.com/Automattic/advanced-post-cache).

For reference, here's a bit of code I've deployed on WordPress.org with a super-short 10s cache. It's not designed to be super efficient or cover all bases, just to limit the number of repetitive queries for jobs that have, in some cases, been running fine for literally years.

// Cache wp_next_scheduled() for a short time
add_filter( 'pre_get_scheduled_event', function( $filter, $hook, $args, $timestamp ) {
    static $stack = [];

    // Abort if another plugin has already done something to this filter
    if ( null !== $filter ) {
        return $filter;
    }

    $key = 'job:' . sha1( $hook . ':' . serialize( $args ) . ':' . $timestamp );
    $next = wp_cache_get( $key, 'cron-jobs-cache' );
    if ( $next && $next->timestamp > time() ) {
        return $next;
    }

    // Don't loop.
    if ( isset( $stack[ $key ] ) ) {
        return $filter;
    }

    $stack[ $key ] = true;

    $next = \wp_get_scheduled_event( $hook, $args, $timestamp );

    $cache_time = 10;
    // Don't cache if the event is in the next few seconds.
    if ( $next && $next->timestamp > ( time() + $cache_time ) ) {
        wp_cache_set( $key, $next, 'cron-jobs-cache', $cache_time );
    }

    unset( $stack[ $key ] );

    return $next;
}, 9, 4 );

// Clear the cache before it's unscheduled (TODO: This should potentially be after clearing, on priority 11)
add_filter( 'pre_unschedule_event', function( $filter, $timestamp, $hook, $args ) {
    $key = 'job:' . sha1( $hook . ':' . serialize( $args ) . ':' . $timestamp );
    wp_cache_delete( $key, 'cron-jobs-cache' );
    return $filter;
}, 9, 4 );

// Clear the cache after an event is scheduled
add_filter( 'pre_schedule_event', function( $filter, $event ) {
    $key = 'job:' . sha1( $event->hook . ':' . serialize( $event->args ) . ':' . $event->timestamp );
    wp_cache_delete( $key, 'cron-jobs-cache' );
    return $filter;
}, 11, 2 );

It seemed to do the trick and hasn't caused any visible side effects yet (such as duplicate scheduled tasks, etc.).