humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/

Impending DST cutover caused jobs to run continuously #74

Open mackensen opened 5 years ago

mackensen commented 5 years ago

Last evening, at exactly 9 PM EDT (1 AM GMT), hourly jobs started running continuously, instead of being scheduled for the next hour. This caused a major load event and we had to stop Cavalcade everywhere. Sample log showing the issue:

20413   wp_privacy_delete_old_export_files      2018-11-03 19:14:32     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 20:14:32     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:34     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:36     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:37     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:39     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:40     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:41     completed
20413   wp_privacy_delete_old_export_files      2018-11-03 21:14:42     completed

This morning, it seems like we can get Cavalcade back to a healthy state by running wp site list --field=url | xargs -n 1 -I % wp cron event run --due-now --url=% to bring all jobs into the present. The WordPress site in question uses UTC internally; the host servers (both application and database) are on EDT/EST.

discoinfiltrator commented 5 years ago

@mackensen , similar thing happened here, took down our sites. Running the command you listed now - thanks for sharing that. Everything looking good for you since?

mackensen commented 5 years ago

@discoinfiltrator more or less (so far); noticed a related issue where jobs persisted for deleted sites and I'm clearing those out manually.

rmccue commented 5 years ago

Fascinating! Cavalcade should be using UTC timestamps throughout the entire system, as WordPress does, so I'm unsure why this is occurring.

What is your timezone in PHP's settings (i.e. php.ini) set to?

mackensen commented 5 years ago

It's set to America/New_York in our environments.

discoinfiltrator commented 5 years ago

America/Toronto in my case. The solution of running:

wp site list --field=url | xargs -n 1 -I % wp cron event run --due-now --url=% 

Didn't quite work out for me. There were too many duplicate jobs since the time switchover, and:

add_filter( 'pre_update_option_cron', __NAMESPACE__ . '\\update_cron_array', 10, 2 );
add_filter( 'pre_option_cron',        __NAMESPACE__ . '\\get_cron_array' );

from inc/connector/namespace.php was causing problems somehow. I need to investigate it further.

In the end, I backed up + truncated wp_cavalcade_logs and deleted every row from wp_cavalcade_jobs where interval wasn't null or status was completed or failed (i.e. I kept only the pending one-off events; the rest could reschedule themselves). The tables were getting pretty large, about 4.5 million rows in the logs table.

sirilyan commented 5 years ago

The problem is that Cavalcade-Runner sets a completed job's next run to nextrun + interval, even if that timestamp is now in the past. The job immediately gets picked up again and run. The DST changeover exposed this behaviour, but it'd also happen if the runner crashed and got restarted after a few hours. I'll submit a pull request on Cavalcade-Runner that seems to fix the problem.

rmccue commented 5 years ago

The problem is that Cavalcade-Runner sets a completed job's next run to nextrun + interval, even if that timestamp is now in the past.

This is intentional, so that if the Runner crashes (for example), the same number of jobs are run as it "catches up".

So long as everything is in UTC, there should be no issue with this, as the clock (generally) monotonically increases.
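Under that assumption, the intended catch-up behaviour can be sketched in a few lines (an illustrative Python model, not the Runner's actual code):

```python
# Sketch of the intended catch-up behaviour: after downtime, nextrun
# advances one interval per completed run until it passes "now".
now = 1_000_000_000          # pretend current UTC epoch time
nextrun = now - 5 * 3600     # the runner was down for about 5 hours
interval = 3600
runs = 0
while nextrun <= now:
    runs += 1                # the job is picked up and executed
    nextrun += interval      # monotonic under UTC, so the loop terminates
print(runs)                  # 6: one catch-up run per missed interval
```

The loop only terminates if nextrun + interval actually moves forward in real time, which is exactly the assumption the DST changeover broke.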

sirilyan commented 5 years ago

I've done a bit more investigation and it seems that under heavy load wp_next_scheduled() can return incorrect results, which makes wp_schedule_single_event() incorrectly queue multiple jobs.

I started up a new WP instance, created a simple hook I called "cavalcade_load" that just slept for a few seconds, and ran six copies of loadtest.sh in parallel:

#!/bin/bash
while true; do
    wp --path=/usr/share/wordpress eval "wp_schedule_single_event(time()+5, 'cavalcade_load');"
done

Stock WordPress refuses to populate the cron table with more than one job per second, no matter how many copies of loadtest.sh I ran in parallel. This makes sense, because the epoch time is the array key.
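That array-key constraint can be modelled in a few lines (a rough Python sketch, not WP's actual PHP; the function name mirrors WP's but the implementation is illustrative):

```python
# Rough model of WP's cron option: an array keyed by epoch timestamp,
# then by hook. Two identical events in the same second share a key,
# so the second schedule attempt is a no-op.
cron = {}

def schedule_single_event(timestamp, hook, args=()):
    events = cron.setdefault(timestamp, {})
    if hook in events:  # same hook, same second: already scheduled
        return False
    events[hook] = {"args": args, "schedule": False}
    return True

print(schedule_single_event(1547514000, "cavalcade_load"))  # True
print(schedule_single_event(1547514000, "cavalcade_load"))  # False, key collision
```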

Next step: install Cavalcade and Cavalcade-Runner, and spawn six copies of loadtest.sh. This produced:

mysql> select id,start from wp_cavalcade_jobs where hook = 'cavalcade_load';
+------+---------------------+
| id   | start               |
+------+---------------------+
| 1797 | 2019-01-15 01:19:43 |
| 1798 | 2019-01-15 01:20:21 |
| 1799 | 2019-01-15 01:20:43 |
| 1800 | 2019-01-15 01:20:54 |
| 1801 | 2019-01-15 01:20:54 |
| 1802 | 2019-01-15 01:21:05 |
| 1803 | 2019-01-15 01:21:05 |
| 1804 | 2019-01-15 01:21:16 |
| 1805 | 2019-01-15 01:21:26 |
| 1806 | 2019-01-15 01:21:26 |
| 1807 | 2019-01-15 01:21:38 |
| 1808 | 2019-01-15 01:21:38 |
| 1809 | 2019-01-15 01:21:50 |
| 1810 | 2019-01-15 01:21:51 |
+------+---------------------+

Note there are eight rows where the start time is the same as the start time in another row. This should be impossible - even if we didn't have WP's implementation indirectly promising that only one hook per second can be created, the process that created id 1801 should still have seen the existing hook from id 1800. You can increase the number of collisions by running more simultaneous loadtest.sh - at around 20 or so I can get three at once.

Here's another interesting excerpt from the table, later on:

| 1862 | 2019-01-15 01:55:05 | completed |
| 1863 | 2019-01-15 01:55:05 | completed |
| 1864 | 2019-01-15 01:55:06 | completed |
| 1865 | 2019-01-15 01:55:05 | completed |
| 1866 | 2019-01-15 01:55:06 | completed |
| 1867 | 2019-01-15 01:55:06 | completed |
| 1868 | 2019-01-15 01:55:06 | completed |
| 1869 | 2019-01-15 01:56:05 | completed |

The start time should be increasing monotonically the same way that id is (all of these jobs are running on the same server, so they all share the same system clock), but we've got jobs starting at second :06 interleaved with jobs starting at second :05.

This might be an issue with the cavalcade-jobs cache. I've got a local change where I query the database in is_created() for the job we're about to insert instead of assuming the cache is trustworthy, and it doesn't schedule duplicate jobs in the same second. I'll fork and create a branch if you're interested in seeing it.

discoinfiltrator commented 5 years ago

Yeah, this might not be directly responsible for the DST bug but it could be a factor. In wp-includes/cron.php, wp_schedule_single_event should prevent a duplicate job from being scheduled if an identical event is already due within 10 minutes of the requested timestamp:

// Don't schedule a duplicate if there's already an identical event due within 10 minutes of it
$next = wp_next_scheduled($hook, $args);
if ( $next && abs( $next - $timestamp ) <= 10 * MINUTE_IN_SECONDS ) {
    return false;
}

I would expect Cavalcade to be consistent with default WP behaviour.

peterwilsoncc commented 5 years ago

@discoinfiltrator There was a longstanding core bug in that code that will be fixed in WP 5.2, Cavalcade will inherit the bug fix.

discoinfiltrator commented 4 years ago

@rmccue , this happened again last night and I have a good explanation for the cause, and a fix.

In the Job class in Cavalcade-Runner, we have:

$this->nextrun = date( MYSQL_DATE_FORMAT, strtotime( $this->nextrun ) + $this->interval );

$this->nextrun is from the DB and is in UTC. However, PHP may be configured with a different timezone, meaning the PHP date and strtotime functions could be operating in a timezone that has a daylight saving switchover. So we end up with:

php > echo date_default_timezone_get();
America/Toronto
php > echo date('Y-m-d H:i:s', strtotime( '2019-11-03 01:08:45' ) + 3600);
2019-11-03 01:08:45

Note that the resulting time string is exactly the same as the input string. This causes the infinite loop: jobs scheduled between 1:00 am and 2:00 am with an interval of 3600s or less keep being rescheduled to times inside that same repeated hour (with a 3600s interval, to the exact time they initially had).
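For what it's worth, the same fall-back round trip can be reproduced outside PHP; a minimal Python sketch (assuming the system zoneinfo database is available):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Toronto")
# Parse the wall-clock string in local time, as strtotime() does.
# 01:08:45 is ambiguous on 2019-11-03; the first occurrence is EDT.
local = datetime(2019, 11, 3, 1, 8, 45, tzinfo=tz)
later = (local.astimezone(timezone.utc) + timedelta(seconds=3600)).astimezone(tz)
# One hour later in real time, but the clocks fell back, so the
# formatted local string is identical to the input.
print(later.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-11-03 01:08:45
```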

We can fix this by making sure that this calculation is done with UTC in mind.

$date = new DateTime( '2019-11-03 01:08:45', new DateTimeZone( 'UTC' ) );
$date->add( new DateInterval( 'PT3600S' ) );
$nextrun = $date->format( 'Y-m-d H:i:s' );

This will correctly result in a nextrun string of 2019-11-03 02:08:45. I've created a PR to address this:

https://github.com/humanmade/Cavalcade-Runner/pull/64

archon810 commented 3 years ago

Aaand we just got bitten by this a few days ago. Found 16k jobs and counting, with the top dozen recurring jobs not updating the date.

maciejmackowiak commented 3 years ago

This is because of how PHP's DateTime works: Cavalcade converts dates with mysql2date, which uses DateTime under the hood. To replicate it, suppose the site is in the Los Angeles time zone and there is a job in the DB with a saved date of 2021-03-14 02:10:10.

When Cavalcade queries the job, the to_instance method converts the date to a timestamp with mysql2date, and get_jobs_by_query converts it back to a date.

Try this in a WP instance that uses the Los Angeles time zone:

echo '2021-03-14 02:10:10<br>';
echo mysql2date('G', '2021-03-14 02:10:10') .'<br>';
echo date( 'Y-m-d H:i:s', mysql2date('G', '2021-03-14 02:10:10') );

results in:

2021-03-14 02:10:10
1615691410
2021-03-14 03:10:10

2021-03-14 02:10:10 does not exist in that time zone (clocks spring forward from 2:00 to 3:00), so the round trip shifts the date to 03:10:10, and when the job is queried back in pre_reschedule_event it cannot be found.
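The same spring-forward shift can be reproduced outside WordPress with a minimal Python sketch (illustrative only; it mirrors the values from the PHP example above):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Los_Angeles")
# 2021-03-14 02:10:10 does not exist in LA local time: clocks spring
# forward from 02:00 straight to 03:00, so interpreting the stored
# string in local time pushes it an hour forward.
stored = datetime(2021, 3, 14, 2, 10, 10, tzinfo=tz)
epoch = stored.timestamp()
round_tripped = datetime.fromtimestamp(epoch, tz).strftime("%Y-%m-%d %H:%M:%S")
print(round_tripped)  # 2021-03-14 03:10:10, no longer matching the stored row
```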