humanmade / Cavalcade

A better wp-cron. Horizontally scalable, works perfectly with multisite.
https://engineering.hmn.md/projects/cavalcade/
Other
528 stars 46 forks source link

Issues with the index on `nextrun` #106

Closed tfrommen closed 3 years ago

tfrommen commented 3 years ago

In #95 (👋 @peterwilsoncc 🙂), several new indexes on the jobs table were introduced, one of them being nextrun, on the column with the same name.

I would like to question the benefit of having said index, and, in fact, illustrate why I think it actually is a performance burden than anything else.

Context

Firstly, the nextrun column is of type datetime.

Given the nature of the data, that is, information about cron tasks, it is highly unlikely to have several rows with the exact same timestamp. Sure, for high-traffic sites with lots of scheduled tasks, and/or bugs (or just logical errors) in the code scheduling said tasks, you might end up with a few tasks being scheduled for the exact same time. In reality, however, this is not often the case. And even if it was, I assume we would be talking about less than 10 and definitely less than 100 tasks. Nothing that an index would make a meaningful difference in.

Since no other index covers the nextrun column, MySQL (or whatever your DB is) will try and make good use of the dedicated nextrun index, whenever the nextrun column is used for querying and/or sorting/grouping etc.

Problem

In theory, the nextrun index is a one-to-many mapping. However, in practice, it will most probably only exist as one-to-one (or maybe one-to-few) map, and thus one level of unnecessary indirection.

I have not checked all (core) functionality or rather the respective queries including nextrun, but pretty much all usage of the Job::get_jobs_by_query method that results in MySQL queries including nextrun BETWEEN ... AND ... or nextrun < ..., will mean a huge performance burden.

Here is a query taken from a live project:

SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
    AND hook = 'batch_export_pages_0'
    AND args = '<SERIALIZED-ARRAY>'
    AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
    AND status IN( 'waiting', 'running' )
ORDER BY nextrun ASC
LIMIT 1
;

Given a table with 1.7 million jobs, executing the above query takes about 4 seconds!

The 1970-01-01 00:00:00 timestamp—the Beginning of Time 😉—is due to using 0 in case when an event's timestamp is more than 10 minutes in the past. However, the date range does not really pose an issue, and using something like 2020-01-01 00:00:00 instead will perform just as badly.

Possible Solutions

I see, at least, two ways to go about this:

If we can agree that the index is, in fact, useless, given the nature of time, let's just delete it.

If, on the other hand, there is something to gain by having the index—I'm looking forward to learning what that would look like—I suggest to ignore the index for queries that pose problematic. This can be done by enhancing Job::get_jobs_by_query to add IGNORE INDEX (nextrun) to the query for a non-integer timestamp, that is, the 'past' string, or a range of timestamps.

Granted, there is a third option, which is to regularly truncate (or clean up) the jobs table. Oftentimes, however, this is not possible or desired.

Data

Here is some real-world data, underlining the above suggestion of ignoring/deleting the index. For all queries, I ran RESET QUERY CACHE; right before them, to not have differing results due to cached data.

Given a jobs table with almost 1 million rows (960,764), will execute the above query as follows:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE wp_cavalcade_jobs range status, site, hook, nextrun nextrun 5 NULL 480467 Using index condition; Using where

As stated, this will search a total of 480,467 rows, which is about half the size of the whole table!

Execution, on the project's staging server, took 4.33114400 seconds.

This is the detailed profile (using SET profiling = 1 and SHOW PROFILE FOR QUERY #):

Status Duration
starting 0.000014
checking query cache for query 0.000117
checking permissions 0.000009
Opening tables 0.000026
init 0.000048
System lock 0.000041
optimizing 0.000019
statistics 0.000151
preparing 0.000043
Sorting result 0.000005
executing 0.000004
Sending data 4.330501
end 0.000021
query end 0.000011
closing tables 0.000015
freeing items 0.000053
storing result in query cache 0.000011
logging slow query 0.000046
cleaning up 0.000008
cleaned up 0.000004

The exact same query, now using IGNORE INDEX (nextrun), will be executed like so:

id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE wp_cavalcade_jobs range status, site, hook status 257 NULL 448 Using index condition; Using where; Using filesort

This is using filesort, and/but will only search a total of 448 rows! In this example, this is even less than 1% of the previous query using the index (to no avail).

The profile looks as follows:

Status Duration
starting 0.000011
checking query cache for query 0.000098
checking permissions 0.000007
Opening tables 0.000019
init 0.000048
System lock 0.000039
optimizing 0.000514
statistics 0.000137
preparing 0.000036
Sorting result 0.000007
executing 0.000004
Sending data 0.000013
Creating sort index 0.002024
end 0.000010
query end 0.000008
closing tables 0.000017
freeing items 0.000036
storing result in query cache 0.000016
cleaning up 0.000005
cleaned up 0.000003

Executing the query, overall, takes 0.00305075 seconds, or 3 milliseconds, which is less than 0.1% of the previous query execution time.

Thoughts?

I'm happy to hear anyone's thoughts on the current state, the problem that I explained, as well as any kind of possible solution! Especially interested to hear @peterwilsoncc's reasons for introducing the nextrun index. 🙂

Nikschavan commented 3 years ago

We dropped the index nextrun on the table mentioned in the issue description and saw a dramatic improvement in the query performance and hence database load/Cavalcade job execution throughput.

The RDS CPU usage dropped from ~80% to 4% immediately after dropping the index -

image

And cavalcade jobs execution throughput can be seen in this screenshot, This was executing ~130 jobs per hours from this it executed 1.5K jobs in ~30 minutes, only change in between was the dropped index -

image

peterwilsoncc commented 3 years ago

Thanks for the detailed ticket and graphs folks.

The new indexes where added as a follow up to #75 as more complex queries were been made as the site loaded. The theory was that it would speed them up. Evidently this turned out to be incorrect for larger sites or once the jobs table grew to include a number of failed and completed jobs.

As Cavalcade is designed for larger sites, I am happy if it's removed.

As a completely ignorant question, does the order of the WHERE clause have an effect, ie is this faster:

SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
    AND hook = 'batch_export_pages_0'
    AND args = '<SERIALIZED-ARRAY>'
    AND status IN( 'waiting', 'running' )
    AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1
;

Also, I gather the above is run while scheduling a single task within the next ten minutes or a zero timestamp. The time clause when scheduling such tasks could be simplified too:

SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
    AND hook = 'batch_export_pages_0'
    AND args = '<SERIALIZED-ARRAY>'
    AND status IN( 'waiting', 'running' )
    AND nextrun <= '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1
tfrommen commented 3 years ago

As a completely ignorant question, does the order of the WHERE clause have an effect, ie is this faster:

SELECT *
FROM wp_cavalcade_jobs
WHERE site = 19
  AND hook = 'batch_export_pages_0'
  AND args = '<SERIALIZED-ARRAY>'
  AND status IN( 'waiting', 'running' )
  AND nextrun BETWEEN '1970-01-01 00:00:00' AND '2020-11-27 08:48:16'
ORDER BY nextrun ASC
LIMIT 1
;

I haven't tested, but I think this wouldn't make a difference, as the nextrun column is still being used for both querying (in the WHERE clause) and sorting (in the ORDER BY clause).

In general, MySQL will pick the index that means to search the smallest number of rows. Looking at the Cavalcade jobs table, this can be different across projects, though.

As I mentioned before, it would be possible to use IGNORE INDEX (nextrun) to tell MySQL not to use the nextrun index. Similarly, adding USE INDEX (status) would tell MySQL to use status, and thus not nextrun.

But I am still not convinved the nextrun index brings any value at all. Using Cavalcade in a meanigful way, I don't think you would ever ask the database something like "Give me the next scheduled job after timestamp X". Instead, you would query for waiting or running jobs, or completed ones. Or you would ask for the time that a particular job is scheduled next. Or something like that.

From that same project, this is the TOP 5 of queries over the last five days (other than the above one):

None of them would make good use of the nextrun index, I don't think.

Would be good to also get some input from either @rmccue or @roborourke...

roborourke commented 3 years ago

Thanks for this in depth analysis @tfrommen and @Nikschavan. This seems like an overwhelming argument to remove the index.

Let's make that happen!

rmccue commented 3 years ago

Agreed, let's drop it.

roborourke commented 3 years ago

Thanks @tfrommen - this is now available in v2.0.2

peterwilsoncc commented 3 years ago

Thanks folks, sorry I broke your product.

roborourke commented 3 years ago

EABEC9A8-E2BF-4222-A687-EF9EEAE4100A

rmccue commented 3 years ago

@peterwilsoncc I'll send you the invoice.

archon810 commented 3 years ago

So I upgraded from 2.0.1 to 2.0.2 (using composer), and I'm still seeing the nextrun index. Am I expected to do something to trigger the db migration that drops it?

tfrommen commented 3 years ago

@archon810 you would need to run wp cavalcade upgrade (or trigger the upgrade routine some other way).

archon810 commented 3 years ago

@tfrommen Thank you, that did it. What other ways trigger the upgrade routine? Shouldn't it run automatically at some point after the update? This is probably something that should be in release notes, otherwise nobody's going to get the benefits of the db change unless they know to run the upgrade command (and things would be a lot more disastrous if the db structure actually contained breaking changes).

rmccue commented 3 years ago

@archon810 It doesn't run automatically as the upgrade operation can affect the database in load-affecting ways, such as locking writes or just doing a lot of writes. Cavalcade is designed to continue operating with an non-upgraded database (similarly to WordPress).

We can probably do a better job of documenting this though. :)

rmccue commented 3 years ago

Updated the release notes for 2.0.2.