msaari / relevanssi

Relevanssi, a WordPress plugin to improve the search
GNU General Public License v3.0
49 stars 21 forks source link

another index performance issue related to #25 #26

Closed asakous closed 3 years ago

asakous commented 3 years ago

assume relevanssi table has millions record in order to retrieve un-index post id . the query (see below) need to be execute each loop and it will perform slower and slower . my solution is create another table that only record indexed post id (need modify code)and create a pk key for the column. after that change left join relevanssi table to what I just created. theory is quite simple. instead of join a big table why not just join small one.

$q = "SELECT post.ID FROM $wpdb->posts post LEFT JOIN $wpdb->posts parent ON (post.post_parent=parent.ID) LEFT JOIN $relevanssi_table r ON (post.ID=r.doc)

msaari commented 3 years ago

In general Relevanssi is not recommended for cases where the table has millions of records. That may work, if the hardware is up to it, but I've seen a case where on a shared hosting a site with four million rows in the wp_relevanssi table spent 75 seconds doing a search. Having the indexing run faster does not help there at all. (With good hardware, apparently indexing millions of posts is possible.)

My test site only has 250k rows in the wp_relevanssi table. There the lookup time increases steadily up to a point, then fluctuates without growing much. It doesn't get to problematic levels there. What kind of query times are you seeing? If you log the times in relevanssi_build_index() like this, how does it look like?

    $time_before = microtime();
    $content = $wpdb->get_results( $query ); // phpcs:ignore WordPress.DB.PreparedSQL.NotPrepared
    $time_spent = microtime() - $time_before;
    error_log("Time spent: $time_spent");

I don't mind making the indexing faster, but adding another table is a very big complication in the works, and one I'd like to avoid as far as possible. I would need to see some numbers on how bad the problem is, and how big a difference the change can make – in my case this query takes 1.29 seconds of time, when the whole indexing process takes 309 seconds. This is less than half a percent of the time spent, so it does feel like overoptimizing. But with a big database, the situation may be different. Can you provide some data on this?

asakous commented 3 years ago

same hardware 400K+ record. I just change left join table and insert new post_id to new table. it seems double the performance at least.

the result Time spent:00:00:05.670 indexed:20 Time spent:00:00:09.724 indexed:20 Time spent:00:00:08.804 indexed:20 Time spent:00:00:08.012 indexed:20 Time spent:00:00:06.270 indexed:20 Time spent:00:00:18.588 indexed:20 Time spent:00:00:12.463 indexed:15 Time spent:00:00:26.075 indexed:10 Time spent:00:00:06.694 indexed:5 Time spent:00:00:00.694 indexed:5 Time spent:00:00:02.435 indexed:11 Time spent:00:00:07.737 indexed:16 Time spent:00:00:08.043 indexed:16 Time spent:00:00:05.737 indexed:16 Time spent:00:00:06.892 indexed:16 Time spent:00:00:04.763 indexed:16 Time spent:00:00:05.993 indexed:16 Time spent:00:00:08.580 indexed:16 -------------------------------------------- below new method Time spent:00:00:01.518 indexed:9 Time spent:00:00:04.584 indexed:19 Time spent:00:00:04.091 indexed:24 Time spent:00:00:05.226 indexed:29 Time spent:00:00:03.573 indexed:29 Time spent:00:00:05.087 indexed:34 Time spent:00:00:04.928 indexed:34 Time spent:00:00:04.644 indexed:34 Time spent:00:00:07.229 indexed:39 Time spent:00:00:06.947 indexed:39 Time spent:00:00:05.692 indexed:39 Time spent:00:00:04.733 indexed:39

msaari commented 3 years ago

It's faster, yes, but if you compare the total indexing time, how much faster is that? Is there a significant difference in the total indexing time? What exactly did you change here?

asakous commented 3 years ago

total indexing time. same hardware 2116391 record in the table and I delete 150654 record. re-index unindexed post

Time elapsed: 0:05:13 | Time remaining: we're done!-->my join table method Time elapsed: 0:09:15 | Time remaining: we're done!

my method only affect people who have millions record in the table. for the small table. don't bother to change it.

msaari commented 3 years ago

Well, that is a significant difference there. On the other hand, nine minutes is much faster than I expected in the first place.

What exactly did you do here? I'd like to give it a go.

asakous commented 3 years ago

1: CREATE TABLE wp_relevanssi_id ( doc int(11) NOT NULL DEFAULT '0', PRIMARY KEY (doc) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; 2:indexing.php if ( ! empty( $values ) ) { $values = implode( ', ', $values ); $query = "INSERT IGNORE INTO $relevanssi_table (doc, term, term_reverse, content, title, comment, tag, link, author, category, excerpt, taxonomy, customfield, type, taxonomy_detail, customfield_detail, mysqlcolumn) VALUES $values"; $query2 = "INSERT IGNORE INTO wp_relevanssi_id (doc) VALUES (".$post->ID.") ";-->add if ( $debug ) { relevanssi_debug_echo( "Final indexing query:\n\t$query" ); }

    $wpdb->query( $query2 );-->add 
    $wpdb->query( $query ); // phpcs:ignore WordPress.DB.PreparedSQL.NotPrepared

}

3:indexing.php $q = "SELECT post.ID FROM $wpdb->posts post LEFT JOIN $wpdb->posts parent ON (post.post_parent=parent.ID) LEFT JOIN wp_relevanssi_id r ON (post.ID=r.doc)-->change table name WHERE $processed_post_filter AND (post.post_status IN ($valid_status) OR (post.post_status='inherit' AND( (parent.ID is not null AND (parent.post_status IN ($valid_status))) OR (post.post_parent=0) ) ) )

not implemented
rebuild whole index should truncate table wp_relevanssi_id too.

another possible optimization is that when people try to rebuild whole index don't truncate table first. the rebuild flow should be 1:create a identical relevanssi temporary table name relevanssi_temp 2:insert index to relevanssi_temp 3: when build index completed , rename relevanssi table to relevanssi_temp_b rename relevanssi_temp to relevanssi
rename table relevanssi_temp_b to relevanssi_temp truncate relevanssi_temp that way to aviod search downtime

msaari commented 3 years ago

Thanks. I'll try this myself. I was thinking it might be possible to implement this as an add-on plugin that could be used when necessary. That might make sense. I'll have to think about this.

Your optimization is a nice trick, but Relevanssi is already taking up so much database space that doubling that – even momentarily – is probably too much.

asakous commented 3 years ago

thanks.