iftechfoundation / ifdb-suggestion-tracker

Bugs and feature requests for a future IFDB update
10 stars 0 forks source link

Slow Query: Search by text does a full scan of `games` table #404

Closed dfabulich closed 6 months ago

dfabulich commented 9 months ago

I'm not sure what we could do about this… Some kinda full-text search index?

# Time: 231010  4:52:59
# User@Host: ifdb[ifdb] @ localhost [127.0.0.1]
# Thread_id: 6588605  Schema: ifdb  QC_hit: No
# Query_time: 27.193154  Lock_time: 0.000733  Rows_sent: 20  Rows_examined: 228059
# Rows_affected: 0  Bytes_sent: 7141
# Tmp_tables: 5  Tmp_disk_tables: 1  Tmp_table_sizes: 7036508
# Full_scan: Yes  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 1  Priority_queue: No
#
# explain: id   select_type table   type    possible_keys   key key_len ref rows    r_rows  filtered    r_filtered  Extra
# explain: 1    PRIMARY games   ALL title_2,title   NULL    NULL    NULL    12809   12809.00    100.00  0.53    Using where; Using temporary; Using filesort
# explain: 1    PRIMARY <derived4>  ref key0    key0    34  ifdb.games.id   10  0.82    100.00  100.00  
# explain: 4    DERIVED <derived5>  ALL NULL    NULL    NULL    NULL    68110   22599.00    100.00  100.00  Using temporary; Using filesort
# explain: 5    DERIVED reviews ALL NULL    NULL    NULL    NULL    68110   68110.00    100.00  93.12   Using where; Using temporary; Using filesort
# explain: 5    DERIVED users   eq_ref  PRIMARY PRIMARY 34  ifdb.reviews.userid 1   1.00    100.00  100.00  Using where
#
SET timestamp=1696913579;
select sql_calc_found_rows
              games.id as id,
                       games.title as title,
                       games.author as author,
                       games.tags as tags,
                       games.moddate as moddate,
                       games.system as devsys,
                       if (time(games.published) = '00:00:00',
                           date_format(games.published, '%Y'),
                           date_format(games.published, '%M %e, %Y'))
                         as pubfmt,
                       if (time(games.published) = '00:00:00',
                           date_format(games.published, '%Y'),
                           date_format(games.published, '%Y-%m-%d'))
                         as published,
                       date_format(games.published, '%Y') as pubyear,
                       (games.coverart is not null) as hasart,
                       avgRating as avgrating,
                       numRatingsInAvg as ratingcnt,
                       stdDevRating as ratingdev,
                       numRatingsTotal,
                       numMemberReviews,
                       games.sort_title as sort_title,
                       games.sort_author as sort_author,
                       ifnull(games.published, '9999-12-31') as sort_pub,
                       games.flags
              , match (games.id, title, author, `desc`, tags) against ('Zork' in boolean mode) as relevance
            from
              games
                      left join gameRatingsSandbox0 on games.id = gameid
            where
              ( (match (games.id, title, author, `desc`, tags) against ('Zork' in boolean mode) or (title like 'Zork'  or ( title like '%Zork%'))))

            group by games.id

            order by
              if(title = 'Zork',0,1),if(title like 'Zork%',0,1),if((title like 'Zork'  or ( title like '%Zork%')),0,1),relevance desc,
              sort_title
            limit 0, 20;
salty-horse commented 6 months ago

This might be a regression from Search by TUID #254 (commit ebd1eecf95a86ed95cb92c6e5f6bea0fa68634f8), which added games.id to the list of columns to match.

The columns must be identical to the list in the fulltext index, or it won't get used. See Full-Text Restrictions:

The MATCH() column list must match exactly the column list in some FULLTEXT index definition for the table, unless this MATCH() is IN BOOLEAN MODE on a MyISAM table. For MyISAM tables, boolean-mode searches can be done on nonindexed columns, although they are likely to be slow.

~~Does anyone actually use the "search by TUID" functionality, or was it just added to fix the popup for searching games? The popup's query can be changed to search for "X tuid:X" the only search term X is a 16-char alphanumeric string.~~

EDIT: The suggestion won't work, since it ANDs the tuid search with the X.

Also, as mentioned in https://github.com/iftechfoundation/ifdb-suggestion-tracker/issues/300#issuecomment-1353480657, the instruction above the pop-up is wrong, since you can't search by IFID without using an ifid prefix. Whatever the solution is, it would need to be edited.

Several options:

  1. Rebuilding the (title, author, desc, tags) index with the id column, and editing the instruction above the pop-up to say it also searches author names and tags.
  2. Creating a new index with (id, title).
  3. Editing the instruction above the popup, saying that to search by TUID or IFID you need to only specify a single value, and the, which gets translated into a series of queries that first attempt a tuid/ifid search, before trying a regular search.
dfabulich commented 6 months ago

You're right, but that's not the entire problem. Starting with the game-ratings-materialized-view branch, I reverted https://github.com/iftechfoundation/ifdb/commit/ebd1eecf95a86ed95cb92c6e5f6bea0fa68634f8 and tried a simple search for Zork http://localhost:8080/search?searchbar=Zork.

It ran this query:

select
    games.id as id,
    games.title as title,
    games.author as author,
    games.tags as tags,
    games.moddate as moddate,
    games.system as devsys,
    if (time(games.published) = '00:00:00',
        date_format(games.published, '%Y'),
        date_format(games.published, '%M %e, %Y'))
        as pubfmt,
    if (time(games.published) = '00:00:00',
        date_format(games.published, '%Y'),
        date_format(games.published, '%Y-%m-%d'))
        as published,
    date_format(games.published, '%Y') as pubyear,
    (games.coverart is not null) as hasart,
    avgRating as avgrating,
    numRatingsInAvg as ratingcnt,
    stdDevRating as ratingdev,
    numRatingsTotal,
    numMemberReviews,
    games.sort_title as sort_title,
    games.sort_author as sort_author,
    ifnull(games.published, '9999-12-31') as sort_pub,
    games.flags
    , match (title, author, `desc`, tags) against ('Zork' in boolean mode) as relevance
from
    games
    left join gameRatingsSandbox0_mv on games.id = gameid
where
    ( (match (title, author, `desc`, tags) against ('Zork' in boolean mode) or (title like 'Zork'  or ( title like '%Zork%'))))

order by
    if(title = 'Zork',0,1),if(title like 'Zork%',0,1),if((title like 'Zork'  or ( title like '%Zork%')),0,1),relevance desc

limit 0, 20;

When I analyze that query, it says that it considered using the fulltext indexes, but decided against it.

+------+-------------+------------------------+--------+---------------+---------+---------+---------------+-------+----------+----------+------------+----------------------------------------------+
| id   | select_type | table                  | type   | possible_keys | key     | key_len | ref           | rows  | r_rows   | filtered | r_filtered | Extra                                        |
+------+-------------+------------------------+--------+---------------+---------+---------+---------------+-------+----------+----------+------------+----------------------------------------------+
|    1 | SIMPLE      | games                  | ALL    | title_2,title | NULL    | NULL    | NULL          | 13007 | 13007.00 |   100.00 |       0.52 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | gameRatingsSandbox0_mv | eq_ref | PRIMARY       | PRIMARY | 34      | ifdb.games.id | 1     | 0.91     |   100.00 |     100.00 |                                              |
+------+-------------+------------------------+--------+---------------+---------+---------+---------------+-------+----------+----------+------------+----------------------------------------------+

I believe this is happening because of the LIKE parts of the WHERE clause. When I go to searchutil.php and look at line 446:

$expr = "($matchMode ($matchExpr or $likeExpr))";

… and replace that line with:

$expr = "($matchMode ($matchExpr))";

It then generates a query without the LIKE clauses:

select
    games.id as id,
    games.title as title,
    games.author as author,
    games.tags as tags,
    games.moddate as moddate,
    games.system as devsys,
    if (time(games.published) = '00:00:00',
        date_format(games.published, '%Y'),
        date_format(games.published, '%M %e, %Y'))
        as pubfmt,
    if (time(games.published) = '00:00:00',
        date_format(games.published, '%Y'),
        date_format(games.published, '%Y-%m-%d'))
        as published,
    date_format(games.published, '%Y') as pubyear,
    (games.coverart is not null) as hasart,
    avgRating as avgrating,
    numRatingsInAvg as ratingcnt,
    stdDevRating as ratingdev,
    numRatingsTotal,
    numMemberReviews,
    games.sort_title as sort_title,
    games.sort_author as sort_author,
    ifnull(games.published, '9999-12-31') as sort_pub,
    games.flags
    , match (title, author, `desc`, tags) against ('Zork' in boolean mode) as relevance
from
    games
    left join gameRatingsSandbox0_mv on games.id = gameid
where
    ( (match (title, author, `desc`, tags) against ('Zork' in boolean mode)))

order by
    if(title = 'Zork',0,1),if(title like 'Zork%',0,1),if((title like 'Zork'  or ( title like '%Zork%')),0,1),relevance desc

limit 0, 20;

… and that query is using the fulltext index when I ANALYZE it.

+------+-------------+------------------------+----------+---------------+---------+---------+---------------+------+--------+----------+------------+----------------------------------------------+
| id   | select_type | table                  | type     | possible_keys | key     | key_len | ref           | rows | r_rows | filtered | r_filtered | Extra                                        |
+------+-------------+------------------------+----------+---------------+---------+---------+---------------+------+--------+----------+------------+----------------------------------------------+
|    1 | SIMPLE      | games                  | fulltext | title         | title   | 0       |               | 1    | 68.00  |   100.00 |     100.00 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | gameRatingsSandbox0_mv | eq_ref   | PRIMARY       | PRIMARY | 34      | ifdb.games.id | 1    | 0.91   |   100.00 |     100.00 |                                              |
+------+-------------+------------------------+----------+---------------+---------+---------+---------------+------+--------+----------+------------+----------------------------------------------+

Now, what's really confusing about that is that I would have therefore expected different results when removing the LIKE clause, but, to my surprise, it didn't change anything in the results.

I thought that maybe a search for ounterfeit Monkey without the C would trigger different results, but, no, Counterfeit Monkey returned right at the top; the results seem identical to the results we'd get from including the LIKE clause.

I… don't really understand this yet, but I'll put together a PR and try to investigate it later.