jreklund / php4dvd

php4dvd is an open source php/mysql powered movie database. Catalog your video collection with ease. Automatic update of information and images.
GNU General Public License v3.0
83 stars 43 forks source link

Sorting without 'The' prefix #4

Closed jdeluyck closed 7 years ago

jdeluyck commented 7 years ago

It would be great if you could allow sorting without the 'The' prefix ;) Perhaps also without 'A'.

jreklund commented 7 years ago

I haven't thought about 'A'. But 'The' got me a little irritated a while back so I'm writing my movies like: Mechanic, The

There are two possible solutions for this problem.

  1. Remove a,an,the from ORDER BY
  2. Then saving a movie, create a specific pre movie title, stripped from a,an,the

It also needs to be adaptable for other languages, so that you can remove "Les" from "Les Misérables".

Need to benchmark ORDER BY, don't have a table with 10k movies in it.

jdeluyck commented 7 years ago

I don't know if it'd be an option to foresee a list of prefix words via translation that could be filtered out?

jreklund commented 7 years ago

I'm thinking about adding $settings["order_by"] = 'A|An|The' that will be part of a regular expression, so users can add their own. Will be pretty hard to maintain specific list for all languages.

I have now benchmarked 10384 rows. And this is the result:

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END)
Benchmark
        Average number of seconds to run all queries: 59.662 seconds
        Minimum number of seconds to run all queries: 39.687 seconds
        Maximum number of seconds to run all queries: 91.766 seconds
        Number of clients running queries: 50
        Average number of queries per client: 1

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`
Benchmark
        Average number of seconds to run all queries: 5.948 seconds
        Minimum number of seconds to run all queries: 5.687 seconds
        Maximum number of seconds to run all queries: 6.094 seconds
        Number of clients running queries: 50
        Average number of queries per client: 1

So I need to save movies without A,An,The for sorting; if this is going to be practical for large tables. If you only use it for internal use or got a small database, you can patch your own Movies.class.php into this:

if($sort == "name asc") {
    $query .= " ORDER BY (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END) ASC";
} else if($sort == "name desc") {
    $query .= " ORDER BY (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END) DESC";
} else if($sort != "") {
    $query .= " ORDER BY ".$sort.", (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END)";
}

Same query with 472 rows (or if you limit it per page)

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END);
Benchmark
        Average number of seconds to run all queries: 0.197 seconds
        Minimum number of seconds to run all queries: 0.094 seconds
        Maximum number of seconds to run all queries: 1.094 seconds
        Number of clients running queries: 50
        Average number of queries per client: 1

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`;
Benchmark
        Average number of seconds to run all queries: 0.128 seconds
        Minimum number of seconds to run all queries: 0.015 seconds
        Maximum number of seconds to run all queries: 1.031 seconds
        Number of clients running queries: 50
        Average number of queries per client: 1
jdeluyck commented 7 years ago

Ouch, the performance penaly is quite hefty.

I'm nearly wondering if it wouldn't be better to get the list as-is, remove the prefixes in php code, and resort it?

I don't have the dataset to test it on, though.

jreklund commented 7 years ago

Here is a Apache benchmark, so it's only 0.02 faster (with MySQL cached). Using 7 different queries, with 700 requests (100 on every query). Don't want to run 500 different queries (benchmark from before), so not a real world example here. So it will be slower, due to the fact that MySQL can't cache something it never have run before.

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END);
Requests per second:    0.29 [#/sec] (mean)
Time per request:       3440.579 [ms] (mean)

SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`;
Requests per second:    0.31 [#/sec] (mean)
Time per request:       3228.066 [ms] (mean)

If you are just making one request (visit) it will load the homepage with 0.1 sec delay (according to smarty). So in the real world you won't notice it at all. If you don't publish it for the world to see...

If we resort it with PHP, we are just moving the problem. We need to make the same regex there and later on sort the array two times (year and name). It's possible to sort a multi array, but never done it with two values.

So option 2 is the only viable option. For good programming practices. Stripping A|An|The and store it in a separate column. Should it be named namesort , nameorder or nameorderby. Or just sort?

jdeluyck commented 7 years ago

sortingorder perhaps?

jreklund commented 7 years ago

I kinda want name in there so you know where it's from. How about namesuffix, due to the fact we delete some prefix?

jdeluyck commented 7 years ago

Might be confusing. namesortorder, nameorder then?