Closed jdeluyck closed 7 years ago
I haven't thought about 'A'. But 'The' got me a little irritated a while back so I'm writing my movies like: Mechanic, The
There are two possible solutions for this problem.
It also needs to be adaptable for other languages, so that you can remove "Les" from "Les Misérables".
Need to benchmark ORDER BY, don't have a table with 10k movies in it.
I don't know if it'd be an option to foresee a list of prefix words via translation that could be filtered out?
I'm thinking about adding $settings["order_by"] = 'A|An|The'
that will be part of a regular expression, so users can add their own. Will be pretty hard to maintain specific list for all languages.
I have now benchmarked 10384 rows. And this is the result:
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END)
Benchmark
Average number of seconds to run all queries: 59.662 seconds
Minimum number of seconds to run all queries: 39.687 seconds
Maximum number of seconds to run all queries: 91.766 seconds
Number of clients running queries: 50
Average number of queries per client: 1
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`
Benchmark
Average number of seconds to run all queries: 5.948 seconds
Minimum number of seconds to run all queries: 5.687 seconds
Maximum number of seconds to run all queries: 6.094 seconds
Number of clients running queries: 50
Average number of queries per client: 1
So I need to save movies without A,An,The for sorting; if this is going to be practical for large tables. If you only use it for internal use or got a small database, you can patch your own Movies.class.php
into this:
if($sort == "name asc") {
$query .= " ORDER BY (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END) ASC";
} else if($sort == "name desc") {
$query .= " ORDER BY (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END) DESC";
} else if($sort != "") {
$query .= " ORDER BY ".$sort.", (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END)";
}
Same query with 472 rows (or if you limit it per page)
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END);
Benchmark
Average number of seconds to run all queries: 0.197 seconds
Minimum number of seconds to run all queries: 0.094 seconds
Maximum number of seconds to run all queries: 1.094 seconds
Number of clients running queries: 50
Average number of queries per client: 1
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`;
Benchmark
Average number of seconds to run all queries: 0.128 seconds
Minimum number of seconds to run all queries: 0.015 seconds
Maximum number of seconds to run all queries: 1.031 seconds
Number of clients running queries: 50
Average number of queries per client: 1
Ouch, the performance penaly is quite hefty.
I'm nearly wondering if it wouldn't be better to get the list as-is, remove the prefixes in php code, and resort it?
I don't have the dataset to test it on, though.
Here is a Apache benchmark, so it's only 0.02 faster (with MySQL cached). Using 7 different queries, with 700 requests (100 on every query). Don't want to run 500 different queries (benchmark from before), so not a real world example here. So it will be slower, due to the fact that MySQL can't cache something it never have run before.
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, (CASE WHEN `name` REGEXP '^(A|An|The)[[:space:]]' = 1 THEN TRIM(SUBSTR(`name` , INSTR(`name` ,' '))) ELSE `name` END);
Requests per second: 0.29 [#/sec] (mean)
Time per request: 3440.579 [ms] (mean)
SELECT SQL_CALC_FOUND_ROWS `id`,`imdbid`,`name`,`format`,`own`,`seen` FROM `movies` WHERE 1 = 1 ORDER BY year desc, `name`;
Requests per second: 0.31 [#/sec] (mean)
Time per request: 3228.066 [ms] (mean)
If you are just making one request (visit) it will load the homepage with 0.1 sec delay (according to smarty). So in the real world you won't notice it at all. If you don't publish it for the world to see...
If we resort it with PHP, we are just moving the problem. We need to make the same regex there and later on sort the array two times (year and name). It's possible to sort a multi array, but never done it with two values.
So option 2 is the only viable option. For good programming practices. Stripping A|An|The and store it in a separate column. Should it be named namesort
, nameorder
or nameorderby
. Or just sort
?
sortingorder
perhaps?
I kinda want name
in there so you know where it's from. How about namesuffix
, due to the fact we delete some prefix?
Might be confusing. namesortorder
, nameorder
then?
It would be great if you could allow sorting without the 'The' prefix ;) Perhaps also without 'A'.