ppy / osu-infrastructure

40 stars 5 forks source link

Investigate data size concerns for adding a new index to `solo_scores` #17

Closed peppy closed 1 year ago

peppy commented 1 year ago

Database structure testing

As per #15, we need an index on (user_id, ruleset_id, beatmap_id) if using the database for operations like https://github.com/ppy/osu-queue-score-statistics/pull/149 (and in the future, ranked score processing and probably more).

I want to test on production data (scaled down slightly) to get a real-world idea of how adding/changing indices affect the size of this table.

Test whether changing primary key composition affects index size

TL;DR it doesn't seem to.

-- original
ALTER TABLE `solo_scores`
DROP PRIMARY KEY,
ADD PRIMARY KEY (`id`, `preserve`, `ruleset_id`, `created_at`);

MySQL root@(none):osu> SELECT
                    ->     table_name AS `Table`,
                    ->     ROUND(data_length / 1024 / 1024) AS `Data MB`,
                    ->     ROUND(index_length / 1024 / 1024) AS `Index MB`
                    -> FROM
                    ->     information_schema.tables
                    -> WHERE
                    ->     table_schema = 'osu' AND
                    ->     table_name = 'solo_scores';
+-------------+---------+----------+
| Table       | Data MB | Index MB |
+-------------+---------+----------+
| solo_scores | 1868    | 237      |
+-------------+---------+----------+

MySQL root@(none):osu> SHOW TABLE STATUS LIKE 'solo_scores';
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| Name        | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation          | Checksum | Create_options                                     | Comment |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| solo_scores | InnoDB | 10      | Compressed | 9523175 | 205            | 1958215680  | 0               | 248758272    | 3932160   | 2406267336     | 2023-09-01 01:24:18 | <null>      | <null>     | utf8mb4_0900_ai_ci | <null>   | row_format=COMPRESSED KEY_BLOCK_SIZE=4 partitioned |         |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
-- remove ruleset_id and see if index size changes
ALTER TABLE `solo_scores`
DROP PRIMARY KEY,
ADD PRIMARY KEY (`id`, `preserve`, `created_at`)

MySQL root@(none):osu> SELECT
                    ->     table_name AS `Table`,
                    ->     ROUND(data_length / 1024 / 1024) AS `Data MB`,
                    ->     ROUND(index_length / 1024 / 1024) AS `Index MB`
                    -> FROM
                    ->     information_schema.tables
                    -> WHERE
                    ->     table_schema = 'osu' AND
                    ->     table_name = 'solo_scores';
+-------------+---------+----------+
| Table       | Data MB | Index MB |
+-------------+---------+----------+
| solo_scores | 1867    | 236      |
+-------------+---------+----------+

MySQL root@(none):osu> SHOW TABLE STATUS LIKE 'solo_scores';
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| Name        | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation          | Checksum | Create_options                                     | Comment |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+
| solo_scores | InnoDB | 10      | Compressed | 9577016 | 204            | 1957953536  | 0               | 247709696    | 5505024   | 2406267336     | 2023-09-01 02:58:21 | <null>      | <null>     | utf8mb4_0900_ai_ci | <null>   | row_format=COMPRESSED KEY_BLOCK_SIZE=4 partitioned |         |
+-------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------------------------------------------+---------+

Test whether existing index needs id DESC at end (or whether it can use the implicity primary key)

TL;DR it doesn't, it can use the primary key

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

DROP INDEX `user_ruleset_id_index`;
ADD INDEX `user_ruleset_id_index` (`user_id`,`ruleset_id`);

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

Test how badly inserting beatmap_id into the existing index (KEY user_ruleset_id_index (user_id,ruleset_id,id DESC)) breaks things (100% it will)

TL;DR it does, as expected. We can get around this by changing the existing KEY beatmap_id to include user_id at the end. This comes at almost zero storage cost due to the high cardinality of users per beatmap (almost completely unique). So we're just changing the ordering of the index rather than adding new overheads.

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

MySQL root@(none):osu> explain SELECT `id` FROM `solo_scores_p` WHERE `user_id` = 19743981 AND `ruleset_id` = 0 and preserve in (0,1) ORDER BY `id`limit 10;
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+
| id | select_type | table         | partitions | type | possible_keys | key           | key_len | ref         | rows    | filtered | Extra                    |
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+
| 1  | SIMPLE      | solo_scores_p | p0catch,p1 | ref  | user_id_index | user_id_index | 6       | const,const | 1700684 | 20.0     | Using where; Using index |
+----+-------------+---------------+------------+------+---------------+---------------+---------+-------------+---------+----------+--------------------------+

10 rows in set
Time: 0.005s

DROP INDEX `user_ruleset_id_index`;
ADD INDEX `user_ruleset_id_index` (`user_id`,`ruleset_id`, `beatmap_id`, `id` DESC);

explain SELECT * FROM solo_scores WHERE user_id = 2 AND ruleset_id = 0 ORDER BY id DESC LIMIT 50;

MySQL root@(none):osu> explain SELECT `id` FROM `solo_scores_p` WHERE `user_id` = 19743981 AND `ruleset_id` = 0 and preserve
                    -> in (0,1) ORDER BY `id`limit 10;
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
| id | select_type | table         | partitions | type  | possible_keys | key     | key_len | ref    | rows | filtered | Extra       |
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+
| 1  | SIMPLE      | solo_scores_p | p0catch,p1 | index | user_id_index | PRIMARY | 16      | <null> | 62   | 3.18     | Using where |
+----+-------------+---------------+------------+-------+---------------+---------+---------+--------+------+----------+-------------+

10 rows in set
Time: 0.498s

Test whether including ruleset_id in the constructed partitions (as touched on in #16) means we can remove it from indices

TL;DR correct, but I'm avoiding this because of concerns for future extensibility (custom rulesets). Also there's no storage saving on disk.


explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75;

ALTER TABLE `solo_scores`
    PARTITION BY RANGE COLUMNS(`preserve`, `ruleset_id`, created_at)
        (
        PARTITION p0r0catch VALUES LESS THAN (0,0,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r1catch VALUES LESS THAN (0,1,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r2catch VALUES LESS THAN (0,2,MAXVALUE) ENGINE = InnoDB,
        PARTITION p0r3catch VALUES LESS THAN (0,3,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r0 VALUES LESS THAN (1,0,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r1 VALUES LESS THAN (1,1,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r2 VALUES LESS THAN (1,2,MAXVALUE) ENGINE = InnoDB,
        PARTITION p1r3 VALUES LESS THAN (1,3,MAXVALUE) ENGINE = InnoDB
        );

-- working alternative?
ALTER TABLE `solo_scores`
 PARTITION BY list COLUMNS(`preserve`, `ruleset_id`)
     (
     PARTITION p0r0catch VALUES IN ((0,0)),
     PARTITION p0r1catch VALUES IN ((0,1)),
     PARTITION p0r2catch VALUES IN ((0,2)),
     PARTITION p0r3catch VALUES IN ((0,3)),
     PARTITION p1r0 VALUES IN ((1,0)),
     PARTITION p1r1 VALUES IN ((1,1)),
     PARTITION p1r2 VALUES IN ((1,2)),
     PARTITION p1r3 VALUES IN ((1,3))
     );

+-------------+----------------+---------+----------+
| Table       | AVG_ROW_LENGTH | Data MB | Index MB |
+-------------+----------------+---------+----------+
| solo_scores | 178            | 1635    | 342      |
+-------------+----------------+---------+----------+

MySQL root@(none):osu> explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75 and preserve = 1\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p1r3
type          | ref
possible_keys | user_ruleset_id_index,beatmap_id
key           | user_ruleset_id_index
key_len       | 6
ref           | const,const
rows          | 1
filtered      | 5.0
Extra         | Using index condition; Using where

ALTER TABLE `solo_scores`
    DROP INDEX `user_ruleset_id_index`,
    ADD KEY `user_id_index` (`user_id`,`id` DESC);

+-------------+----------------+---------+----------+
| Table       | AVG_ROW_LENGTH | Data MB | Index MB |
+-------------+----------------+---------+----------+
| solo_scores | 178            | 1635    | 342      |
+-------------+----------------+---------+----------+

explain select * from solo_scores where user_id = 4937439 and ruleset_id = 3 and beatmap_id = 75;

Check why we have a beatmap_id index

Notes


MySQL root@(none):osu> explain select id from solo_scores where user_id = 19743981 and ruleset_id = 0 order by id desc limit 10\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p0r0,p0r1,p0r2,p0r3,p1r0,p1r1,p1r2,p1r3
type          | index
possible_keys | user_id_index
key           | PRIMARY
key_len       | 16
ref           | <null>
rows          | 64
filtered      | 1.56
Extra         | Using where; Backward index scan

MySQL root@(none):osu> explain select id from solo_scores where user_id = 19743981 and ruleset_id = 0 and preserve in (0,1) order by id desc limit 10\G;
***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | solo_scores
partitions    | p0r0,p1r0
type          | index
possible_keys | user_id_index
key           | PRIMARY
key_len       | 16
ref           | <null>
rows          | 56
filtered      | 0.35
Extra         | Using where; Backward index scan

+-----------------+----------------+---------+----------+
| Table           | AVG_ROW_LENGTH | Data MB | Index MB |
+-----------------+----------------+---------+----------+
| solo_scores     | 300            | 2835    | 369      |
| solo_scores_p   | 178            | 1780    | 359      |
| solo_scores_p_r | 178            | 1782    | 359      |
+-----------------+----------------+---------+----------+

(fixed by running ALTER TABLE solo_scores FORCE...)

peppy commented 1 year ago

Based on investigations above, the change in structure I'm looking to apply is:

--- old.sql 2023-09-01 18:17:38
+++ new.sql 2023-09-01 18:18:47
@@ -8,9 +8,9 @@
   `preserve` tinyint(1) NOT NULL DEFAULT '0',
   `created_at` datetime NOT NULL,
   `updated_at` timestamp NULL DEFAULT NULL,
-  PRIMARY KEY (`id`,`ruleset_id`,`preserve`,`created_at`),
-  KEY `user_ruleset_id_index` (`user_id`,`ruleset_id`,`id` DESC),
-  KEY `beatmap_id` (`beatmap_id`)
+  PRIMARY KEY (`id`,`preserve`,`created_at`),
+  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
+  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
 /*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,created_at)
 (PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,

This is a first checkpoint, I still have some further smaller changes to test:

TODO:

peppy commented 1 year ago

Updated with TODO changes applied:

--- old.sql 2023-09-01 18:17:38
+++ new.sql 2023-09-01 20:03:04
@@ -6,12 +6,12 @@
   `data` json NOT NULL,
   `has_replay` tinyint(1) DEFAULT '0',
   `preserve` tinyint(1) NOT NULL DEFAULT '0',
-  `created_at` datetime NOT NULL,
-  `updated_at` timestamp NULL DEFAULT NULL,
-  PRIMARY KEY (`id`,`ruleset_id`,`preserve`,`created_at`),
-  KEY `user_ruleset_id_index` (`user_id`,`ruleset_id`,`id` DESC),
-  KEY `beatmap_id` (`beatmap_id`)
+  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
+  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
+  PRIMARY KEY (`id`,`preserve`,`updated_at`),
+  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
+  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
-/*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,created_at)
+/*!50500 PARTITION BY RANGE COLUMNS(`preserve`,`updated_at`)
 (PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,
  PARTITION p1 VALUES LESS THAN (MAXVALUE,MAXVALUE) ENGINE = InnoDB) */
peppy commented 1 year ago

Bonus round: compression testing to ensure we have things fine tuned

Compression testing

Time values are from ALTER. In general this seems very optimised, so the actual overheads for INSERT etc. operations would be higher.

# key_size=16
# Time: 268.081s
-rw-r-----  1 dean  admin  6069157888 Sep  4 12:35 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3179282432 Sep  4 12:35 solo_scores_p#p#p1.ibd

# key_size=8
# Time: 195.931s

-rw-r-----  1 dean  admin  3036676096 Sep  4 11:33 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  1598029824 Sep  4 11:33 solo_scores_p#p#p1.ibd

# key_size=4
# Time: 204.383s

-rw-r-----  1 dean  admin  1606418432 Sep  4 12:20 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin   872415232 Sep  4 12:20 solo_scores_p#p#p1.ibd

# key_size=2
# Time: 249.902s

-rw-r-----  1 dean  admin  1602224128 Sep  4 12:24 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  1019215872 Sep  4 12:24 solo_scores_p#p#p1.ibd

# row_format=compact
# Time: 122.149s

-rw-r-----  1 dean  admin  6069157888 Sep  4 12:35 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3179282432 Sep  4 12:35 solo_scores_p#p#p1.ibd

# row_format=dynamic
# Time: 58.805s

-rw-r-----  1 dean  admin  6480199680 Sep  4 12:42 solo_scores_p#p#p0catch.ibd
-rw-r-----  1 dean  admin  3376414720 Sep  4 12:42 solo_scores_p#p#p1.ibd