ppy / osu-infrastructure

37 stars 5 forks source link

Considerations for `solo_scores` table structure / indices #15

Closed peppy closed 6 months ago

peppy commented 10 months ago

For certain tasks like https://github.com/ppy/osu-queue-score-statistics/pull/149, we need the ability to do lookups on (user_id,ruleset_id,beatmap_id). Currently we have no index on this.

For other similar lookups in osu-web, elasticsearch is used. I have concerns for using elasticsearch are reliability and correctness. The latter is important: if an item in elasticsearch has been queued for indexing/reindexing which hasn't yet been completely, incorrect (outdated) data could be returned.

So I'm going to focus on the database side of things firstly.


Adding an index has both performance and data size concerns.

I'll be investigating and testing of structural changes to figure the correct path forward.

### Tasks
- [ ] https://github.com/ppy/osu-infrastructure/issues/16
- [ ] https://github.com/ppy/osu-infrastructure/issues/17
- [ ] https://github.com/ppy/osu-infrastructure/issues/18

Structural changes to be applied from investigations:

--- old.sql 2023-09-01 18:17:38
+++ new.sql 2023-09-01 20:03:04
@@ -6,12 +6,12 @@
   `data` json NOT NULL,
   `has_replay` tinyint(1) DEFAULT '0',
   `preserve` tinyint(1) NOT NULL DEFAULT '0',
-  `created_at` datetime NOT NULL,
-  `updated_at` timestamp NULL DEFAULT NULL,
-  PRIMARY KEY (`id`,`ruleset_id`,`preserve`,`created_at`),
-  KEY `user_ruleset_id_index` (`user_id`,`ruleset_id`,`id` DESC),
-  KEY `beatmap_id` (`beatmap_id`)
+  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
+  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
+  PRIMARY KEY (`id`,`preserve`,`updated_at`),
+  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
+  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
-/*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,created_at)
+/*!50500 PARTITION BY RANGE COLUMNS(`preserve`,`updated_at`)
 (PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,
  PARTITION p1 VALUES LESS THAN (MAXVALUE,MAXVALUE) ENGINE = InnoDB) */
peppy commented 6 months ago

Closing as deployed.

Here's the final table structure, which did change from the version above in some ways. @ppy/team-web will need to be updated in database schemas at some point (minus the partitioning part probably):

CREATE TABLE `scores` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int unsigned NOT NULL,
  `beatmap_id` mediumint unsigned NOT NULL,
  `ruleset_id` smallint unsigned NOT NULL,
  `data` json NOT NULL,
  `has_replay` tinyint(1) DEFAULT '0',
  `preserve` tinyint(1) NOT NULL DEFAULT '0',
  `ranked` tinyint(1) NOT NULL DEFAULT '1',
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `unix_updated_at` int unsigned NOT NULL DEFAULT (unix_timestamp()),
  PRIMARY KEY (`id`,`preserve`,`unix_updated_at`),
  KEY `user_ruleset_index` (`user_id`,`ruleset_id`),
  KEY `beatmap_user_index` (`beatmap_id`,`user_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
/*!50500 PARTITION BY RANGE  COLUMNS(`preserve`,unix_updated_at)
(PARTITION p20231216 VALUES LESS THAN (0,1702598400) ENGINE = InnoDB,
 PARTITION p20231217 VALUES LESS THAN (0,1702684800) ENGINE = InnoDB,
 PARTITION p20231218 VALUES LESS THAN (0,1702771200) ENGINE = InnoDB,
 PARTITION p0catch VALUES LESS THAN (0,MAXVALUE) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (MAXVALUE,MAXVALUE) ENGINE = InnoDB) */