vitessio / vt

Apache License 2.0
5 stars 2 forks source link

[WIP] RFC: Sharding Recommendations (Tracking Issue) #68

Open rohit-nayak-ps opened 2 days ago

rohit-nayak-ps commented 2 days ago

One of the aims of the vt toolset is to provide information to users to help them shard their Vitess clusters better. This document discusses the different artifacts that will form a sharding recommendation. It will also attempt to identify the low level elements associated with each artifact.

Key artifacts

Essentially the output of a sharding recommendation is a complete VSchema including the keyspace which will be sharded and the unsharded keyspace which will contain the sequence tables and source reference tables. Each table in a sharded keyspace will need their sharding key: which includes the columns which are part of the key and the type of Vindex that maps these columns to a keyspace_id and hence shard.

Non-goal Orthogonally we could also suggest the number of shards based on the size of data and performance parameters. But that is currently not in scope.

Sharding Keys

Properties of a sharding key for a table

Inputs

Workflow

Based on all provided inputs the recommender will suggest one set of artifacts. If we can provide a simulation of this against the query logs using the planner and the suggested sharding strategy then we can perturb the vschema with other possibilities or with user suggested changes and compare.

In an "agentic" approach we can try out different options automatically to refine the vschema further.

It is probably not reasonable to expect a recommendation system to come up with the "ideal" set of artifacts, so the UX should allow users to refine them based on humanint.

Reference Tables

Reference Tables are tables that are replicated across all shards within a keyspace. They are designed to store small, relatively static datasets—such as configuration settings, lookup tables, or enumeration values. Duplicating these tables on every shard reduces the number of cross-shard queries required for queries that join these tables.

The reference table's source-of-truth is in a unsharded keyspace. All DMLs on that table go to the source table and it is expected to be kept updated on each shard using other means: like a VReplication Materialize workflow.

Candidates for reference tables would be:

Lookup Vindexes

Secondary (or lookup) vindexes complement the sharding key by minimizing cross-shard queries when there are expensive queries that have one or more covering columns. Based on the same logic used to identify sharding keys, other candidates for sharding keys that are not chosen as primary vindexes can be recommended as lookup vindexes provided they signficantly increase selectivity.

Each lookup vindex has an additional cost of maintenance on DMLs in the original (owner) table.

frouioui commented 1 day ago

Essentially the output of a sharding recommendation is a complete VSchema including the keyspace which will be sharded and the unsharded keyspace which will contain the sequence tables and source reference tables.

I read this as in the new feature will only suggest two keyspaces (sharded and unsharded). If the MySQL query log given by the end-users uses different logical database (use db1, etc), will it take into account that the users may want to have more than one sharded keyspace?

  • Data distribution metrics: row count, selectivity

How will this be provided by the users?