[WIP] RFC: Sharding Recommendations (Tracking Issue)

One of the aims of the vt toolset is to provide information to users to help them shard their Vitess clusters better. This document discusses the different artifacts that will form a sharding recommendation. It will also attempt to identify the low level elements associated with each artifact.

Key artifacts

Essentially the output of a sharding recommendation is a complete VSchema including the keyspace which will be sharded and the unsharded keyspace which will contain the sequence tables and source reference tables. Each table in a sharded keyspace will need their sharding key: which includes the columns which are part of the key and the type of Vindex that maps these columns to a keyspace_id and hence shard.

Non-goal Orthogonally we could also suggest the number of shards based on the size of data and performance parameters. But that is currently not in scope.

Sharding Keys

Properties of a sharding key for a table

High cardinality: data should be uniformly distributed across shards
Query patterns: common columns in filters
Avoid hotspots: Use vindex types to hash or distribute auto-increments / time-stamps
Data locality: join clauses should inform selection of sharding keys on related tables
Application level: in a multi-tenant system tenant related data should reside on the same shard

Inputs

Query Logs
Database schema: PKs, indexes, FKs
Data distribution metrics: row count, selectivity

Workflow

Based on all provided inputs the recommender will suggest one set of artifacts. If we can provide a simulation of this against the query logs using the planner and the suggested sharding strategy then we can perturb the vschema with other possibilities or with user suggested changes and compare.

In an "agentic" approach we can try out different options automatically to refine the vschema further.

It is probably not reasonable to expect a recommendation system to come up with the "ideal" set of artifacts, so the UX should allow users to refine them based on humanint.

Reference Tables

Reference Tables are tables that are replicated across all shards within a keyspace. They are designed to store small, relatively static datasets—such as configuration settings, lookup tables, or enumeration values. Duplicating these tables on every shard reduces the number of cross-shard queries required for queries that join these tables.

The reference table's source-of-truth is in a unsharded keyspace. All DMLs on that table go to the source table and it is expected to be kept updated on each shard using other means: like a VReplication Materialize workflow.

Candidates for reference tables would be:

used in many queries or expensive queries
tables with low row count compared to those in the operational tables
fairly static: low write qps on those tables. This is to minimize inconsistent lookups due to replica lag while replicating the reference data

Lookup Vindexes

Secondary (or lookup) vindexes complement the sharding key by minimizing cross-shard queries when there are expensive queries that have one or more covering columns. Based on the same logic used to identify sharding keys, other candidates for sharding keys that are not chosen as primary vindexes can be recommended as lookup vindexes provided they signficantly increase selectivity.

Each lookup vindex has an additional cost of maintenance on DMLs in the original (owner) table.

vitessio / vt