vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.53k stars 2.09k forks source link

queries: prune per-shard SELECT queries with multi-column IN #12015

Open derekperkins opened 1 year ago

derekperkins commented 1 year ago

in an older version of Vitess (v5 era), we avoided a scatter query when using a multi-column IN clause by extracting out the sharding key into a separate single IN. e.g.

select * from urls where domain_id IN (11111, 22222) AND (domain_id, url) IN ((11111, 'abc'), (22222, 'def'));

Revisiting this in v15.0.2, support has improved somewhat, but still has room for improvement. The route operator is MultiEqual instead of IN, which isn't clear from the docs whether it is better/same/worse than the IN variant (see https://github.com/vitessio/vitess/pull/7049) . Digging in a little deeper, we can see that the single column IN is correctly pruned per partition, while the multi-column IN just sends the entire query to each shard.

We often have queries with hundreds of multi-column IN values, and where vtgate is already correctly pruning single column values, it would be more performant to also prune multiple column values. This was particularly bad a while ago when we were testing out MyRocks, which IIRC is particularly slow on non-existent key lookup.

Some form of this exists already for DML, since multi-value inserts are pruned per shard, but it seems likely that is handled in a different section of the codebase.

with single column IN

explain format=vitess select * from urls where domain_id IN (11111, 22222) AND (domain_id, url) IN ((11111, 'abc'), (22222, 'def'));

explain format=vtexplain select * from urls where domain_id IN (11111, 22222) AND (domain_id, url) IN ((11111, 'abc'), (22222, 'def'));
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+
| operator | variant    | keyspace | destination | tabletType | query                                                                                                       |
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Route    | IN         | domains  |             | UNKNOWN    | select url_id, domain_id, host_id, url from urls where domain_id in ::__vals and (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+

+---+----------+-------+--------------------------------------------------------------------------------------------------------------------------------------+
| # | keyspace | shard | query                                                                                                                                |
+---+----------+-------+--------------------------------------------------------------------------------------------------------------------------------------+
| 0 | domains  | c0-e0 | select url_id, domain_id, host_id, url from urls where domain_id in (22222) and (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
| 1 | domains  | 20-40 | select url_id, domain_id, host_id, url from urls where domain_id in (11111) and (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
+---+----------+-------+--------------------------------------------------------------------------------------------------------------------------------------+

without single column IN

explain format=vitess select * from urls where (domain_id, url) IN ((11111, 'abc'), (22222, 'def'));

explain format=vtexplain select * from urls where (domain_id, url) IN ((11111, 'abc'), (22222, 'def'));
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+
| operator | variant    | keyspace | destination | tabletType | query                                                                                                       |
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Route    | MultiEqual | domains  |             | UNKNOWN    | select url_id, domain_id, host_id, url from urls where (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
+----------+------------+----------+-------------+------------+---------------------------------------------------------------------------------------------------------------------------------------+

+---+----------+-------+-------------------------------------------------------------------------------------------------------------+
| # | keyspace | shard | query                                                                                                       |
+---+----------+-------+-------------------------------------------------------------------------------------------------------------+
| 0 | domains  | c0-e0 | select url_id, domain_id, host_id, url from urls where (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
| 1 | domains  | 20-40 | select url_id, domain_id, host_id, url from urls where (domain_id, url) in ((11111, 'abc'), (22222, 'def')) |
+---+----------+-------+-------------------------------------------------------------------------------------------------------------+

Related:

Side notes:

cc @vitessio/query-serving

systay commented 1 year ago

Hi @derekperkins!

  • There are docs for format=vtexplain (link), but not format=vitess

In V16, we are replacing format=vitess with vexplain. That is better documented: https://vitess.io/docs/16.0/user-guides/sql/vexplain/

Unfortunately, we are not likely to prioritize updating docs on a deprecated command on the V15 docs.

If there is anything in particular you are curious about, feel free to ping me on Slack