vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.47k stars 2.09k forks source link

RFC: improved information schema results #12693

Open systay opened 1 year ago

systay commented 1 year ago

In relational databases, the information schema (information_schema, or shorter, I_S) is an ANSI-standard set of read-only views that provide information about tables, views, columns, procedures and more in a database system.

Purpose of this RFC

This RFC will try to describe the intended approach for improved I_S routing, and point out problems with this approach. Specifically, I'm trying to solve two problems with todays solution:

  1. Multiple keyspaces. Since a Vitess system can have multiple keyspaces, and these can be backed by mysql instances that know nothing about the other keyspaces, no one keyspace will contain information about all tables in the Vitess system. It is not possible to get query results from multiple keyspaces today.

  2. Database names. From the users perspective, a keyspace is equivalent to a mysql database. A keyspace named XXX can be backed by a mysql database named YYY. Today, we return the whatever the underlying mysql information_schema has, which might not be the same as the keyspace name.

Suggestion

Individual view routing

For every I_S view being queried, we need to figure out which of three situations we are dealing with.

  1. This is a view that does not contain any column with schema information in it. Example query:
select * 
from information_schema.character_sets
  1. This is a view that does have a column we can use to see which keyspace every row belongs to, and we have a predicate that we can use to figure out where we need to send the query. Example query:
select table_name, column_name 
from information_schema.columns 
where table_schema = 'user'
  1. Same type of view, but now we are missing a predicate we can use to route the query to the right keyspace. Example query:
select table_name, column_name 
from information_schema.columns 
where column_name = 'name'

Approach depending on type of view

  1. For views of type 1, we are free to send the query to any keyspace. These views are expected to contain the same information across all MySQL installations.
  2. The second type is also pretty straightforward - thanks to the predicate, we know which keyspace we want to send the query to.
  3. The last one is the problematic one. Since this is a view with information about all keyspaces, we need to send the query to all keyspaces. We can't just pick one - the results would be missing the information from the other keyspaces. The suggested solution is to do a UNION between the results from all keyspaces. This UNION is done on the vtgate, not sent down to any MySQL instance.

Given a system that has two keyspaces (ks1 and ks2). Example plan for query #3 above:

(ks1) select table_name, column_name from information_schema.columns where column_name = 'name'
UNION ALL
(ks2) select table_name, column_name from information_schema.columns where column_name = 'name' and TABLE_SCHEMA = 'ks2`

In the query above, the first SELECT is sent to ks1, and it will contain information about ks1, but also about all the default databases installed by default: information_schema, mysql, performance_schema andsys.

The second SELECT is sent to ks2, and now we are only interested in information about ks2. All the default stuff was already covered by the first query.

Handling database name != keyspace name

In Vitess, you can use a different name for the MySQL database and the keyspace. Not uncommon is that the keyspace user is stored in the database vt_user. To make sure the user sees the keyspace name, we have to use a little logic.

Given a query such as:

select 
  table_name, 
  column_name, 
  table_schema
from information_schema.columns 
where column_name = 'name'

The user wants to see the contents of the TABLE_SCHEMA column, which will contain the database name and not the keyspace name. What we would do is, instead of just using the table_schema column, we construct a CASE like so:

select 
  table_name, 
  column_name, 
  case
    when table_schema = 'vt_user' then 'user' 
    else table_schema
  end as table_schema
from information_schema.columns 
where column_name = 'name'

This is the query we would use for the first of the SELECTs we UNION together. For the rest of them, we have the keyspace name already, and can inject it as a literal instead of using the value from the I_S:

select 
  table_name, 
  column_name, 
  'ks2' as table_schema
from information_schema.columns 
where column_name = 'name'

Merging of I_S views

One important decision the planner does is to decide which tables can be fetched together from a single route. This is called merging routes. My suggestion would be that we merge the first and second type of views, but not the third type of queries. To spell it out:

  1. Type 1 views can be merged with any other type 1 view
  2. Type 2 views can be merged with other type 2 views aimed at the same keyspace.
  3. Type 3 cannot be merged with any other.

Possible problems

mcrauwel commented 1 year ago

I'm going to make the problem just a tad more complex. You should also be able to filter out (materialized) reference tables... They will otherwise show up in every keyspace...

shlomi-noach commented 1 year ago

I like everything suggested in the original comment.

There are more complex scenarios:

If this is helpful, here are some complex examples for querying information_schema. These are all described as views, but of course for most of them you can just extract the view's SELECT query as reference:

etc.

systay commented 1 year ago

@shlomi-noach, thanks for the links, I'll read up and make sure we support these.

  • any custom query the user may run on information_schema, including JOINs and subselects. You need to potentially UNION ALL some of these?

This is not a problem. This is kind of what I was talking about under the "Merging of I_S views". The planner plans the routing per table listed in the SELECT query, and pushes down joins and subqueries to the same query if they can be merged, and otherwise does the joins and subqueries on the vtgate.

  • What if the user creates a VIEW that reads from information_schema? Is that automatically covered by the above?

Since we are handling views by just turning them into a derived table, this approach should work well for views as well.

systay commented 1 year ago

Added two possible problems - MoveTables and reference tables

GuptaManan100 commented 1 year ago

Love the proposal. I 💯 % agree with all the points.