vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.61k stars 2.1k forks source link

RFC: Vschema migrations #5032

Open tirsen opened 5 years ago

tirsen commented 5 years ago

We now have ALTER VSCHEMA which is awesome.

On top of that I propose we build a schema migration management system. A schema migration system is a way to structure your schema as a sequence of versioned alterations from the empty schema. This means if you're deploying a version of your app that expects your schema to be at version N but in production you have only applied migrations up to version N-X then you can simply apply migrations (N-X+1, N-X+2, ..., N) to achieve the schema required for the new version of the app.

Schema migration tools are pretty much standard for applications which use databases with schemas these days: https://en.wikipedia.org/wiki/Schema_migration https://martinfowler.com/articles/evodb.html Java: https://flywaydb.org/getstarted/java Ruby on Rails: https://edgeguides.rubyonrails.org/active_record_migrations.html PHP: https://laravel.com/docs/5.8/migrations Django: https://docs.djangoproject.com/en/2.2/topics/migrations/

I propose we store vschema migration state in the topology server and add a few commands to vtctld to query current vschema migration state and idempotently apply vschema migrations.

vtctlclient ApplyVSchemaMigration -version=<version> {-sql=<sql> || -sql_file=<sql file> || -only-record} [-cells=c1,c2,...] [-skip_rebuild] [-dry-run] <keyspace>

If the migration with the specified version has already been executed it fails with an error. If not then the ALTER VSCHEMA in the sql file is executed and if successful makes a record of the completed migration for the given keyspace. Also locks the keyspace such that operators can't execute multiple schema migrations concurrently.

  -version string
        A named version used to identify the migration in the application code
  -only-record
        If set does not run the alter statement, instead just makes a record that it was run.
  -cells value
        If specified, limits the rebuild to the cells, after upload. Ignored if skipRebuild is set.
  -dry-run
        If set, do not save the altered vschema, simply echo to console.
  -skip_rebuild
        If set, do no rebuild the SrvSchema objects.
  -sql alter-statement
        A vschema ddl SQL statement (e.g. add vindex, `alter vschema create vindex hash(id)`, etc)
  -sql_file alter-statement file
        A vschema ddl SQL statement (e.g. add vindex, `alter vschema create vindex hash(id)`, etc)

vtctlclient ListVSchemaMigrations [-cells=c1,c2,...] <keyspace>

Returns a list of all the vschema migrations that have been executed against the given keyspace.

  -cells cells
    The cells to list. If no cells are specified all cells are queried. If multiple cells are queried then we do a consistency check that they have the same state, if they do not have the same state we return an error showing the deltas.

We should also extend vttestserver such that it accepts ALTER VSCHEMA files as migrations in the schema directory.

This is not a complete schema migration management system but it's a good start. Application code can extend this for completion. For example:

To preempt a few questions:

Why not just build all of this in application code? First of all it would be nice if Vitess provides a standard way to do this. Secondly we do want to store the migration state in the topology server and that's not so easy to do for application code. We also want to lock the vschema for concurrent alterations which again is not so easy for application code.

Why do we want to store the vschema migration state in the topology server? Storing the migration state in the same medium as the thing it alters is a good idea since it has the same lifecycle of backups, deletion and so on. Where else would we store it? In one of the databases shards? Which one? What if the database is restored from backup? Now the vschema migration state is out of sync with the vschema. Storing the state in the topology server solves all of this.

tirsen commented 5 years ago

When I think about this some more I'm starting to think maybe it would be better to just have some form of general "metadata" storage that is accessible through the MySQL protocol. The you could solve this at the application/library level instead.

Not sure about the syntax but something but I was just thinking a key-value map per keyspace with support for setting and getting data through vtgate SQL statements.

sougou commented 5 years ago

I like this. I was beginning to think that some of the workflows I'm building will require additional metadata. So, a generic key-value table-like interface should work great. I'm wondering if we could have it as a logical table like information_schema.vitess_metadata.

demmer commented 5 years ago

This sounds good to me too.

I wonder if it makes more sense as part of information_schema.vitess_metadata or instead an analogous vitess_information_schema.metadata? If we did the latter, we could expose other information there, like the shard topology, keyspaces, the current vschema contents (as an alternative to SHOW VSCHEMA), etc.

Basically... all the things we added to SHOW could also be exposed through this.

On Mon, Aug 12, 2019 at 5:08 PM Sugu Sougoumarane notifications@github.com wrote:

I like this. I was beginning to think that some of the workflows I'm building will require additional metadata. So, a generic key-value table-like interface should work great. I'm wondering if we could have it as a logical table like information_schema.vitess_metadata.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

tirsen commented 5 years ago

One idea I had was to just hijack the SET syntax. I.e. something like:

SET @@vitess_metadata.<my key> = ''

But yeah I admit that is a bit lame but it would be extremely easy to implement :-)

A table interface would be one step up for sure but I worry about implementation cost. How complex would that be to build?

Is the information_schema hard standardized? Is it unadvised to add additional tables there?

tirsen commented 5 years ago

We're going with the SET syntax for now.