skeema / tengo

Go La Tengo: a MySQL automation library
Apache License 2.0
27 stars 19 forks source link

added ordinal_position in projection of foreign keys query #14

Closed harshit-gangal closed 3 years ago

harshit-gangal commented 3 years ago

This is needed in Vitess project to plan the order by clause in distributed mysql system. By adding order by columns in projection Vitess planner is able to plan the query otherwise it throws an error.

evanelias commented 3 years ago

Thank you for the pull request! I am excited about the potential of having Skeema be usable with Vitess.

I have two requests and one question:

harshit-gangal commented 3 years ago

The current fix will get past the Vitess planner. I will incorporate the comments.

  • Do you know yet if there are other incompatible queries/functionality, or if this is the only one? More broadly, if you could share any preliminary info on how Skeema and Vitess would potentially interact, that would be helpful. On a technical basis, currently Skeema has a lot of logic which assumes that it is speaking to each shard directly, and I gather this may be problematic with Vitess. And on a logistical basis, unfortunately I'm not able to spend much time working on open source right now, so I'm hoping to get a better sense of scope here. Thanks again!

As per what I have looked at the code, skeema heavily relies on information_schema responses. This ought to be problem with Vitess. Vitess tries to give you the illusion of keyspaces as database to the app but internally there can be multiple schema like system schema and some Vitess internal schemas. Also, the actual database name can be different from the keyspace name. Due to all response taken from information_schema that illusion is broken and it causes the skeema to query on databases not known to VTGate (Vitess routing component).

There are two option I can think of

  1. Skeema lives inside the VTTablet (sidecar to mysql) that handles the schema changes for Vitess. VTTablet then exposes an api that VTGate uses.
  2. Skeema changes the way it queries the database (show queries over information_schema queries). Most importantly allow filtering of schemas it want to watch to avoid getting information about the Vitess internal schemas.

Also, I see that Skeema uses dbname at connection time, instead if it adds qualifier to the tablename that will make Vitess route query in better way.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.006%) to 93.252% when pulling 77ac092754a0806f25cb22dffeb2e9afb55f76ae on planetscale:orginal-pos-foreign-keys into d21902f025414cf1abb6b2d1ae2e287c13ad7cbc on skeema:main.

mavenraven commented 3 years ago

Hi @evanelias! Just for context, I'm working on the product that requested skeema support from the Vitess team. It turns out that there's not anything skeema need to do to support our specific use case. We won't ever have skeema talk directly to Vitess. Instead, we'll keep the canonical desired state of the schema in source control, and diff changes based on that. Then, we will apply those changes directly to Vitess.

The big UX improvment skeema could make in that flow is to allow skeema init talk to vtgate. Currently, you have to go the underlying mysql server, or write it out manually. Totally doable, but it could be nicer.

Long term, I would be curious about first class integration of skeema into Vitess. There's already experimental support for gh-ost via the following:

vtctlclient ApplySchema -ddl_strategy "gh-ost" -sql "ALTER TABLE demo MODIFY id bigint UNSIGNED" commerce

but we're sending imperative schema changes instead of desired state. Maybe the gh-ost functionality would be more robust with a view into the desired schema? @shlomi-noah

Anyways, thank you very much for building skeema, it's an incredible tool!

evanelias commented 3 years ago

Thanks @harshit-gangal and @mavenraven.

Due to all response taken from information_schema that illusion is broken and it causes the skeema to query on databases not known to VTGate (Vitess routing component).

I'm not really sure if there's anything Skeema can do about this. Skeema is designed to talk directly to the database. Middleware components could rewrite the queries and responses, but that's basically outside of Skeema's scope.

  1. Skeema lives inside the VTTablet (sidecar to mysql) that handles the schema changes for Vitess

This is the best long-term path from my point of view. I eventually hope to build a daemonized agent version of Skeema, responsible for automatically detecting merged changes to a git schema repo, and applying the corresponding schema changes to a single mysqld when relevant (there would be a service discovery mapping so each skeema agent "knows" which part of the schema repo applies to the local mysqld). Conceptually, the agent/sidecar model for automatic schema change application worked extremely well at massive scale at Facebook. However, implementing this generically would be a large undertaking.

  1. Skeema changes the way it queries the database (show queries over information_schema queries).

Skeema actually uses both information_schema queries as well as SHOW queries. Some information is only obtainable from one or the other, so it isn't possible to exclusively just use one of these. Skeema also has an important safety-check where it confirms that it can fully rebuild the SHOW CREATE TABLE text exactly from its introspection of table metadata. This is how it auto-detects existence of table features it doesn't support yet, such as sub-partitioning.

Most importantly allow filtering of schemas it want to watch to avoid getting information about the Vitess internal schemas.

This is already supported; please see the ignore-schema regex option, as well as the advanced forms of the schema option.

Also, I see that Skeema uses dbname at connection time, instead if it adds qualifier to the tablename that will make Vitess route query in better way.

This would be a somewhat complicated change. It would need to be optional and disabled by default, since otherwise it would definitely break some known use-cases.

Instead, we'll keep the canonical desired state of the schema in source control, and diff changes based on that. Then, we will apply those changes directly to Vitess.

To be clear, does this mean you won't ever be using skeema push against real databases?

The big UX improvment skeema could make in that flow is to allow skeema init talk to vtgate

I'm not really sure what that would entail, but just to set expectations, I suspect this would likely require either a third-party code contribution or corporate sponsored development. I'm dividing most of my time between working on the commercial edition of Skeema, and consulting engagements (mostly database performance consulting). While I hope to return to building open source Skeema features at some point in the future, that will be dependent on the outcome of the commercial products and whether that allows the project to be sustainable.

shlomi-noach commented 3 years ago

Hey @evanelias ! :wave: I'll begin with the real important thing which is your well being:

I suspect this would likely require either a third-party code contribution or corporate sponsored development. I'm dividing most of my time between working on the commercial edition of Skeema, and consulting engagements

Absolutely clear. we will look internally into that; we can provide the contributions, and make them backwards compatible, but I also know this sometimes still requires considerable effort from the maintainer's side (will you be happy then maintaining the change? does it add complexity? sheer review effort; etc.). Please let us know if this still makes sense to you, we can discuss in private channels as needed.

To be clear, does this mean you won't ever be using skeema push against real databases?

There's actually two use cases here. One is @mavenraven's, another is something I work on, internally. @mavenraven's use case is more of the traditional application developer who wants schema versioning for their database (but with Vitess as backend), and they're looking for something similar to skeefree logic, if you remember. Mine is more about general purpose automation and the dust hasn't settled yet, and is a longer term work. I think in both cases skeema won't speak directly to production. If it does, then it will connect to vtgate and issue what it thinks to be direct schema changes, where vtgate will in fact translate those to online DDL and distribute to underlying shards for self-scheduling and execution.

I'm pretty sure we will not let skeema understand about shards, because of how vitess owns topics like resharding; so we'd want to pass requests through vitess in some way or another.

The idea of skeema on each vttablet is something I haven't considered yet; you say this worked extremely well at Facebook; I'd be happy to pick your brain to understand better how you manage parallel developments and possibly concurrent schema changes. Does each endpoint/sidecar runs a git pull against some repository?

but we're sending imperative schema changes instead of desired state. Maybe the gh-ost functionality would be more robust with a view into the desired schema?

@mavenraven I guess the previous paragraph applies here, too: at some point, we want to be able to say "this is the change implied by this developer" and we need to make sure two different developers do not step on each other's toes. To explain, imagine we have this table t(id int), one developer adds column a: t(id int, a int), one developer adds column b: t(id int, b int). Say developer 1 makes it first. Now, as developer's 2 schema is to be applied, it seems like to apply it we should drop column a, which is incorrect. To fix that, we need to either enforce developer's branches to always update main branch before applying, or to evaluate their changes based on base...head comparison (the flow you are looking into now). Either way, the way I see it, running that, and skeema, on the sidecars (vttablet or whatever) is very far (too far?) from the developer's code. Either sidecars have to be super smart about the owning git repo (or multiple repos!), or, someone needs to make the call for them, in which case why not make the call for the imperative change anyhow.

evanelias commented 3 years ago

I think in both cases skeema won't speak directly to production. If it does, then it will connect to vtgate

Ah, I see. In this case, I'm sorry but it won't make sense for me to proceed with this effort. Non-idiomatic use-cases like this have repeatedly caused support headaches for me, and I don't have the bandwidth for more of them at this time.

Skeema is designed to natively support arbitrary sharding schemes. However, it operates with the assumption that it can talk directly to each MySQL shard, as this has been a safe assumption for automation tools in every sharded MySQL environment I've worked on over the past decade. If Vitess's model simply does not permit this, I don't think Skeema can be compatible enough to be used in an idiomatic way, at least in Skeema's current form. I do plan to add lower-level "plumbing" commands in the future, basically de-composing skeema push into a number of separate smaller operations, which will enable companies with complex requirements to build more customized automation pipelines. However, since this functionality is only needed by larger companies, and will involve more extensive technical support, it is planned to be a paid/commercial only feature.

Anyway, if I understand correctly, it sounds like you're mainly hoping to use Skeema only for its declarative diff / DDL generation logic and not the majority of its overall functionality. If so, a custom Vitess-native solution probably makes more sense here than trying to shoe-horn Skeema in a way that it isn't intended to be used. Since Vitess already contains an excellent SQL parser, and already has its own internal representations of various DB concepts, it may be more straightforward to use those as the basis for native schema management in Vitess.

mavenraven commented 3 years ago

Thanks @harshit-gangal and @mavenraven.

Due to all response taken from information_schema that illusion is broken and it causes the skeema to query on databases not known to VTGate (Vitess routing component).

I'm not really sure if there's anything Skeema can do about this. Skeema is designed to talk directly to the database. Middleware components could rewrite the queries and responses, but that's basically outside of Skeema's scope.

  1. Skeema lives inside the VTTablet (sidecar to mysql) that handles the schema changes for Vitess

This is the best long-term path from my point of view. I eventually hope to build a daemonized agent version of Skeema, responsible for automatically detecting merged changes to a git schema repo, and applying the corresponding schema changes to a single mysqld when relevant (there would be a service discovery mapping so each skeema agent "knows" which part of the schema repo applies to the local mysqld). Conceptually, the agent/sidecar model for automatic schema change application worked extremely well at massive scale at Facebook. However, implementing this generically would be a large undertaking.

+1, I think this model makes the most sense long term. Since Vitess wants to own that mapping, it would want to learn about declarative schema as a first class thing. That would be unrelated to any work that might be done on skeema to make that possible thought.

To be clear, does this mean you won't ever be using skeema push against real databases?

Correct.

The big UX improvment skeema could make in that flow is to allow skeema init talk to vtgate

I'm not really sure what that would entail, but just to set expectations, I suspect this would likely require either a third-party code contribution or corporate sponsored development. I'm dividing most of my time between working on the commercial edition of Skeema, and consulting engagements (mostly database performance consulting). While I hope to return to building open source Skeema features at some point in the future, that will be dependent on the outcome of the commercial products and whether that allows the project to be sustainable.

Oh yes, I didn't expect you [or anyone else] to build this. Just adding some UX context to this thread for people who try to use Vitess with skeema in the future.

harshit-gangal commented 3 years ago

I am going to close this request.

mavenraven commented 3 years ago

@mavenraven I guess the previous paragraph applies here, too: at some point, we want to be able to say "this is the change implied by this developer" and we need to make sure two different developers do not step on each other's toes. To explain, imagine we have this table t(id int), one developer adds column a: t(id int, a int), one developer adds column b: t(id int, b int). Say developer 1 makes it first. Now, as developer's 2 schema is to be applied, it seems like to apply it we should drop column a, which is incorrect. To fix that, we need to either enforce developer's branches to always update main branch before applying, or to evaluate their changes based on base...head comparison (the flow you are looking into now). Either way, the way I see it, running that, and skeema, on the sidecars (vttablet or whatever) is very far (too far?) from the developer's code. Either sidecars have to be super smart about the owning git repo (or multiple repos!), or, someone needs to make the call for them, in which case why not make the call for the imperative change anyhow.

Oh yeah, I didn't think about this at all. With skeefree, it's sidestepped, but I guess it wouldn't be if it's a first class Vitess featue.