vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

vtgate crashes during multiple shards executing DDL #7046

Closed inolddays closed 6 months ago

inolddays commented 3 years ago

We executed an ALTER TABLE DDL statement (adding two columns) through vtgate on a keyspace that has 16 shards, and found that the vtgate cluster started crashing. After crashing for nearly 30 minutes, all vtgates became available again. Here is the call stack: [image]

I have discussed this with @sougou.
The problem may occur when a DDL is being applied on multiple shards while the business application runs select * at the same time. vtgate sends the query to the first shard, which reports 5 columns, but another shard returns 4, whereas vtgate expects a uniform number of columns. One solution is to persuade users not to use select * and to rewrite the SQL with an explicit column list, such as select a, b, c; then it won't fail. But sometimes it is not that easy to make users change, so we need to add some protection for this situation. This does not fundamentally solve the problem, but at least vtgate will not crash. Here is another PR that may be related or useful as a reference; I am pasting it here to track it: https://github.com/vitessio/vitess/issues/5572

aquarapid commented 3 years ago

The better solution may be to use a vschema that includes the table column lists with authoritative set to true, and then perform the DDL to add the column against the shards. When the DDL on all the shards is complete, update the vschema to add the new column. In this way, a select * will not expand to additional columns until after the vschema update is done.
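
As a rough sketch of why an authoritative column list helps (this is not Vitess's planner code): `Table` and `expandStar` are hypothetical names, and the sbtest column names are assumed for illustration. When the list is authoritative, `*` can be replaced by explicit column names before the query reaches the shards, so every shard returns the same columns regardless of how far its DDL has progressed.

```go
package main

import (
	"fmt"
	"strings"
)

// Table is a hypothetical, simplified view of one vschema table entry.
type Table struct {
	Name          string
	Columns       []string
	Authoritative bool
}

// expandStar returns the select list to send to the shards. It expands "*"
// only when the column list is authoritative; otherwise "*" is passed through.
func expandStar(t Table) (string, bool) {
	if !t.Authoritative || len(t.Columns) == 0 {
		return "*", false
	}
	return strings.Join(t.Columns, ", "), true
}

func main() {
	// Column names are assumed for illustration only.
	t := Table{Name: "sbtest", Columns: []string{"id", "k", "c", "pad"}, Authoritative: true}
	cols, _ := expandStar(t)
	fmt.Printf("select %s from %s limit 1\n", cols, t.Name)
}
```

The new column only appears in results once it is added to `Columns`, which corresponds to updating the vschema after the DDL has finished on all shards.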

inolddays commented 3 years ago

This might solve the problem. One thing to consider, though: would this approach require too many changes compared with the existing code structure?

inolddays commented 3 years ago

Reproduced it on my laptop, @sougou.
[image] I added a new column directly on the first shard's master MySQL, then executed "select * from sbtest limit 1\G". [image] Debugging the vttablet: [image] [image] The fields and the row length are not equal, because query.plan.fields still holds the old field list.

At first I added two pieces of code to vtgate like this: [image] But it does not work with a DDL such as: alter table sbtest1 add grade1 int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'grade1' after _hd_update_region; because the order of the fields is then not parsed correctly. So I guess PR https://github.com/vitessio/vitess/pull/5572 fixes this problem, but it brings in another flag, and it may not be easy for people to know when it is the right time to set the flag to false. @aquarapid gives a perfect solution, which I also agree with.

Currently, my temporary solution is to bring the "watch_replication_stream" flag back on vttablet. With this flag, vttablet sees DDL changes to the schema, reloads the schema immediately, and clears the query plans. This greatly reduces the likelihood of such a panic (only in the case where the DDL is done through vtgate).

It's important to note that the tests above are all based on 4.0, but I have looked at and compared the latest master branch code, and the same problem exists.
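
The stale-plan effect described above can be sketched as follows. This is a simplified illustration under assumptions, not vttablet's implementation: `planCache`, `fieldsFor`, and `onDDL` are hypothetical names, and the cache stores only a field list per query.

```go
package main

import (
	"fmt"
	"sync"
)

// planCache is a hypothetical cache from query text to the field list
// that was computed when the query was first planned.
type planCache struct {
	mu    sync.Mutex
	plans map[string][]string
}

func newPlanCache() *planCache {
	return &planCache{plans: map[string][]string{}}
}

// fieldsFor returns the cached fields for a query, calling load only on a miss.
func (c *planCache) fieldsFor(query string, load func() []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if f, ok := c.plans[query]; ok {
		return f
	}
	f := load()
	c.plans[query] = f
	return f
}

// onDDL drops every cached plan, so the next lookup sees the new schema.
func (c *planCache) onDDL() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.plans = map[string][]string{}
}

func main() {
	schema := []string{"id", "k"}
	load := func() []string { return append([]string(nil), schema...) }
	c := newPlanCache()
	q := "select * from sbtest limit 1"

	fmt.Println(len(c.fieldsFor(q, load))) // plan built against the old schema
	schema = append(schema, "grade1")      // column added directly in MySQL
	fmt.Println(len(c.fieldsFor(q, load))) // still the old field list: stale plan
	c.onDDL()                              // a DDL was observed: clear the plans
	fmt.Println(len(c.fieldsFor(q, load))) // field list now matches the rows
}
```

Without the invalidation step, the cached field list and the rows coming back from MySQL disagree until the next periodic schema reload.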

sougou commented 3 years ago

Chiming in. We should still fix the panic. If the field length is longer than the number of columns returned, maybe we can pad with nulls, or return an error. Maybe returning an error is better.
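
The two options can be compared in a short sketch. `normalizeRow` is a hypothetical helper written for this example, not an existing Vitess function; a nil entry stands in for SQL NULL.

```go
package main

import "fmt"

// normalizeRow makes a row agree with the expected number of fields.
// When the row is short and pad is true, missing values become nil (NULL).
// In every other mismatch case it returns an error instead of panicking.
func normalizeRow(fieldCount int, row []*string, pad bool) ([]*string, error) {
	switch {
	case len(row) == fieldCount:
		return row, nil
	case len(row) < fieldCount && pad:
		out := make([]*string, fieldCount)
		copy(out, row)
		return out, nil
	default:
		return nil, fmt.Errorf("row has %d values but %d fields were expected", len(row), fieldCount)
	}
}

func main() {
	v := "1"
	row := []*string{&v, &v, &v, &v}

	// Option 1: pad the missing fifth column with NULL.
	padded, _ := normalizeRow(5, row, true)
	fmt.Println(len(padded), padded[4] == nil)

	// Option 2: report the mismatch to the caller.
	_, err := normalizeRow(5, row, false)
	fmt.Println(err)
}
```

Padding keeps the query working but can silently return NULL for a column that has a value on other shards, which is one reason an error may be the safer choice.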

xhh1989 commented 3 years ago

vtgate is a gateway cluster that supports multi-tenancy. Some of our gateways are used by hundreds of applications. If one application hits a problem like this and causes the gateway cluster to crash, hundreds of applications will be unavailable. That would be a very serious incident. I think the priority should be P0.

sougou commented 3 years ago

This should only be an internal panic. I don't think vtgate actually crashes at this point.

inolddays commented 3 years ago

vtgate does crash. How long this lasts depends on how long vttablet takes to reload the schema on its own; the default is 30 minutes. In other words, if all applications use the same vtgate cluster, vtgate will not be available until 30 minutes later. During this time, whenever there is a query like "select *", the vtgate cluster will keep crashing.

harshit-gangal commented 6 months ago

We now have schema tracking in vtgate, which should solve this, as it does star expansion for sharded queries.