Closed — @keweishang closed this issue 3 years ago.
Maybe @rohit-nayak-ps has some idea of the error? Seems related to schema versioning.
Also, during a long-running gh-ost job, if I create a new vstream gRPC client that subscribes to the current vgtid (gtid set to `current`), the vstream gRPC client receives an error:
UNKNOWN: target: test_sharded_keyspace.-80.replica, used tablet: zoneA-201 (prelive-ib-tablet-201.vt): vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ faa85f08-2c16-11eb-ac78-060146dd04aa:1-2930879,facea7d1-2c16-11eb-8519-0243e15aa530:1: unknown table _19344c31_30ce_11eb_98d2_029e8414c92c_20201127163231_gho in schema
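The `_..._gho` table in that error is one of gh-ost's internal tables: gh-ost copies rows into a ghost table (suffix `_gho`, with a `_ghc` changelog table alongside, and `_del` for the renamed-away original) and swaps it in at cut-over. A consumer that wants to recognize these artifacts by name could use a sketch like the following; the suffix list here is an assumption covering gh-ost's `_gho`/`_ghc`/`_del` conventions plus pt-osc's `_new`/`_old`, so adjust it to your own naming settings:

```python
import re

# gh-ost internal tables: _<orig>_gho (ghost), _<orig>_ghc (changelog),
# _<orig>_del (old table renamed away). pt-osc uses _<orig>_new / _<orig>_old.
# The timestamped form in the error above also ends in one of these suffixes,
# so a simple suffix check is enough for this illustration.
GHOST_TABLE_RE = re.compile(r"^_.*_(gho|ghc|del|new|old)$")

def is_internal_table(name: str) -> bool:
    """Return True if the table name looks like an online-DDL artifact."""
    return bool(GHOST_TABLE_RE.match(name))

if __name__ == "__main__":
    print(is_internal_table(
        "_19344c31_30ce_11eb_98d2_029e8414c92c_20201127163231_gho"))  # True
    print(is_internal_table("bar_entry"))  # False
```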
I tried reproducing this with both sharded and unsharded keyspaces with the table being continuously populated with data so that the gh-ost alter table is long-running. No luck.
I also notice that there are some changes in the gh-ost based functionality. Not sure if it impacts the bug you are encountering but was wondering if it is possible for you to check if the same problem happens on master.
The latest code changes the way you invoke gh-ost like:
ApplySchema -ddl_strategy "gh-ost" -sql "ALTER table customer add column x1 int default 0" customer
I know you mentioned this only happens with huge tables; I was testing with ~100k rows. I will start populating a larger one locally for testing this ...
@rohit-nayak-ps thanks for replying.
I managed to reproduce the following error ("unknown table" error) every time locally (table size 500k rows), with GA v8.0.0.
Also during a long-running gh-ost job, if I create a new vstream gRPC client that subscribes to the current vgtid (gtid set to current), the vstream gRPC client receives an error:
UNKNOWN: target: test_sharded_keyspace.-80.replica, used tablet: zoneA-201 (prelive-ib-tablet-201.vt): vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ faa85f08-2c16-11eb-ac78-060146dd04aa:1-2930879,facea7d1-2c16-11eb-8519-0243e15aa530:1: unknown table _19344c31_30ce_11eb_98d2_029e8414c92c_20201127163231_gho in schema
~~Interestingly, this error only happens if my vstream client subscribes to the REPLICA tablet type; if I switch to subscribing to the MASTER tablet type, there is no problem so far.~~ I actually managed to reproduce this error with vstream subscribing to both the MASTER and REPLICA tablet types.
Adding @rgibaiev to follow this issue as well.
I also managed to reproduce the following error ("columns and values mismatch" error) repeatedly with GA v8.0.0:
target: test_sharded_keyspace.80-.master, used tablet: zone1-300 (0297c7837f92): vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 1ba0bed0-3332-11eb-b9d0-0242ac110002:1-1045: cannot determine table columns for bar_entry: event has [8 254 17 17 8 8 8 15 246 254 246 1 2 246 3 3], schema as [...basically one less column...]
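The bracketed numbers in that message are MySQL binary-log column type IDs (values of the server's `MYSQL_TYPE_*` enum), one per column in the row event; the mismatch means the event carries a different column count than vstreamer's cached schema (here, the event includes the newly added column while the cached schema has one less). An illustrative decoder, using a hand-written partial map of the standard type IDs:

```python
# Partial map of MySQL binary-log column type IDs (MYSQL_TYPE_* enum values).
MYSQL_TYPES = {
    1: "TINY", 2: "SHORT", 3: "LONG", 8: "LONGLONG", 15: "VARCHAR",
    17: "TIMESTAMP2", 246: "NEWDECIMAL", 252: "BLOB", 254: "STRING",
}

def decode_types(type_ids):
    """Translate binlog column type IDs into readable type names."""
    return [MYSQL_TYPES.get(t, f"UNKNOWN({t})") for t in type_ids]

# The 16 type IDs from the row event in the error above:
event_types = [8, 254, 17, 17, 8, 8, 8, 15, 246, 254, 246, 1, 2, 246, 3, 3]
print(len(event_types))           # the event's column count...
print(decode_types(event_types))  # ...versus one less in the cached schema
```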
The steps are:

1. Start a VStream client subscribing to the REPLICA tablet type of the sharded keyspace.
2. Run `vtctlclient -server localhost:15999 ApplySchema -sql "ALTER WITH 'gh-ost' TABLE bar_entry add column status int" test_sharded_keyspace`.
3. While the gh-ost job is in `running` status for all shards, start a 2nd VStream client subscribing to the current position of the REPLICA tablet type of the keyspace, and a 3rd VStream client subscribing to the current position of the MASTER tablet type of the keyspace. There is a chance that one of the 2 VStream clients fails with the "unknown table" error.
4. Once the gh-ost job is in `complete` status for all shards (within maybe 20 seconds), start a 4th VStream client subscribing to the current position of the REPLICA tablet type of the keyspace, and a 5th VStream client subscribing to the current position of the MASTER tablet type of the keyspace. Insert a few rows into the table; there is a chance that the VStream client subscribed to the MASTER tablet type fails with the "columns and values mismatch" error.

I'll test the master branch tomorrow, as you suggested @rohit-nayak-ps.
@keweishang, I was able to repro on master branch as well, so no need to test on it! For me too I was able to get it only while pointing to replica and not to master, but it might be a race. As you suspected, the schema is not getting reloaded correctly by vstreamer after the gh-ost operation completes. Sugu suggested gh-ost might be explicitly reloading schema on master (where gh-ost runs), so we don't see the error there. Will let you know once we have progress.
@rohit-nayak-ps, it's great that you can reproduce the errors on your side now. For me, both the "unknown table" and the "column mismatch" errors also happened when pointing vstream to master.
Sure. Keep me updated here and let me know if you need any further information.
@rohit-nayak-ps happy to look into reloading the schema after gh-ost (or pt-osc) completes.
Quick update from discussing with @rohit-nayak-ps: we will seek a way to trigger ReloadSchemaShard (or equivalent) from the TabletServer that runs the migration. The issue is that we need to invoke the reload not only on the master (almost trivial) but on all shard tablets, and the TabletServer doesn't have a gRPC mechanism to communicate directly with other tablets.
Thanks for the update, @shlomi-noach. So you mean ReloadSchemaShard needs to be triggered in all 3 TabletServers (in the case of a shard having 1 MASTER tablet, 1 REPLICA tablet, and 1 RDONLY tablet)? Do you have some potential solutions in mind? For example, if there's no gRPC between TabletServers, can they communicate via etcd?
> So you meant ReloadSchemaShard needs to be triggered in all 3 TabletServers

Yes, assuming I understand correctly; specifically, we need to reload on the replica where vstream runs.
@rohit-nayak-ps has a workaround meanwhile, I'll update soon.
The workarounds I had discussed (while we wait for an automatic schema load post-migration) are:
- Manually run `vtctl ReloadSchemaKeyspace <keyspace>` on the command line, which forces all tablets in that keyspace to do a schema reload.
- Run a tracker, which runs a vstream for schema tracking (and, as a side-effect, reloads the keyspace schema when it encounters a DDL). Since you are already running vstreams this does not apply. In any case, as I mentioned in a previous comment, there seems to be a bug where vstreams are NOT reloading the schema when a gh-ost rename occurs. Hope to make progress on this tomorrow.
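The manual-reload workaround can be scripted around the migration: wait until the OnlineDDL job reports `complete` on every shard, then force the keyspace-wide reload before starting any fresh VStream clients. A minimal sketch, assuming a vtctld at `localhost:15999` (hypothetical address; the commands themselves are the ones mentioned in this thread):

```python
import subprocess

VTCTL_SERVER = "localhost:15999"  # hypothetical vtctld address

def all_shards_complete(statuses):
    """True once every shard's gh-ost migration status is 'complete'."""
    return bool(statuses) and all(s == "complete" for s in statuses)

def reload_keyspace_schema(keyspace):
    """Workaround: force all tablets in the keyspace to reload their schema."""
    subprocess.run(
        ["vtctlclient", "-server", VTCTL_SERVER,
         "ReloadSchemaKeyspace", keyspace],
        check=True,
    )

if __name__ == "__main__":
    # Per-shard statuses would come from `OnlineDDL <keyspace> show recent`;
    # call reload_keyspace_schema(...) only once they are all 'complete'.
    print(all_shards_complete(["running", "complete"]))  # False
```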
Sorry, but I am not able to repro this anymore. I have been testing for a while now using this setup:
- Ran the local example to the end, except the 401 teardown script, so that the resharded customer keyspace is live.
- Populated the customer table with ~15 million rows, using the populate.go script in this gist: https://gist.github.com/rohit-nayak-ps/f8356ea2b9862b1d8cb1c3f2266265ec
- Ran multiple gh-ost migrations adding new columns / altering column types, like: `ApplySchema -ddl_strategy "gh-ost" -sql "alter table customer add x1 int not null default 0" customer`
- While gh-ost was running, ran three instances of the vstream_client.go script.
- The vstream clients did not error out while running continuously, or when started again either with `current` or from the beginning.
Not sure how this is different from your setup. Since it is happening consistently for you @keweishang, it would be great if you could repro using this same setup (with any mods needed to recreate the bug), since that makes it easier for any of us to debug. I am running this on the current master (though I don't think we have new code that could have fixed this error).
Hi @rohit-nayak-ps, sure, I'll try and use your setup to reproduce the error. Will keep you updated this week.
Hi @rohit-nayak-ps, sorry for the delay. Based on the GA 8.0.0 docker image, I can reliably reproduce the errors. I've created a public repo with a README that has the steps to reproduce them: https://github.com/keweishang/schema_reload_error_test
Let me know if you manage to reproduce the error with the above repo setup. Thanks.
@keweishang , thanks for the great test repo. I was able to reproduce the "cannot determine table columns" issue, even with the latest code. The issue with the internal tables created by gh-ost has been resolved in #7159, so it doesn't appear now.
The cause is:

There are several ways a vttablet's cached schema gets reloaded: (1) the schema tracker (`-track_schema_versions=true`), (2) an already-running vstream reloading the schema when it encounters a DDL, (3) vttablet's periodic automatic reload, and (4) a manual `ReloadSchemaKeyspace`. The default is to not run the tracker, so #1 doesn't apply. When #2 is also not applicable, i.e. when we call the VStream API only after the migration is complete, we are then dependent on #3, vttablet's automatic reload. #4 is impractical for production use.
In our case the VStream API is called, with gtid set to "current", before the periodic reload happens. The schema is then not in sync, which results in the schema-mismatch error that is thrown.
We discussed reloading the schema once Online DDL completes a migration. However, we need to resolve a couple of things before we can do that, so this requires more thought and will not happen in the short term.
The recommended way, at this time, is to enable the tracker in vttablet using `-track_schema_versions=true`.
Also, the reason I was unable to consistently repro earlier was that my tablets always had vstreams running on them, which were reloading the schema. So a fresh VStream API client always found the updated schema.
@rohit-nayak-ps thanks for the update.
First of all, I really appreciate your explanation. Also good work in finding and fixing the issue with the internal tables (#7159).
Thanks for letting me know that having a vstream running on the tablets is essential to reloading the tablet's schema. In my case, all VStream API clients had failed due to #7159, and no tracker was enabled via `-track_schema_versions=true` either, so no schema reload happened.
Will enabling the tracker with `-track_schema_versions=true` on all vttablets add any perceivable overhead? Why isn't it a default vttablet configuration?
> Will enabling the tracker with `-track_schema_versions=true` on all vttablets add any perceivable overhead? Why isn't it a default vttablet configuration?
There is an overhead of an additional vstreamer which will download the binlogs and do the minimal parsing required. Since it only deals with DDLs it is less than a regular vstream.
Whether it is perceptible depends on the server configuration and write QPS. This is precisely why we disable it by default. Originally it was enabled by default, but we had a few customers in production who were affected by it. (iirc) Those with lots of small servers + high QPS saw spikes in CPU usage when they migrated to that version.
The solution is for the tracker to be lightweight. I have done a quick POC by paring down the vstreamer functionality to a minimum and got over a 60% reduction in CPU usage. Productionising it would however need a lot of testing, since vstreamer would then follow different code paths depending on whether it is a "lite" or regular version, and vstreamers are at the core of vreplication. So it is not too high on our priority list at this moment. I will create an issue for this soon, and if we find more support for it we can take it up earlier!
Closing this. As discussed above, the recommended way to get around this issue is to enable the tracker in vttablet using `-track_schema_versions=true`.
Overview of the Issue
Our Debezium Vitess Connector (CDC) uses VStream gRPC to stream change events from a sharded keyspace called `test_sharded_keyspace` (2 shards: `-80` and `80-`).

When running the following `gh-ost` online schema migration, VStream gRPC throws a server-side error:
Reproduction Steps
Steps to reproduce this issue:

1. Deploy the following vschema:
2. Deploy the following schema:
3. Run a VStream gRPC client to continuously stream from the sharded keyspace `test_sharded_keyspace` where the table resides. The table has 30 million rows.
4. Run `vtctlclient -server vtctld-host:15999 ApplySchema -sql "ALTER WITH 'gh-ost' TABLE bar_entry add column status int" test_sharded_keyspace` to start the `gh-ost` online schema migration.
5. Run `vtctlclient -server vtctld-host:15999 OnlineDDL test_sharded_keyspace show recent` to check the gh-ost job status, which changes from `queued` to `running` to `complete` on each shard (`-80` and `80-`).
6. Run `show create table bar_entry\G` and see that the new column `status` is present.
7. The VStream gRPC client receives the following server-side error:
Binary version