Feature Request: Resume VReplication after error

glortho commented 2 years ago

Feature Description

A recent MoveTables operation failed because we hit a NULL value for a primary sharding key:

vttablet: rpc error: code = Unknown desc = could not map [NULL] to a keyspace id, got destination DestinationNone()

When trying to resume, we crossed our fingers that re-running it would pick up where it left off. But then we saw:

Duplicate entry '44' for key 'PRIMARY' (errno 1062) (sqlstate 23000) during query...

We are therefore resetting all data and starting the copy from scratch.

Assuming there is no existing solution for this, it would be very nice to have one!

Let me know if more detail would be helpful. (Reference also: https://github.com/vitessio/vitess/issues/8056.)

cc @arthurschreiber

Use Case(s)

See above.

mattlord commented 2 years ago

Hi @glortho !

I'm not sure whether you hit a bug here. If it's a feature request, I'm also unclear on what exactly the enhancement is. I would say that we do already resume, but that of course does not mean that the first error won't cause a later one or, more broadly, that you won't encounter another error after resuming.

Can you help clarify by adding additional details?

Thank you!

glortho commented 2 years ago

Thank you for the quick reply @mattlord.

We ran a command like this to copy from an unsharded cluster running VTTablets to a new sharded cluster (with 4 shards):

vtctlclient -server vitess-vtctld.our.host:port MoveTables -source=cluster1_ks \
    -tables='all,the,tables' \
    -tablet_types=rdonly \
    -timeout=120s \
    Create cluster2_ks.cluster1_ks_to_cluster2_ks_movetables

About 1/3 of the data copied before it failed with:

vttablet: rpc error: code = Unknown desc = could not map [NULL] to a keyspace id, got destination DestinationNone()

Indeed, one of the tables had a row with a NULL value where a primary sharding key value was supposed to be. We backfilled that value and tried to resume, running this command:

vtctlclient -server vitess-vtctld.our.host:port MoveTables Progress cluster2_ks.cluster1_ks_to_cluster2_ks_movetables2

Was 👆🏼 this what we were supposed to do? It output all table copy stats and looked promising overall.

But then we found this in the log:

message: Duplicate entry '44' for key 'PRIMARY' (errno 1062) (sqlstate 23000) during query: insert into...

So we deleted all data in the target cluster and started over.

Did we miss a step?

My understanding of the problem here is that once we had that NULL error, there was no way to recover, because we'd always trip on it in the (immutable) binlog.

Please correct any misunderstandings here! I'm still learning. :-)
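
For anyone hitting the same failure, a pre-flight check along these lines might catch NULL sharding key values before the copy starts. This is a minimal sketch, not a Vitess feature: the table names, the `SHARDING_KEYS` mapping, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the source MySQL keyspace so the example is self-contained.

```python
import sqlite3

# Hypothetical mapping of table name -> sharding key column (not from the thread).
SHARDING_KEYS = {"customer": "customer_id", "orders": "customer_id"}


def null_key_query(table: str, column: str) -> str:
    """Build a query counting rows whose sharding key column is NULL."""
    # Identifiers come from a trusted, hard-coded mapping, so plain
    # formatting is acceptable here; never do this with user input.
    return f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"


def find_null_keys(conn, sharding_keys):
    """Return {table: count} for tables that contain NULL sharding keys."""
    offenders = {}
    for table, column in sharding_keys.items():
        (count,) = conn.execute(null_key_query(table, column)).fetchone()
        if count:
            offenders[table] = count
    return offenders


def demo():
    # In-memory SQLite stands in for the source keyspace (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, 1), (2, 2)])
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, None)])
    return find_null_keys(conn, SHARDING_KEYS)
```

Calling `demo()` reports only the `orders` table, since it is the only one with a NULL in the assumed sharding column. Against a real source, the same `COUNT(*) ... IS NULL` query would be run per table before creating the workflow.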

mattlord commented 2 years ago

Hi @glortho!

OK, in that case I think it was an unrecoverable error because we added a uniqueness constraint and routing change (in effect) after some of the data had already been copied.

Related to this, we're working on detecting unrecoverable errors and handling them differently: https://github.com/vitessio/vitess/pull/10429

It seems that your request may be related to that, meaning that you were confused by the fact that VReplication was resuming/retrying repeatedly and it was not clear that you had an error to deal with until you looked at the logs. Is that a reasonable statement?
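
To make the retryable-versus-unrecoverable distinction concrete, here is a rough sketch of classifying an error message by its MySQL error number. It is not the logic from the linked PR: the two sets below are illustrative assumptions, and `classify` is a hypothetical helper. Only the error-number meanings (1062 is a duplicate entry, 1205 a lock wait timeout, 1213 a deadlock) are standard MySQL.

```python
import re

# Illustrative sets only; which errors Vitess actually treats as
# unrecoverable is not stated in this thread.
UNRECOVERABLE_ERRNOS = {1062}    # ER_DUP_ENTRY
RETRYABLE_ERRNOS = {1205, 1213}  # ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK

# Matches the "(errno NNNN)" fragment seen in the log lines above.
ERRNO_RE = re.compile(r"\(errno (\d+)\)")


def classify(message: str) -> str:
    """Return 'unrecoverable', 'retryable', or 'unknown' for an error message."""
    match = ERRNO_RE.search(message)
    if match is None:
        # No MySQL errno in the message, so this sketch cannot decide.
        return "unknown"
    errno = int(match.group(1))
    if errno in UNRECOVERABLE_ERRNOS:
        return "unrecoverable"
    if errno in RETRYABLE_ERRNOS:
        return "retryable"
    return "unknown"
```

With these assumed sets, the duplicate-entry message from the log would be classified as unrecoverable and surfaced immediately instead of being retried in a loop, while the NULL-mapping error carries no errno and falls through to "unknown".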

Thanks again!

glortho commented 2 years ago

"we're working on detecting unrecoverable errors and handling them differently"

I do believe that would help, yes!

But actually my question/request was indeed about whether there would be some way to layer over this unrecoverable error and resume from there. I'm not feeling optimistic about that given where this conversation is going 😄, but my mental model was some kind of meta-log specific to MoveTables that mirrors the binlog, except that the fatal exception doesn't get written to it. On resuming MoveTables, you'd be fast-forwarded right to the copy that threw the exception, to try again.

If that makes no sense, it's me, not you. 😆
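
The "fast-forward to the copy that failed" idea can be sketched as a checkpointed copy. This is a toy model under stated assumptions, not a description of how VReplication tracks its state: plain dicts stand in for the source table, the target table, and the persisted checkpoint, and every name here is hypothetical.

```python
class CopyFailed(Exception):
    """Raised when a row cannot be copied."""


def copy_with_checkpoint(source, target, state, validate, batch_size=2):
    """Copy rows in primary-key order, recording the last copied key.

    source:   dict of pk -> row (stands in for the source table)
    target:   dict of pk -> row (stands in for the target table)
    state:    dict holding "last_pk", the checkpoint that survives a failure
    validate: callable that raises CopyFailed for a row that cannot be copied
    """
    last_pk = state.get("last_pk")
    # Only rows past the checkpoint are considered, so a resume never
    # re-inserts rows that were already copied.
    pending = sorted(pk for pk in source if last_pk is None or pk > last_pk)
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Validate the whole batch before writing: each batch is all-or-nothing.
        for pk in batch:
            validate(source[pk])
        for pk in batch:
            if pk in target:
                raise CopyFailed(f"Duplicate entry '{pk}' for key 'PRIMARY'")
            target[pk] = source[pk]
        state["last_pk"] = batch[-1]


def require_sharding_key(row):
    """Reject rows whose (hypothetical) sharding key column is NULL."""
    if row["customer_id"] is None:
        raise CopyFailed("could not map [NULL] to a keyspace id")


def demo():
    source = {pk: {"customer_id": pk} for pk in range(1, 7)}
    source[4]["customer_id"] = None  # the bad row
    target, state = {}, {}
    try:
        copy_with_checkpoint(source, target, state, require_sharding_key)
    except CopyFailed:
        pass  # first attempt stops at the batch containing the bad row
    copied_before_fix = sorted(target)
    source[4]["customer_id"] = 4  # backfill the missing value
    copy_with_checkpoint(source, target, state, require_sharding_key)
    return copied_before_fix, sorted(target), state["last_pk"]
```

In `demo()`, the first run copies rows 1 and 2 and then fails on the batch containing row 4. After the backfill, the second run starts from the checkpoint and copies rows 3 through 6 without touching the rows already present, so no duplicate-key error occurs. The sketch covers only the bulk-copy phase and ignores binlog streaming entirely.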

mattlord commented 7 months ago

I'm going to close this as completed because I'm not aware of any additional work to do around this today. We have added support for identifying retryable errors and automatically retrying them since this was created. If there's a specific case that we have missed please open a new issue with the specifics. Thanks!

vitessio / vitess

Feature Request: Resume VReplication after error #10456
