vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.52k stars 2.09k forks source link

Bug Report: VReplication may delete the workflow on failed ReverseTraffic #14276

Closed mattlord closed 4 months ago

mattlord commented 1 year ago

Overview of the Issue

If the ReverseTraffic step fails, the cancellation of the associated steps can delete the workflow — leaving it and the keyspace in a broken state.

Reproduction Steps

Here is a repeatable test case on main:

git checkout main && git pull 
make build

cd examples/local

./101_initial_cluster.sh; mysql < ../common/insert_commerce_data.sql; ./201_customer_tablets.sh; sleep 30; ./202_move_tables.sh

vtctldclient --action_timeout 30s MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic

sleep 15

mysql -vv commerce -e "select * from customer"

let primaryuid=$(vtctldclient GetTablets --keyspace=customer --tablet-type=primary | awk '{print $1}' | cut -d "-" -f2)+0
command mysql -vv -u root --socket=${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock --binary-as-hex=false vt_customer -e "lock table corder write; select sleep(600);" &

vtctldclient --action_timeout 10s MoveTables --workflow commerce2customer --target-keyspace customer reversetraffic

mysql -vv commerce -e "select * from customer"

vtctldclient workflow --keyspace customer list

vtctldclient --action_timeout 10s MoveTables --workflow commerce2customer --target-keyspace customer switchtraffic

mysql -vv commerce -e "select * from customer"

vtctldclient workflow --keyspace customer list

vtctldclient --action_timeout 10s MoveTables --workflow commerce2customer --target-keyspace customer reversetraffic

mysql -vv commerce -e "select * from customer"

vtctldclient workflow --keyspace customer list

Results being:

...

--------------
lock table corder write
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
select sleep(600)
--------------

E1013 18:09:25.074063   95336 main.go:56] rpc error: code = DeadlineExceeded desc = context deadline exceeded
--------------
select * from customer
--------------

ERROR 1815 (HY000) at line 1: query select * from customer failed after retries: <nil>
Bye
[
  "commerce2customer"
]
SwitchTraffic was successful for workflow customer.commerce2customer

Start State: Reads Not Switched. Writes Switched
Current State: All Reads Switched. Writes Switched

--------------
select * from customer
--------------

ERROR 1815 (HY000) at line 1: query select * from customer failed after retries: <nil>
Bye
[
  "commerce2customer"
]
E1013 18:11:25.986211   95825 main.go:56] rpc error: code = Unknown desc = failed to stop writes in the customer keyspace: Code: INVALID_ARGUMENT
one or more tables are already present in the denylist
--------------
select * from customer
--------------

ERROR 1815 (HY000) at line 1: query select * from customer failed after retries: <nil>
Bye
[]

Binary Version

vtgate version Version: 19.0.0-SNAPSHOT (Git revision 0419d15d1322d907faac5734904fa3db575584f5 branch 'main') built on Fri Oct 13 13:32:24 EDT 2023 by matt@pslord.local using go1.21.2 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response

mattlord commented 1 year ago

With this small patch the workflow is no longer deleted, but we still have query serving issues:

diff --git a/go/vt/vtctl/workflow/server.go b/go/vt/vtctl/workflow/server.go
index efa9fc8937..3f1db23511 100644
--- a/go/vt/vtctl/workflow/server.go
+++ b/go/vt/vtctl/workflow/server.go
@@ -3168,10 +3168,12 @@ func (s *Server) switchWrites(ctx context.Context, req *vtctldatapb.WorkflowSwit
            return handleError("failed to reset the sequences", err)
        }

-       ts.Logger().Infof("Creating reverse streams")
-       if err := sw.createReverseVReplication(ctx); err != nil {
-           sw.cancelMigration(ctx, sm)
-           return handleError("failed to create the reverse vreplication streams", err)
+       if req.EnableReverseReplication && ts.workflow != ts.reverseWorkflow {
+           ts.Logger().Infof("Creating reverse streams")
+           if err := sw.createReverseVReplication(ctx); err != nil {
+               sw.cancelMigration(ctx, sm)
+               return handleError("failed to create the reverse vreplication streams", err)
+           }
        }
    } else {
        if cancel {
diff --git a/go/vt/vtctl/workflow/traffic_switcher.go b/go/vt/vtctl/workflow/traffic_switcher.go
index d4fe77130a..11168e5a64 100644
--- a/go/vt/vtctl/workflow/traffic_switcher.go
+++ b/go/vt/vtctl/workflow/traffic_switcher.go
@@ -1010,9 +1010,11 @@ func (ts *trafficSwitcher) cancelMigration(ctx context.Context, sm *StreamMigrat
        ts.Logger().Errorf("Cancel migration failed: could not restart vreplication: %v", err)
    }

-   err = ts.deleteReverseVReplication(ctx)
-   if err != nil {
-       ts.Logger().Errorf("Cancel migration failed: could not delete revers vreplication entries: %v", err)
+   if ts.workflow == ts.reverseWorkflow {
+       err = ts.deleteReverseVReplication(ctx)
+       if err != nil {
+           ts.Logger().Errorf("Cancel migration failed: could not delete reverse vreplication entries: %v", err)
+       }
    }
 }