vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Error indicates `current keyspace is being resharded` when PRIMARY is down and no resharding is happening #10628

Status: Closed (aquarapid closed this issue 10 months ago)

aquarapid commented 2 years ago

Overview of the Issue

When inserting rows across shards in a sharded keyspace while one of the target shards has a NOT_SERVING PRIMARY tablet, the following error is returned:

ERROR 1105 (HY000): transaction rolled back to reverse changes of partial DML execution: target: sharded.-80.primary: current keyspace is being resharded

The first part of the error (the rollback of the partial DML execution) is correct, but the part about resharding is inaccurate and confusing, because no resharding is in progress.
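To make the two parts of the message easier to tell apart, here is a minimal Python sketch that splits it into the wrapper text, the target, and the underlying cause. The pattern is inferred only from the single message quoted in this report, so other vtgate errors may not match it; `VtgateError` and `parse_vtgate_error` are hypothetical names and not part of any Vitess client.

```python
import re
from typing import NamedTuple, Optional


class VtgateError(NamedTuple):
    wrapper: str      # text before "target:"
    keyspace: str
    shard: str
    tablet_type: str
    cause: str        # text after the target


# Assumption: the layout "<wrapper>: target: <keyspace>.<shard>.<type>: <cause>"
# is taken from the one error message in this report.
_TARGET_RE = re.compile(
    r"^(?P<wrapper>.*?):\s*target:\s*"
    r"(?P<keyspace>[^.\s]+)\.(?P<shard>[^.\s]+)\.(?P<tablet_type>[^:\s]+):\s*"
    r"(?P<cause>.*)$"
)


def parse_vtgate_error(message: str) -> Optional[VtgateError]:
    """Return the parsed parts, or None if the message has another shape."""
    match = _TARGET_RE.match(message)
    if match is None:
        return None
    return VtgateError(**match.groupdict())
```

Applied to the message above, the `cause` field isolates the misleading "current keyspace is being resharded" text from the accurate rollback wrapper.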

Reproduction Steps

Vschema:

{
    "sharded": true,
    "vindexes": {
        "hash": {
            "type": "hash"
        }
    },
    "tables": {
        "t1": {
            "column_vindexes": [
                {
                    "column": "c1",
                    "name": "hash"
                }
            ]
        }
    }
}

Schema:

create table t1 ( c1 int, primary key(c1) ) engine=innodb;

Now, take the PRIMARY tablet for one of the shards in the sharded keyspace offline:

mysql> show vitess_tablets;                           
+-------+-----------+-------+------------+-------------+------------------+---------------+----------------------+
| Cell  | Keyspace  | Shard | TabletType | State       | Alias            | Hostname      | PrimaryTermStartTime |
+-------+-----------+-------+------------+-------------+------------------+---------------+----------------------+
| zone1 | sharded   | -80   | PRIMARY    | NOT_SERVING | zone1-0000000200 | 192.168.0.134 | 2022-06-30T13:10:07Z |
| zone1 | sharded   | -80   | REPLICA    | SERVING     | zone1-0000000300 | 192.168.0.134 |                      |
| zone1 | sharded   | 80-   | PRIMARY    | SERVING     | zone1-0000000201 | 192.168.0.134 | 2022-06-30T13:10:07Z |
| zone1 | sharded   | 80-   | REPLICA    | SERVING     | zone1-0000000301 | 192.168.0.134 |                      |
+-------+-----------+-------+------------+-------------+------------------+---------------+----------------------+
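When scripting this reproduction, the tablet state shown above can be checked programmatically. The following is a rough Python sketch, assuming the `show vitess_tablets` output has been captured as text in the mysql client's box format; both function names are hypothetical helpers. It only flags shards that list a PRIMARY row in a state other than SERVING, and it does not detect shards with no PRIMARY row at all.

```python
from typing import Dict, List


def parse_mysql_table(text: str) -> List[Dict[str, str]]:
    """Parse mysql-client box-style output into one dict per data row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skips +---+ borders, prompts, and blank lines
        body = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in body.split("|")])
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def shards_with_non_serving_primary(rows: List[Dict[str, str]]) -> List[str]:
    """List keyspace/shard pairs whose PRIMARY row is not SERVING."""
    return sorted(
        f"{row['Keyspace']}/{row['Shard']}"
        for row in rows
        if row.get("TabletType") == "PRIMARY" and row.get("State") != "SERVING"
    )
```

For the output in this report, the second function would be expected to return only `sharded/-80`.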

Now, insert multiple rows that span the shards and observe the error:

mysql> insert into t1 (c1) values (2),(3),(4),(5),(6);
ERROR 1105 (HY000): transaction rolled back to reverse changes of partial DML execution: target: sharded.-80.primary: current keyspace is being resharded
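The multi-row insert touches both shards because each row's keyspace ID falls into one of the two key ranges: `-80` covers IDs below `0x80` (the health check log in this report shows `key_range:{end:"\x80"}` for that shard) and `80-` covers the rest. A minimal Python sketch of that range check follows. It assumes shard names of the form `start-end` in hex, does not implement the `hash` vindex that maps `c1` to a keyspace ID, and uses the hypothetical name `shard_contains`.

```python
def shard_contains(shard: str, keyspace_id: bytes) -> bool:
    """Check whether keyspace_id lies in the shard's [start, end) key range.

    Assumes a shard name such as "-80" or "80-"; an empty bound means
    unbounded on that side. Bounds are compared lexicographically as bytes.
    """
    start_hex, end_hex = shard.split("-")
    start = bytes.fromhex(start_hex)
    end = bytes.fromhex(end_hex)
    return keyspace_id >= start and (end == b"" or keyspace_id < end)
```

Any row whose keyspace ID starts with a byte below `0x80` is routed to `-80`, so a single statement spanning both ranges fails as a whole once that shard's PRIMARY is unavailable.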

Binary Version

Version: 15.0.0-SNAPSHOT (Git revision 9199470f84406a62b8e70c7766df53f3e6e46c96 branch 'main') built on Wed Jun 29 17:29:14 PDT 2022 by jacques@dhoomdeskw.localdomain using go1.18.3 linux/amd64

Operating System and Environment details

Fedora 36, amd64

Log Fragments

vtgate logs:

This is the vttablet being stopped (to make it NOT_SERVING):

W0630 10:47:57.182639   34352 component.go:41] [core] grpc: addrConn.createTransport failed to connect to {192.168.0.134:16200 192.168.0.134:16200 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.0.134:16200: connect: connection refused"
W0630 10:47:57.182823   34352 tablet_health_check.go:335] tablet alias:{cell:"zone1" uid:200} hostname:"192.168.0.134" port_map:{key:"grpc" value:16200} port_map:{key:"vt" value:15200} keyspace:"sharded" shard:"-80" key_range:{end:"\x80"} type:PRIMARY mysql_hostname:"192.168.0.134" mysql_port:17200 primary_term_start_time:{seconds:1656594607 nanoseconds:78883826} db_server_version:"8.0.28" default_conn_collation:255 healthcheck stream error: Code: UNAVAILABLE
vttablet: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.0.134:16200: connect: connection refused"

This is when the insert is executed:

I0630 10:48:22.284492   34352 shard_buffer.go:282] Starting buffering for shard: sharded/-80 (window: 10s, size: 1000, max failover duration: 20s) (A failover was detected by this seen error: current keyspace is being resharded.)

No other relevant errors are seen.
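The buffering log line above is where the misleading text surfaces: vtgate reports that it started buffering for the shard and names the error it interpreted as a failover. As a small aid for scanning vtgate logs for this, here is a Python sketch that extracts those fields. The regular expression is derived only from the single log line in this report, and `parse_buffer_start` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Assumption: the message layout matches the shard_buffer.go line quoted above.
_BUFFER_START_RE = re.compile(
    r"Starting buffering for shard: (?P<keyspace>[^/\s]+)/(?P<shard>\S+) "
    r"\(window: (?P<window>[^,]+), size: (?P<size>\d+), "
    r"max failover duration: (?P<max_failover_duration>[^)]+)\) "
    r"\(A failover was detected by this seen error: (?P<trigger>.*)\.\)"
)


def parse_buffer_start(line: str) -> Optional[Dict[str, str]]:
    """Return the buffering parameters and trigger error, or None."""
    match = _BUFFER_START_RE.search(line)
    return match.groupdict() if match else None
```

A `trigger` value of "current keyspace is being resharded" on a keyspace with no resharding workflow running would point to the situation described in this issue.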

deepthi commented 10 months ago

We believe this should also be fixed by https://github.com/vitessio/vitess/pull/13856. Please re-open if you continue to see this after v18 (or a main build that includes that PR).