vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Creating shards for a keyspace with differing keyspace ID lengths fails to mark tablet as serving #12535

Closed jeremycole closed 1 year ago

jeremycole commented 1 year ago

Overview of the Issue

When adjacent shard definitions in a keyspace use a different number of bytes to represent their keyspace ID boundaries, the tablets for those shards are not initialized correctly, resulting in non-serving tablets.

Reproduction Steps

We created a brand new keyspace with the following shards:

All shards came up correctly except for 000280-000300, whose tablet started up but did not mark itself as IsPrimaryServing and thus there was no serving tablet for the shard. We were able to manually get past this by marking the tablet as serving explicitly:

vtctlclient SetShardIsPrimaryServing keyspace-name/000280-000300 true

This was due to an error in key.KeyRangeAdd, used by topotools.ValidateForReshard via combineKeyRanges here:

https://github.com/vitessio/vitess/blob/47611bca3951ecdf442dda5c8fc12f4eb9cff29c/go/vt/topotools/split.go#L67

https://github.com/vitessio/vitess/blob/47611bca3951ecdf442dda5c8fc12f4eb9cff29c/go/vt/key/key.go#L125-L136

The KeyRangeAdd function compares the End and Start values with bytes.Equal(first.End, second.Start), without normalizing them in any way. For the shards in question, the End value []byte{0x00, 0x03, 0x00} and the Start value []byte{0x00, 0x03} therefore do not match, even though they denote the same boundary. As a result, KeyRangeAdd returns nil, false instead of the properly combined range, and validation of the shard topology via ValidateForReshard fails.

Binary Version

Vitess 15 from `main` branch with custom patches

Operating System and Environment details

Kubernetes 1.24 running on Google's GKE

Log Fragments

No response

jeremycole commented 1 year ago

The bug in KeyRangeAdd is reproducible with the following simple test case which fails on the current main:

func TestKeyRangeAdd1(t *testing.T) {
    keyRange, ok := KeyRangeAdd(stringToKeyRange("000280-000300"), stringToKeyRange("0003-"))
    assert.Equal(t, stringToKeyRange("000280-"), keyRange)
    assert.Equal(t, true, ok)
}