vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Panic when accessing table from denied_tables #16858

Open wiebeytec opened 1 month ago

wiebeytec commented 1 month ago

Overview of the Issue

When I perform operations on tables that are listed in denied_tables on the shard tablet control (put there by MoveTables), vtgate panics and the local client reports "Lost connection to MySQL server during query".

This is very confusing to programmers. Even though they're not supposed to use that table, they don't see what they're doing wrong.

Expected result: a query error is returned stating that the table is marked as denied. I think this used to be the case before; I'm not sure when it changed.

Reproduction Steps

Put a table in the denied_tables:

```sh
./vtctldclient SetShardTabletControl --denied-tables "widgets" legacy/0 primary
./vtctldclient RefreshStateByShard legacy/0
```
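
To double-check that the deny list was applied, the shard record (which stores the tablet controls, including denied_tables) can be inspected:

```sh
./vtctldclient GetShard legacy/0
```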

Then when you select from it:

```sql
mysql> select * from legacy.widgets;
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id:    198
Current database: legacy

ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Connection id:    199
Current database: legacy

ERROR 2013 (HY000): Lost connection to MySQL server during query
```

Binary Version

vtgate version Version: 20.0.2 (Git revision 2592c5932b3036647868299b6df76f8ef28dfbc8 branch 'HEAD') built on Wed Sep 11 08:15:20 UTC 2024 by runner@fv-az1152-369 using go1.22.7 linux/amd64

But I've been seeing this behavior for a while, so it probably affects other versions too.

Operating System and Environment details

```sh
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
```

Log Fragments

```sh
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: E0927 11:00:26.673458 3174554 server.go:373] mysql_server caught panic:
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: runtime error: invalid memory address or nil pointer dereference
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: runtime/panic.go:261 (0x4574d7)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: runtime/signal_unix.go:881 (0x4574a5)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/buffer/buffer.go:168 (0x1638152)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/plan_execute.go:107 (0x163813b)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/executor.go:432 (0x162cf8f)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/executor.go:228 (0x162b4b1)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/vtgate.go:462 (0x1669fb7)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/vt/vtgate/plugin_mysql_server.go:259 (0x163b95e)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/mysql/conn.go:1400 (0x109261d)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/mysql/conn.go:1385 (0x109230a)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/mysql/conn.go:951 (0x108e8e4)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/mysql/server.go:552 (0x10ad2af)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: vitess.io/vitess/go/mysql/server.go:356 (0x10abeeb)
sep 27 11:00:26 vitess-unittest start_vtgate[3174554]: runtime/asm_amd64.s:1695 (0x479a20)
```
mattlord commented 1 month ago

cc @vitessio/query-serving

harshit-gangal commented 1 month ago

This could be happening because buffering is disabled by default and we are accessing it in the code without that check.
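
If so, a guard like the one sketched below would avoid the panic. This is only an illustration of the pattern with made-up names (`Buffer`, `WaitForFailover`, `execute`), not the actual vtgate code:

```go
package main

import (
	"errors"
	"fmt"
)

// Buffer stands in for the vtgate query buffer; these names are made up
// for illustration and are not the real Vitess types.
type Buffer struct {
	enabled bool
}

// WaitForFailover pretends to block until buffering ends. Calling it on a
// nil *Buffer panics on the b.enabled field access, which is the same kind
// of nil-pointer dereference as in the stack trace above.
func (b *Buffer) WaitForFailover() error {
	if !b.enabled {
		return nil
	}
	// ... wait for the traffic switch to finish ...
	return nil
}

// execute sketches the guarded call path: when buffering is disabled the
// buffer pointer is nil, so the original query error (e.g. "table is
// denied") is returned to the client instead of turning into a panic.
func execute(buf *Buffer, queryErr error) error {
	if buf == nil {
		return queryErr
	}
	if err := buf.WaitForFailover(); err != nil {
		return err
	}
	return queryErr
}

func main() {
	denied := errors.New("table widgets is denied on this shard")
	fmt.Println(execute(nil, denied)) // prints the error instead of panicking
}
```

With a guard like that, the original query error would surface to the client instead of the connection being dropped.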

arthurschreiber commented 3 weeks ago

@wiebeytec I think this was fixed via https://github.com/vitessio/vitess/pull/16922. Can this issue be closed?

wiebeytec commented 3 weeks ago

I tested with 21.0.0-rc2 (Git revision 54fa8d887fb0c154dae99b1668e4748a8f40fe42) and the current behavior still seems incorrect.

There are these cases:

Case 1

```sql
vexplain queries select * from legacy.sites limit 1;
-- 30 seconds pass
-- vexplain output, saying the query went to another keyspace, a sharded one.
```

This takes exactly 30 seconds to return, and the output says the query is served not from legacy but from sites2024. Presumably the retry doesn't use the db prefix and ends up in 'global routing' mode; I assume that because case 2 below behaves differently.

Case 2

The same as case 1, but with a default database set:

```sql
use legacy;
vexplain queries select * from legacy.sites limit 1;
-- 60 seconds pass (indeed, 60 seconds)
ERROR 1105 (HY000): query vexplain queries select * from legacy.sites limit 1 failed after retries: <nil>
```

Case 3

```sql
select * from legacy.widgets limit 1;
-- 60 seconds pass
ERROR 1105 (HY000): query select * from legacy.widgets limit 1 failed after retries: <nil>
```

Expected behavior

An instant error saying that access to the table is denied because it is on the deny list for this shard.
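
For comparison, an immediate error along these lines (the exact wording below is only illustrative, not necessarily what vttablet prints) would make the problem obvious right away:

```sql
mysql> select * from legacy.widgets limit 1;
ERROR 1105 (HY000): target: legacy.0.primary: disallowed due to rule: enforce denied tables
```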