vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.67k stars 2.1k forks source link

Bug Report: joining sharded tables with reference tables don't reroute to one in the keyspace #15777

Open wiebeytec opened 6 months ago

wiebeytec commented 6 months ago

Edit: See this comment for the real cause. The routing breaks for all reference tables if not all reference tables are present in the unsharded vschema.

Overview of the Issue

Reference table routing may not work as documented. If I don't explicitely name the copy of a table inside the sharded keyspace in a SELECT ... JOIN, it uses the current USE keyspace. This results in the amount of queries scattered multiplying. When I use 'global routing' / 'unspecified mode', it also uses the source table, not the reference, with the same result.

Expected result: at least for global routing mode, I expected the reference table to be used. And the documentation somewhat suggests that even when addressing the source table or having a USE keyspace defined, that it reroutes:

"A SELECT ... JOIN (or equivalent SELECT ... WHERE) will try to route the query to the keyspace of the table to which the reference (or source) table is being joined."

But granted, perhaps I'm reading my wishes into that. It would be really great if it worked that way though, because only using unspecified mode is really hard when you're moving tables between keyspaces (as they exist in multiple keyspaces at some point, breaking the routing).

Reproduction Steps

I have these (simplified) tables:


-- In unsharded keyspace legacy
CREATE TABLE `dataAttributes` (
  `idDataAttribute` smallint unsigned NOT NULL AUTO_INCREMENT,
  `code` varchar(25) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`idDataAttribute`),
  UNIQUE KEY `code_UNIQUE` (`code`),
)

-- in sharded keyspace sites2023
CREATE TABLE `lastLogData` (
  `idSite` int unsigned NOT NULL,
  `idDataAttribute` smallint unsigned NOT NULL,
  PRIMARY KEY (`idSite`,`idDataAttribute`)
)

The table dataAttributes is copied to sites2023 with Materialize, and it's defined in the VSchema for sites2023 as reference:

{
  "sharded": true,
  "vindexes": {
    "a_standard_hash": {
      "type": "hash"
    }
  },
  "tables": {
    "dataAttributes": {
      "type": "reference",
      "source": "legacy.dataAttributes"
    },

    "lastLogData": {
      "column_vindexes": [
        {
          "column": "idSite",
          "name": "a_standard_hash"
        }
      ]
    }
  }
}

When I run this query:

vexplain queries
SELECT lld.idSite, da.code
FROM sites2023.lastLogData AS lld
INNER JOIN dataAttributes AS da
ON lld.idDataAttribute = da.idDataAttribute
WHERE lld.idSite = 121699

with a default USE of legacy (the default USE is the argument to the mysql command, which will affect the table dataAttributes), you can see it multiplies many queries to the legacy keyspace:

# mysql legacy < query-debug-strip.sql | head -n 30
#       keyspace        shard   query
0       sites2023       80-cc   select lld.idSite, lld.idDataAttribute from lastLogData as lld where lld.idSite = 121699
1       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 1
2       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 2
3       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 3
4       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 6
5       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 8
6       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 11
7       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 14
8       legacy          0       select da.`code` from dataAttributes as da where da.idDataAttribute = 17
etc
etc

But when I address the dataAttributes table directly in sites2023:

vexplain queries
SELECT lld.idSite, da.code
FROM sites2023.lastLogData AS lld
INNER JOIN sites2023.dataAttributes AS da
ON lld.idDataAttribute = da.idDataAttribute
WHERE lld.idSite = 121699

The result is correct:

# mysql legacy < query-debug-strip.sql | head -n 30
#       keyspace        shard   query
0       sites2023       80-cc   select lld.idSite, da.`code` from lastLogData as lld, dataAttributes as da where lld.idSite = 121699 and lld.idDataAttribute = da.idDataAttribute

If I don't specify a default DB with USE, and don't fully quailfy dataAttributes, it says:

ERROR 1105 (HY000) at line 1: table dataAttributes not found

I don't know at this point why my 'global routing' doesn't work, but even if I fix it by adding this vschema to legacy:

{
  "sharded": false,
  "tables": {
    "dataAttributes": {}
  }
}

it still picks the table from the legacy keyspace. To demonstrate, this is again the query with an unqualified dataAttributes:

# cat query-debug-reference.sql 
vexplain queries 
SELECT lld.idSite, da.code
FROM sites2023.lastLogData lld
INNER JOIN dataAttributes da ON
lld.idDataAttribute = da.idDataAttribute
WHERE lld.idSite = 121699

And I run this without a DB argument to mysql. It still erroneously goes to legacy:

# mysql < query-debug-reference.sql | head -n 20
#       keyspace        shard   query
0       sites2023       80-cc   select lld.idSite, lld.idDataAttribute from lastLogData as lld where lld.idSite = 121699
1       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 1
2       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 2
3       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 3
4       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 6
5       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 8
6       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 11
7       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 14
8       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 17
9       legacy 0       select da.`code` from dataAttributes as da where da.idDataAttribute = 20
etc

Binary Version

vtgate version Version: 19.0.3 (Git revision cb5464edf5d7075feae744f3580f8bc626d185aa branch 'HEAD') built on Thu Apr  4 12:18:41 UTC 2024 by runner@fv-az1543-228 using go1.22.2 linux/amd64

### Operating System and Environment details

```sh
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Linux vitess-vtctld 6.2.0-1014-aws #14~22.04.1-Ubuntu SMP Thu Oct  5 22:43:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


### Log Fragments

_No response_
rohit-nayak-ps commented 6 months ago

I am not able to repro this in v19. I ran the end to end test TestReferenceRouting in go/test/endtoend/vtgate/queries/reference. I commented out the teardown // defer clusterInstance.Teardown() . Based on this test setup, I believe the similar queries to what you are running are being planned correctly. Can you check and see what is different in this setup as compared to yours, so we can repro your failure. In particular, what is the value of require_explicit_routing in your vschema?

mysql> vexplain queries select d.id, z.id from sks.delivery_failure d inner join zip_detail z on d.zip_detail_id = z.id where d.id = 1;
+------+----------+-------+---------------------------------------------------------------------------------------------------------+
| #    | keyspace | shard | query                                                                                                   |
+------+----------+-------+---------------------------------------------------------------------------------------------------------+
|    0 | sks      | -80   | select d.id, z.id from delivery_failure as d, zip_detail as z where d.id = 1 and d.zip_detail_id = z.id |
+------+----------+-------+---------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

rohit@RSs-Laptop reference % vttablet --version
vttablet version Version: 19.0.4-SNAPSHOT (Git revision cc64915d0f9c06bd498026c301bdfbca5fd6afc5 branch 'release-19.0') built on Tue Apr 23 11:37:51 CEST 2024 by rohit@RSs-Laptop.local using go1.22.2 darwin/arm64

rohit@RSs-Laptop reference % vtctldclient --server localhost:8106 GetVSchema sks
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash",
      "params": {},
      "owner": ""
    }
  },
  "tables": {
    "delivery_failure": {
      "type": "",
      "column_vindexes": [
        {
          "column": "id",
          "name": "hash",
          "columns": []
        }
      ],
      "auto_increment": null,
      "columns": [],
      "pinned": "",
      "column_list_authoritative": false,
      "source": ""
    },
    "zip_detail": {
      "type": "reference",
      "column_vindexes": [],
      "auto_increment": null,
      "columns": [],
      "pinned": "",
      "column_list_authoritative": false,
      "source": "uks.zip_detail"
    }
  },
  "require_explicit_routing": false,
  "foreign_key_mode": "unspecified"
}
wiebeytec commented 6 months ago

My sites2023 has require_explicit_routing: false. But, when I'm going to need unspecified mode, I see no way but to have that true, I think? Because otherwise, as soon as I create another VSchema with table names, the routing will fail. This would block me from transparantly using MoveTables, as tables exist in two keyspaces at the same time.

Does require_explicit_routing indeed also affect reference table decisions?

I compared my situation to the test case, and it seems mostly the same, except that I have more data; I fill the reference tables with Materialize.

Perhaps my issue is that global routing / unspecified mode is not functioning. I don't know why. If I look at the SrvKeyspace, it all looks good. I also tried rebuilding it. What else can I look at for that?

BTW, this is the vexplain of the query:

{
        "OperatorType": "Join",
        "Variant": "Join",
        "JoinColumnIndexes": "L:0,R:0",
        "JoinVars": {
                "lld_idDataAttribute": 1
        },
        "TableName": "lastLogData_dataAttributes",
        "Inputs": [
                {
                        "OperatorType": "Route",
                        "Variant": "EqualUnique",
                        "Keyspace": {
                                "Name": "sites2023",
                                "Sharded": true
                        },
                        "FieldQuery": "select lld.idSite, lld.idDataAttribute from lastLogData as lld where 1 != 1",
                        "Query": "select lld.idSite, lld.idDataAttribute from lastLogData as lld where lld.idSite = 6",
                        "Table": "lastLogData",
                        "Values": [
                                "6"
                        ],
                        "Vindex": "a_standard_hash"
                },
                {
                        "OperatorType": "Route",
                        "Variant": "Unsharded",
                        "Keyspace": {
                                "Name": "legacy",
                                "Sharded": false
                        },
                        "FieldQuery": "select da.`code` from dataAttributes as da where 1 != 1",
                        "Query": "select da.`code` from dataAttributes as da where da.idDataAttribute = :lld_idDataAttribute",
                        "Table": "dataAttributes"
                }
        ]
} |

I will keep trying things in the mean time.

deepthi commented 6 months ago

I think I understand why you can't use unspecified mode. You have the same table name in legacy and in sites2023. How did this come about? Usually, once tables have been moved out of an unsharded keyspace into a sharded one, they are expected to be dropped from the old keyspace. Even in that case, we can expect that there is a non-zero duration between switching traffic and dropping the source, but routing rules will be present to route queries correctly. @rohit-nayak-ps this is probably the root cause for why things aren't working "as expected".

wiebeytec commented 6 months ago

(comment edited, because I was confusing things)

I think I understand why you can't use unspecified mode. You have the same table name in legacy and in sites2023. How did this come about? Usually, once tables have been moved out of an unsharded keyspace into a sharded one, they are expected to be dropped from the old keyspace.

Because dataAttributes is a reference table, copied to sites2023. But, no table from legacy can be addressed as global routing, also not the ones that were never moved. The table lastLogData, which is currently in both keyspaces because the cluster is in switch traffic mode, actually is globally addressable. Tables from meta2023 as well. But, no table from legacy is.

deepthi commented 6 months ago

Because dataAttributes is a reference table, copied to sites2023

Duh, of course. Will let Rohit keep looking into this :)

wiebeytec commented 6 months ago

I think I have some stale information about legacy somewhere, I just cannot find it. It may be causing you to chase red herrings. I don't know if you think it's worth it, or would like to remove all data, including etcd?

wiebeytec commented 6 months ago

I figured it out: the vschema for the legacy needs to have all reference tables from from sites2023 in it. As soon as I leave even one out, doesn't matter which one, it will route back to legacy. I could have made a better repro case there, sorry. It was too far off my radar as a possibilty. So in the test case, sks needs an extra reference table that is not in the vschema uks, yet is in the db itself. Probably something like:

diff --git a/go/test/endtoend/vtgate/queries/reference/main_test.go b/go/test/endtoend/vtgate/queries/reference/main_test.go
index 4c9440ca4f..28cc709d4b 100644
--- a/go/test/endtoend/vtgate/queries/reference/main_test.go
+++ b/go/test/endtoend/vtgate/queries/reference/main_test.go
@@ -98,6 +98,10 @@ var (
                                        "type": "reference",
                                        "source": "` + unshardedKeyspaceName + `.zip_detail"
                                }
+                               "does_not_exist": {
+                                       "type": "reference",
+                                       "source": "` + unshardedKeyspaceName + `.does_not_exist"
+                               }
                        }
                }
        `

It also goes as I expected when I qualify the sites2023 keyspace, either as USE or when qualifying sites2023.lastLogData.

I can work around it, but it seems to me it's important to fix, because if I add extra reference tables in the future without adding to the vschema, I will create that scenario again.

One extra problem though: I get a panic in vtgate when I use unspecified mode in my application now. I will discuss/report that separately.

rohit-nayak-ps commented 6 months ago

I can work around it, but it seems to me it's important to fix, because if I add extra reference tables in the future without adding to the vschema, I will create that scenario again.

Without adding it to the vschema how can Vitess know about a new reference table? Maybe I am misunderstanding. Can you clarify with an example what you expect to work here?

rohit-nayak-ps commented 6 months ago

One extra problem though: I get a panic in vtgate when I use unspecified mode in my application now. I will discuss/report that separately.

I see you logged the stack trace in Slack ...

FYI we have another issue we are looking at, which could impact the routing of reference tables for you as well: https://github.com/vitessio/vitess/issues/15770.

wiebeytec commented 6 months ago

Without adding it to the vschema how can Vitess know about a new reference table? Maybe I am misunderstanding. Can you clarify with an example what you expect to work here?

Isn't that change to the test exactly that? I'm saying that if there is some reference table defined in the sharded keyspace that is not in the unsharded vschema, all reference tables stop working properly.