Open lmatz opened 4 months ago
I tried an example, say I created two tables in MYSQL:
CREATE TABLE ExampleTable (
id INT AUTO_INCREMENT PRIMARY KEY, -- Integer type with auto-increment
name VARCHAR(255) NOT NULL, -- Variable character string
birth_date DATE, -- Date type
email VARCHAR(255), -- Variable character string
is_active BOOLEAN DEFAULT TRUE, -- Boolean type
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- Timestamp with default current time
);
CREATE TABLE AnotherExampleTable (
name VARCHAR(255) NOT NULL, -- Variable character string
email VARCHAR(255), -- Variable character string
id INT AUTO_INCREMENT PRIMARY KEY, -- Integer type with auto-increment
birth_date DATE, -- Date type
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- Timestamp with default current time
is_active BOOLEAN DEFAULT TRUE -- Boolean type
);
They have the same set of columns, but the order of columns is different. It will lead to different “column ordinal” in the information schema:
mysql> SELECT TABLE_NAME, COLUMN_NAME, ORDINAL_POSITION
-> FROM information_schema.COLUMNS
-> WHERE TABLE_NAME IN ('ExampleTable', 'AnotherExampleTable')
-> AND TABLE_SCHEMA = 'public';
+---------------------+-------------+------------------+
| TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION |
+---------------------+-------------+------------------+
| AnotherExampleTable | name | 1 |
| AnotherExampleTable | email | 2 |
| AnotherExampleTable | id | 3 |
| AnotherExampleTable | birth_date | 4 |
| AnotherExampleTable | created_at | 5 |
| AnotherExampleTable | is_active | 6 |
| ExampleTable | id | 1 |
| ExampleTable | name | 2 |
| ExampleTable | birth_date | 3 |
| ExampleTable | email | 4 |
| ExampleTable | is_active | 5 |
| ExampleTable | created_at | 6 |
+---------------------+-------------+------------------+
12 rows in set (0.01 sec)
The schema discovery library we use will order the columns based on the column ordinals: https://github.com/SeaQL/sea-schema/blob/master/src/mysql/query/column.rs#L72
The ordering produced by this 3rd library will be inherited by RW: https://github.com/risingwavelabs/risingwave/blob/release-1.10/src/connector/src/source/cdc/external/mysql.rs#L102-L118
I don't know if we should do anything on our end to avoid this inconvenience. cc: @StrikeW @neverchanje
The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.
IIUC, this is determined by their business, there exist tables have same columns but in different order, am I right?
It is surprised to me that UNION
requires the ordering of result set to be same, because when we union two sets, we don't care about the ordering of sets. So one solution is to implement the CORRESPONDING
clause (link1, link2).
IIUC, this is determined by their business, there exist tables have same columns but in different order, am I right?
Yes, and the upstream tables are managed by a different team. The user is not allowed to alter the orderings of the columns in the upstream table.
CORRESPONDING
Thanks, this is a good idea
Sounds like CORRESPONDING
is the standard version of BY NAME
mentioned in https://github.com/risingwavelabs/risingwave/issues/15637#issuecomment-1991060486
Without it, the user can easily workaround by manually reordering the columns before union. Or for ease of reuse, create a view
over the table just to reorder columns, and use the view rather than underlying table afterwards. Specifying all columns again in correct order may sound receptive but it is no harder than redefining the upstream with correct column order, which is out of their control but willing to do. - Or does turning off aut-schema detection work?
Furthermore, the original issue description stated:
The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.
The example in https://github.com/risingwavelabs/risingwave/issues/17451#issuecomment-2189145094 above does not satisfy the even when part. Does auto-schema detection guarantee the column orders in RisingWave is always the same as upstream?
Without it, the user can easily workaround by manually reordering the columns before union. Or for ease of reuse, create a view over the table just to reorder columns, and use the view rather than underlying table afterwards. Specifying all columns again in correct order may sound receptive but it is no harder than redefining the upstream with correct column order, which is out of their control but willing to do.
Manual re-ordering by specifying the names of columns again makes the auto schema detection essentially meaningless because we want to avoid typing the names in the first place. But we now need to do it again.
There are different levels of "willing". I believe the fact that auto-schema detection is implemented in RW and other systems indicates that people are not really that willing.
For the change of the upstream table, the producer just needs to do it once. As the consumers of the upstream table via CDC, they can be different people/teams. So everyone needs to do the same again. I have little confidence that the upstream team/or one of the downstream teams is willing to provide a script for the rest so the rest saves the trouble. I just feel that an average organization does not operate in this "nice" way.
The example in https://github.com/risingwavelabs/risingwave/issues/17451#issuecomment-2189145094 above does not satisfy the even when part. Does auto-schema detection guarantee the column orders in RisingWave are always the same as upstream?
As the two lines of code (one in sea and one in RW) suggest, "column ordinal", aka the order of how you specify columns in the table definition determines the orderings we get when auto-detecting the schema from the upstream database to RW.
Another approach is that we just sort the columns by their name on our side, is there any concern?
Manual re-ordering by specifying the names of columns again makes the auto schema detection essentially meaningless because we want to avoid typing the names in the first place. But we now need to do it again.
However, users are also not able to directly UNION
these two tables in the upstream system (without BY NAME
). 😕 I don't think it's a problem that we should help to address. Though supporting UNION BY NAME
is definitely good.
However, users are also not able to directly UNION these two tables in the upstream system (without BY NAME). 😕 I don't think it's a problem that we should help to address.
I don't get the logic here it's fine if PG/MySQL decides not to help with this case.
But what exactly prevents us from helping in this case?
It seems we agreed to help via UNION BY NAME
. :eyes: Where's the disagreement?
Another approach is that we just sort the columns by their name on our side, is there any concern?
It would be a surprise to user that the order of columns in cdc table is different from its upstream.
I think the elegant way is to support UNION CORRESPONDING
. The other quick and dirty way is to add a with option to CREATE TABLE
to sort columns by name.
It seems we agreed to help via UNION BY NAME. 👀 Where's the disagreement?
Why not one step further, automatically order the columns of the CDC table according to the name.
I think users can live with the "UNION BY NAME" or "UNION CORRESPONDING" approach. I am not against it.
But I don't see why ordering the columns of the table by name is bad.
It would be a surprise to user that the order of columns in cdc table is different from its upstream.
But this is a big assumption. In this particular case, the user is more surprised why it does not work automatically with columns being the same.
I cannot imagine why the user would care about the order of columns in the table other than the union
case.
If there is a meaningful example, case closed. I don't have any counter-arguments anymore. But right now there is none proposed in this thread.
Why not one step further, automatically order the columns of the CDC table according to the name.
I don't understand what you mean by "automatically order". How to order? :eyes:
select *
(perhaps not that important)UNION
, if user have t1(a int, b int)
, t2(b int, a int)
, you can see order will affect the result for UNION
without BY NAME
. That's why we cannot arbitrarily reorder according to what we think might be good. @StrikeW 's idea of introducing an option is somehow more reasonable.I don't understand what you mean by "automatically order". How to order? 👀
If the upstream table is create table t(z int, y int, a int)
, we re-order the columns in the schema according to its name alphabetically,
i.e. create table t(a int, y int, z int)
when the user define the create table t(*) from CDC source
in RW.
you can see order will affect the result for UNION without BY NAME
This has to be discussed under the context of auto schema detection for the CDC tables.
It's weird to see that people would do this to trick themselves on purpose
i.e. having two+ tables essentially the same (the reason why they want to union in the first place) and then defining two columns with interchangeable names in two+ tables only to union t1's a
and t2's b
together instead of t1's a
and t2's a
.
(never mind if you thought I was referring to automatically re-ordering in union
query instead of re-ordering in the source table)
This issue occurs again when we later implement the auto-union feature for CDCing multiple upstream databases/tables. It leaves us with no choice that we have to have the same schema for all the tables involved. (IF the feature is really viable at all)
cc @xiangjinwu @neverchanje might want to chime in
I agree with @lmatz that if the upstream schema is automatically mapped to RW, we don't have to preserve the original column order, especially considering that we'll support CDC from sharded tables and auto schema change in future.
In general, the alignment of column orders will be a nice-to-have, but not a must. We don't have guarantees for it and users should not rely on this behavior.
On the other hand, I also think enforcing the order of both sides of UNION is unnecessary. I tried running a query that unions different column orders in duckdb and mysql, and they both produce results successfully.
Say, giving a query <left> UNION <right>
, duckdb and mysql use the schema of <left>
as the output schema.
Therefore, here is my proposal:
<left>
by default, thus eliminating the need to introduce the new keyword CORRESPONDING
..Things could be simplified a little bit if we completely ignore the column order of CDC tables, as proposed by Martin. But user will still encounter such inconvenience when dealing with UNION of views, materialized views, Kafka tables, etc.
Sharing my thoughts:
UNION CORRESPONDING
: good extension, but cannot be the implicit defaultalter
This has to be discussed under the context of auto schema detection for the CDC tables.
It's weird to see that people would do this to trick themselves on purpose
But a SQL union
executor would not have different behaviors when the input is a CDC table or any other relation. By SQL standard and all existing SQL databases, union
works by column order rather than names. That's why CORRESPONDING
was introduced later to provide a more human-friendly alternative.
On the other hand, I also think enforcing the order of both sides of UNION is unnecessary. I tried running a query that unions different column orders in duckdb and mysql, and they both produce results successfully.
UNION, INTERSECT and other set operations: use the schema of
by default, thus eliminating the need to introduce the new keyword CORRESPONDING
.
Did not really get what you mean. I tried select 7 as a, 8 as b union all select 9 as b, 4 as a;
on both MySQL 8.0 and DuckDB 1.0.0 and they all union by position. BY NAME
(CORRESPONDING
) is necessary for DuckDB to switch to the alternative behavior.
Thanks. I got it wrong.
That's why CORRESPONDING was introduced later to provide a more human-friendly alternative.
My updated proposal:
alter
does not guarantee column order.Do you think it's easy enough to be implemented correctly?
Addendum: The reason DuckDB uses by name
instead of corresponding
is that they decided to support a different number of columns on each side of the union, filling missing columns on the other side with NULL. I believe we do not need to support this behavior, and in fact, this silent tolerance might be harmful. Therefore, I think we should still use corresponding
.
https://github.com/duckdb/duckdb/issues/4320#issuecomment-1209245589
But a SQL union executor would not have different behaviors when the input is a CDC table or any other relation.
If the upstream table is create table t(z int, y int, a int), we re-order the columns in the schema according to its name alphabetically, i.e. create table t(a int, y int, z int) when the user define the create table t(*) from CDC source in RW.
I don't get it. The idea of re-ordering columns by us does not alter the behavior of "union".
The behavior of the union
stays the same.
Describe the bug
The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.
Therefore,
union
cannot union them by default via*
.Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response