the auto schema detection does not order the columns in the same way for different ables

lmatz commented 4 months ago

Describe the bug

The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.

Therefore, union cannot union them by default via *.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

lmatz commented 4 months ago

I tried an example, say I created two tables in MYSQL:

CREATE TABLE ExampleTable (
    id INT AUTO_INCREMENT PRIMARY KEY,        -- Integer type with auto-increment
    name VARCHAR(255) NOT NULL,               -- Variable character string
    birth_date DATE,                          -- Date type
    email VARCHAR(255),                       -- Variable character string
    is_active BOOLEAN DEFAULT TRUE,           -- Boolean type
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- Timestamp with default current time
);

CREATE TABLE AnotherExampleTable (
    name VARCHAR(255) NOT NULL,               -- Variable character string
    email VARCHAR(255),                       -- Variable character string
    id INT AUTO_INCREMENT PRIMARY KEY,        -- Integer type with auto-increment
    birth_date DATE,                          -- Date type
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- Timestamp with default current time
    is_active BOOLEAN DEFAULT TRUE            -- Boolean type
);

They have the same set of columns, but the order of columns is different. It will lead to different “column ordinal” in the information schema:

mysql> SELECT TABLE_NAME, COLUMN_NAME, ORDINAL_POSITION
    -> FROM information_schema.COLUMNS
    -> WHERE TABLE_NAME IN ('ExampleTable', 'AnotherExampleTable')
    -> AND TABLE_SCHEMA = 'public';
+---------------------+-------------+------------------+
| TABLE_NAME          | COLUMN_NAME | ORDINAL_POSITION |
+---------------------+-------------+------------------+
| AnotherExampleTable | name        |                1 |
| AnotherExampleTable | email       |                2 |
| AnotherExampleTable | id          |                3 |
| AnotherExampleTable | birth_date  |                4 |
| AnotherExampleTable | created_at  |                5 |
| AnotherExampleTable | is_active   |                6 |
| ExampleTable        | id          |                1 |
| ExampleTable        | name        |                2 |
| ExampleTable        | birth_date  |                3 |
| ExampleTable        | email       |                4 |
| ExampleTable        | is_active   |                5 |
| ExampleTable        | created_at  |                6 |
+---------------------+-------------+------------------+
12 rows in set (0.01 sec)

The schema discovery library we use will order the columns based on the column ordinals: https://github.com/SeaQL/sea-schema/blob/master/src/mysql/query/column.rs#L72

The ordering produced by this 3rd library will be inherited by RW: https://github.com/risingwavelabs/risingwave/blob/release-1.10/src/connector/src/source/cdc/external/mysql.rs#L102-L118

I don't know if we should do anything on our end to avoid this inconvenience. cc: @StrikeW @neverchanje

StrikeW commented 4 months ago

The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.

IIUC, this is determined by their business, there exist tables have same columns but in different order, am I right?

It is surprised to me that UNION requires the ordering of result set to be same, because when we union two sets, we don't care about the ordering of sets. So one solution is to implement the CORRESPONDING clause (link1, link2).

lmatz commented 4 months ago

IIUC, this is determined by their business, there exist tables have same columns but in different order, am I right?

Yes, and the upstream tables are managed by a different team. The user is not allowed to alter the orderings of the columns in the upstream table.

CORRESPONDING

Thanks, this is a good idea

xiangjinwu commented 4 months ago

Sounds like CORRESPONDING is the standard version of BY NAME mentioned in https://github.com/risingwavelabs/risingwave/issues/15637#issuecomment-1991060486

Without it, the user can easily workaround by manually reordering the columns before union. Or for ease of reuse, create a view over the table just to reorder columns, and use the view rather than underlying table afterwards. Specifying all columns again in correct order may sound receptive but it is no harder than redefining the upstream with correct column order, which is out of their control but willing to do. - Or does turning off aut-schema detection work?

Furthermore, the original issue description stated:

The user reported that even when the upstream table's schemas are the same, the ordering of the columns is "sometimes" not the same.

The example in https://github.com/risingwavelabs/risingwave/issues/17451#issuecomment-2189145094 above does not satisfy the even when part. Does auto-schema detection guarantee the column orders in RisingWave is always the same as upstream?

lmatz commented 4 months ago

Without it, the user can easily workaround by manually reordering the columns before union. Or for ease of reuse, create a view over the table just to reorder columns, and use the view rather than underlying table afterwards. Specifying all columns again in correct order may sound receptive but it is no harder than redefining the upstream with correct column order, which is out of their control but willing to do.

Manual re-ordering by specifying the names of columns again makes the auto schema detection essentially meaningless because we want to avoid typing the names in the first place. But we now need to do it again.

There are different levels of "willing". I believe the fact that auto-schema detection is implemented in RW and other systems indicates that people are not really that willing.

For the change of the upstream table, the producer just needs to do it once. As the consumers of the upstream table via CDC, they can be different people/teams. So everyone needs to do the same again. I have little confidence that the upstream team/or one of the downstream teams is willing to provide a script for the rest so the rest saves the trouble. I just feel that an average organization does not operate in this "nice" way.

The example in https://github.com/risingwavelabs/risingwave/issues/17451#issuecomment-2189145094 above does not satisfy the even when part. Does auto-schema detection guarantee the column orders in RisingWave are always the same as upstream?

As the two lines of code (one in sea and one in RW) suggest, "column ordinal", aka the order of how you specify columns in the table definition determines the orderings we get when auto-detecting the schema from the upstream database to RW.

lmatz commented 4 months ago

Another approach is that we just sort the columns by their name on our side, is there any concern?

BugenZhao commented 4 months ago

Manual re-ordering by specifying the names of columns again makes the auto schema detection essentially meaningless because we want to avoid typing the names in the first place. But we now need to do it again.

However, users are also not able to directly UNION these two tables in the upstream system (without BY NAME). 😕 I don't think it's a problem that we should help to address. Though supporting UNION BY NAME is definitely good.

lmatz commented 4 months ago

However, users are also not able to directly UNION these two tables in the upstream system (without BY NAME). 😕 I don't think it's a problem that we should help to address.

I don't get the logic here it's fine if PG/MySQL decides not to help with this case.

But what exactly prevents us from helping in this case?

xxchan commented 4 months ago

It seems we agreed to help via UNION BY NAME. :eyes: Where's the disagreement?

StrikeW commented 4 months ago

Another approach is that we just sort the columns by their name on our side, is there any concern?

It would be a surprise to user that the order of columns in cdc table is different from its upstream. I think the elegant way is to support UNION CORRESPONDING. The other quick and dirty way is to add a with option to CREATE TABLE to sort columns by name.

lmatz commented 4 months ago

It seems we agreed to help via UNION BY NAME. 👀 Where's the disagreement?

Why not one step further, automatically order the columns of the CDC table according to the name.

I think users can live with the "UNION BY NAME" or "UNION CORRESPONDING" approach. I am not against it.

But I don't see why ordering the columns of the table by name is bad.

It would be a surprise to user that the order of columns in cdc table is different from its upstream.

But this is a big assumption. In this particular case, the user is more surprised why it does not work automatically with columns being the same.

I cannot imagine why the user would care about the order of columns in the table other than the union case.

If there is a meaningful example, case closed. I don't have any counter-arguments anymore. But right now there is none proposed in this thread.

xxchan commented 4 months ago

Why not one step further, automatically order the columns of the CDC table according to the name.

I don't understand what you mean by "automatically order". How to order? :eyes:

Order affects select * (perhaps not that important)
More importantly for UNION, if user have t1(a int, b int), t2(b int, a int), you can see order will affect the result for UNION without BY NAME. That's why we cannot arbitrarily reorder according to what we think might be good. @StrikeW 's idea of introducing an option is somehow more reasonable.

lmatz commented 4 months ago

I don't understand what you mean by "automatically order". How to order? 👀

If the upstream table is create table t(z int, y int, a int), we re-order the columns in the schema according to its name alphabetically, i.e. create table t(a int, y int, z int) when the user define the create table t(*) from CDC source in RW.

you can see order will affect the result for UNION without BY NAME

This has to be discussed under the context of auto schema detection for the CDC tables.

It's weird to see that people would do this to trick themselves on purpose i.e. having two+ tables essentially the same (the reason why they want to union in the first place) and then defining two columns with interchangeable names in two+ tables only to union t1's a and t2's b together instead of t1's a and t2's a.

(never mind if you thought I was referring to automatically re-ordering in union query instead of re-ordering in the source table)

This issue occurs again when we later implement the auto-union feature for CDCing multiple upstream databases/tables. It leaves us with no choice that we have to have the same schema for all the tables involved. (IF the feature is really viable at all)

xxchan commented 4 months ago

cc @xiangjinwu @neverchanje might want to chime in

neverchanje commented 4 months ago

I agree with @lmatz that if the upstream schema is automatically mapped to RW, we don't have to preserve the original column order, especially considering that we'll support CDC from sharded tables and auto schema change in future.

In general, the alignment of column orders will be a nice-to-have, but not a must. We don't have guarantees for it and users should not rely on this behavior.

On the other hand, I also think enforcing the order of both sides of UNION is unnecessary. I tried running a query that unions different column orders in duckdb and mysql, and they both produce results successfully.

Say, giving a query <left> UNION <right>, duckdb and mysql use the schema of <left> as the output schema.

Therefore, here is my proposal:

For CDC from one single table, we still preserve the column order to avoid surprise.
For CDC from multiple tables, the column order of the firstly processed table is used.
UNION, INTERSECT and other set operations: use the schema of <left> by default, thus eliminating the need to introduce the new keyword CORRESPONDING..

Things could be simplified a little bit if we completely ignore the column order of CDC tables, as proposed by Martin. But user will still encounter such inconvenience when dealing with UNION of views, materialized views, Kafka tables, etc.

xiangjinwu commented 4 months ago

Sharing my thoughts:

UNION CORRESPONDING: good extension, but cannot be the implicit default
We shall clearly define one of the following as expected behavior. I would prefer the easiest to implement correctly.
- CDC ingested table has the same column order as upstream
- CDC ingested table has column order sorted alphabetically according to UTF-8
- CDC ingested table does not guarantee column order, especially after alter

This has to be discussed under the context of auto schema detection for the CDC tables.

It's weird to see that people would do this to trick themselves on purpose

But a SQL union executor would not have different behaviors when the input is a CDC table or any other relation. By SQL standard and all existing SQL databases, union works by column order rather than names. That's why CORRESPONDING was introduced later to provide a more human-friendly alternative.

On the other hand, I also think enforcing the order of both sides of UNION is unnecessary. I tried running a query that unions different column orders in duckdb and mysql, and they both produce results successfully.

UNION, INTERSECT and other set operations: use the schema of by default, thus eliminating the need to introduce the new keyword CORRESPONDING.

Did not really get what you mean. I tried select 7 as a, 8 as b union all select 9 as b, 4 as a; on both MySQL 8.0 and DuckDB 1.0.0 and they all union by position. BY NAME (CORRESPONDING) is necessary for DuckDB to switch to the alternative behavior.

neverchanje commented 4 months ago

Thanks. I got it wrong.

That's why CORRESPONDING was introduced later to provide a more human-friendly alternative.

My updated proposal:

For CDC from one single table, we still preserve the column order to avoid surprise. CDC ingested table after alter does not guarantee column order.
For CDC from multiple tables, the column order of the firstly processed table is used.
Support UNION CORRESPONDING.

Do you think it's easy enough to be implemented correctly?

st1page commented 3 months ago

Addendum: The reason DuckDB uses by name instead of corresponding is that they decided to support a different number of columns on each side of the union, filling missing columns on the other side with NULL. I believe we do not need to support this behavior, and in fact, this silent tolerance might be harmful. Therefore, I think we should still use corresponding.

https://github.com/duckdb/duckdb/issues/4320#issuecomment-1209245589

lmatz commented 2 months ago

But a SQL union executor would not have different behaviors when the input is a CDC table or any other relation.

If the upstream table is create table t(z int, y int, a int), we re-order the columns in the schema according to its name alphabetically, i.e. create table t(a int, y int, z int) when the user define the create table t(*) from CDC source in RW.

I don't get it. The idea of re-ordering columns by us does not alter the behavior of "union". The behavior of the union stays the same.

risingwavelabs / risingwave