feat: specifying table schema when creating source/table from avro/protobuf encode whose full schema is externally defined

risingwavelabs / risingwave

feat: specifying table schema when creating source/table from avro/protobuf encode whose full schema is externally defined #12199

Open st1page opened 12 months ago

st1page commented 12 months ago

Currently, we do not allow users to define a schema in the column clause:

CREATE SOURCE(a int, b int) FORMAT PLAIN ENCODE PROTOBUF;
 ERROR: ExecuteError: Protocol error: User-defined schema is not allowed with FORMAT PLAIN ENCODE PROTOBUF

But some users need this to prune columns, especially when creating a table with a connector, because the selected columns determine how many columns are materialized in storage. Casting the source data into the expected data type is also needed.
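To make the two needs concrete, here is a minimal sketch of what pruning and casting would mean for a single decoded record. It is illustrative only and is not RisingWave code: `CASTS`, `apply_declared_schema`, and the type names are hypothetical, and a real implementation would work on the engine's own data types.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical mapping from declared SQL type names to Python casts.
CASTS: Dict[str, Callable[[Any], Any]] = {
    "int": int,
    "double": float,
    "varchar": str,
}


def apply_declared_schema(
    record: Dict[str, Any], declared: List[Tuple[str, str]]
) -> Dict[str, Any]:
    """Keep only the declared columns and cast each to its declared type."""
    return {name: CASTS[ty](record[name]) for name, ty in declared}


# A record decoded from the external schema, with one column the user does not need.
record = {"a": "1", "b": 2, "payload": "unused"}
pruned = apply_declared_schema(record, [("a", "int"), ("b", "double")])
print(pruned)
```

Only `a` and `b` survive, so only those two columns would need to be stored.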

Another issue is that if a user wants to define a generated column, they must specify columns in the create source/table statement. We might need to introduce another syntax: https://github.com/risingwavelabs/risingwave/issues/12209 (fixed)

[ ] avro
[ ] protobuf
[ ] json with schema registry

fuyufjh commented 12 months ago

How can a user specify a subset of the schema when using a schema registry?

It feels like this is not necessary. They may just ignore these columns when creating MVs on this source.

How can a generated column be used with a schema registry?

+1 for your proposed syntax.

st1page commented 11 months ago

> How can a user specify a subset of the schema when using a schema registry?
>
> It feels like this is not necessary. They may just ignore these columns when creating MVs on this source.

But when they create tables with a primary key, the specified columns determine which columns are materialized in storage.

hzxa21 commented 10 months ago

> How can a user specify a subset of the schema when using a schema registry?
>
> It feels like this is not necessary. They may just ignore these columns when creating MVs on this source.

One thing semi-related to this issue is #10949. If users can specify a subset of columns, we may be able to filter out unnecessary changes (new row == old row) in the Table's materialize executor during the conflict check.
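A rough sketch of that idea, assuming rows are plain dictionaries: an update is a no-op if the old and new rows agree on every materialized column, even when they differ in columns that were pruned away. The function names here are hypothetical and do not reflect the actual materialize executor.

```python
from typing import Any, Dict, List


def project(row: Dict[str, Any], columns: List[str]) -> Dict[str, Any]:
    """Restrict a row to the materialized columns."""
    return {c: row[c] for c in columns}


def is_noop_update(
    old_row: Dict[str, Any], new_row: Dict[str, Any], columns: List[str]
) -> bool:
    """True when the update changes nothing in the materialized columns."""
    return project(old_row, columns) == project(new_row, columns)


old = {"id": 1, "a": 10, "payload": "x"}
new = {"id": 1, "a": 10, "payload": "y"}

# With only (id, a) materialized, the change to `payload` is invisible.
print(is_noop_update(old, new, ["id", "a"]))
# With every column materialized, the same update is a real change.
print(is_noop_update(old, new, ["id", "a", "payload"]))
```

The narrower the declared column set, the more upstream updates collapse into no-ops that the conflict check could drop.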

tabVersion commented 8 months ago

> How can a user specify a subset of the schema when using a schema registry?
>
> It feels like this is not necessary. They may just ignore these columns when creating MVs on this source.
>
> But when they create tables with a primary key, the specified columns determine which columns are materialized in storage.

This feature does not seem to be a pain point, but I still have some concerns about compatibility when the schema is updated. Let's keep this open and make it a ramp-up task.

tabVersion commented 6 months ago

We can allow user-defined parts only when both the name and the type match the ones mapped from avro/pb. Let's take another look, cc @st1page
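As a sketch of that rule, assuming the external schema has already been mapped to a `name -> type` dictionary: each user-defined column is accepted only if the name exists in the mapped schema and the declared type equals the mapped type. `validate_user_columns` and the type strings are hypothetical, not existing RisingWave APIs.

```python
from typing import Dict, List, Tuple


def validate_user_columns(
    resolved: Dict[str, str], user_defined: List[Tuple[str, str]]
) -> List[str]:
    """Return error messages for mismatched columns; an empty list means accepted."""
    errors = []
    for name, ty in user_defined:
        if name not in resolved:
            errors.append(f"column {name!r} not found in external schema")
        elif resolved[name].lower() != ty.lower():
            errors.append(
                f"column {name!r}: declared {ty}, external schema maps to {resolved[name]}"
            )
    return errors


# Hypothetical result of mapping an avro/pb schema to column types.
resolved = {"a": "int", "b": "int", "c": "varchar"}

# A strict subset with matching names and types is accepted.
print(validate_user_columns(resolved, [("a", "int"), ("b", "int")]))
# A type mismatch and an unknown column are both rejected.
print(validate_user_columns(resolved, [("a", "bigint"), ("d", "int")]))
```

Under this rule the column clause can only prune. It cannot rename or cast, which sidesteps some of the compatibility questions around schema updates.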

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.