risingwavelabs / risingwave

Discussion: Create downstream schemas when creating the sink #13718

Closed neverchanje closed 3 months ago

neverchanje commented 7 months ago

Is your feature request related to a problem? Please describe.

This requirement will involve non-trivial development effort, so let's have a discussion first.

In general, when users create a sink, which may contain dozens of columns, someone has to create the matching schema in the sink system. For the simplicity of our system, schema creation is currently left to the users. The problem still exists for them, even though we do not handle it: the user has to write a CREATE TABLE statement by hand, mapping each RisingWave data type to the sink's data type. This is inconvenient, and it can cause data errors if the user doesn't strictly follow our type mapping rules. As an enhancement, I propose to let RisingWave automatically create the downstream schema when a new sink is defined.
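As a rough illustration of what automatic schema creation would involve, the sketch below builds a downstream CREATE TABLE statement from a list of sink columns. The function name `build_create_table`, the `TYPE_MAP` table, and the target type names are assumptions made for illustration; they are not RisingWave's actual type mapping rules, which would differ per sink.

```python
# Hypothetical mapping from RisingWave type names to a downstream system's
# type names. Illustrative only; real rules differ per sink.
TYPE_MAP = {
    "int": "INTEGER",
    "bigint": "BIGINT",
    "varchar": "TEXT",
    "boolean": "BOOLEAN",
    "double precision": "DOUBLE PRECISION",
}


def build_create_table(table, columns):
    """Return a CREATE TABLE statement for (name, risingwave_type) pairs.

    Raises ValueError for a type with no known mapping instead of guessing.
    """
    parts = []
    for name, rw_type in columns:
        key = rw_type.strip().lower()
        if key not in TYPE_MAP:
            raise ValueError(f"no downstream mapping for type {rw_type!r}")
        parts.append(f'"{name}" {TYPE_MAP[key]}')
    column_list = ", ".join(parts)
    return f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})'


if __name__ == "__main__":
    print(build_create_table("t", [("id", "bigint"), ("name", "varchar")]))
```

Rejecting unmapped types, rather than falling back to a generic type, keeps the generated schema from silently diverging from the documented mapping.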

On the other hand, the downstream may already have a predefined schema, e.g., when multiple sinks stream into one table. In this case, the auto creation should be optional:

create sink s1 from t with (
  auto.schema.create = 'false'
)

I propose that auto.schema.create should be false by default, meaning that we don't do additional steps unless necessary.
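The default-off behavior could be read from the sink's WITH options roughly as follows. This is a minimal sketch: the helper name `auto_schema_create_enabled` and the strict 'true'/'false' parsing are assumptions, not an existing RisingWave function.

```python
def auto_schema_create_enabled(with_options):
    """Read the proposed `auto.schema.create` option from a sink's WITH options.

    A missing key means False, matching the proposed default.
    Hypothetical helper; the parsing rules here are an assumption.
    """
    raw = with_options.get("auto.schema.create", "false")
    value = raw.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(
        f"auto.schema.create must be 'true' or 'false', got {raw!r}"
    )


if __name__ == "__main__":
    print(auto_schema_create_enabled({}))
    print(auto_schema_create_enabled({"auto.schema.create": "true"}))
```

With this shape, existing sinks that never set the option keep today's behavior, and a misspelled value is reported instead of being treated as false.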

Ultimately, I expect the cloud portal should provide this option to users as well. cc @wyhyhyhyh

Describe the solution you'd like

Describe alternatives you've considered

No response

Additional context

No response

fuyufjh commented 7 months ago

Basically +1 for this.

Note that this question should be discussed for each sink separately. Some cases actually don't need any discussion.

While,

neverchanje commented 7 months ago

Thanks Eric,

Kafka sink: It will automatically create missing topics, which is the default behavior of Kafka producer. Perhaps other MQs have similar behavior, I guess.

For now, our Kafka sink does not guarantee that the topic will eventually be created. Auto topic creation is done on the Kafka side, via the auto.create.topics.enable option, rather than through AdminClient. If the Kafka server disables this option, create sink will still succeed without creating the topic.
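To make the gap concrete, here is a sketch of an explicit check at sink creation time. `InMemoryAdmin` is a stand-in written for this example and does not mirror the Kafka AdminClient API; `ensure_topic` and its return values are likewise hypothetical.

```python
class InMemoryAdmin:
    """Stand-in for a broker admin client (hypothetical, not the Kafka API)."""

    def __init__(self, topics=()):
        self.topics = set(topics)

    def topic_exists(self, name):
        return name in self.topics

    def create_topic(self, name):
        self.topics.add(name)


def ensure_topic(admin, topic, auto_create):
    """Check the topic when the sink is created instead of relying on the broker.

    Returns "exists" or "created"; raises if the topic is missing and
    auto-creation is turned off.
    """
    if admin.topic_exists(topic):
        return "exists"
    if not auto_create:
        raise RuntimeError(
            f"topic {topic!r} does not exist and auto-creation is disabled"
        )
    admin.create_topic(topic)
    return "created"


if __name__ == "__main__":
    admin = InMemoryAdmin(["orders"])
    print(ensure_topic(admin, "orders", auto_create=False))
    print(ensure_topic(admin, "payments", auto_create=True))
```

The point of the sketch is only that the outcome becomes visible to the user at create sink time, whichever way the option is set.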

On the other hand, the sink also needs to register the topic schema if a schema registry is provided.

@tabVersion @xiangjinwu Your opinions are important as well! Please comment, and I'll create a tracking issue if there is no objection to this requirement. Our final agreement will include the following aspects:

  1. We'll provide an auto.schema.create option for all sinks. The concrete behaviors may vary for different systems. Let's discuss the specifics in their respective issues.
  2. The option value is false by default.

xiangjinwu commented 6 months ago

It is not enough to just add one option. A lot of auxiliary options may be necessary eventually, and we need a plan for how to support them gradually without incurring too much burden on backward compatibility. The auxiliary options I have in mind mostly deal with the misalignment between RisingWave data types and sink data types, when more than one candidate mapping is possible. For example:

xiangjinwu commented 6 months ago

I will focus on writing to the schema registry for Avro first, for the subset of data types that have an indisputable default mapping.

We may still require manual creation if the user does not like the default we chose. This helps reduce the number of options needed.
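The "default subset, otherwise manual" approach might look like the sketch below, which produces an Avro record schema only when every column type has an obvious primitive counterpart. The contents of `DEFAULT_AVRO_TYPES` and the function name are assumptions for illustration and are not the mapping RisingWave implements; nullability and logical types are left out on purpose.

```python
import json

# Assumed "obvious" mappings from RisingWave type names to Avro primitive
# types. Illustrative subset only; nullable unions are ignored for brevity.
DEFAULT_AVRO_TYPES = {
    "boolean": "boolean",
    "int": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "varchar": "string",
    "bytea": "bytes",
}


def avro_record_schema(name, columns):
    """Build an Avro record schema for (column, risingwave_type) pairs.

    Types outside the default subset raise ValueError, signalling that the
    schema has to be created manually.
    """
    fields = []
    for col, rw_type in columns:
        avro_type = DEFAULT_AVRO_TYPES.get(rw_type.strip().lower())
        if avro_type is None:
            raise ValueError(
                f"column {col!r}: type {rw_type!r} has no default mapping; "
                "create the schema manually"
            )
        fields.append({"name": col, "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}


if __name__ == "__main__":
    schema = avro_record_schema("t", [("id", "bigint"), ("name", "varchar")])
    print(json.dumps(schema, indent=2))
```

Types with several plausible encodings, such as timestamps or decimals, fall through to the error path here, which is what keeps the number of extra options small.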

neverchanje commented 3 months ago

This is a discussion issue and not suitable for tracking tasks. As per the offline discussion with xiangjinwu, we decided to call the option auto.create instead of auto.schema.create to avoid a naming conflict with PG's 'schema'.

Let's move to https://github.com/risingwavelabs/risingwave/issues/13139 for further task tracking.