Discussion: Create downstream schemas when creating the sink

neverchanje commented 7 months ago

Is your feature request related to a problem? Please describe.

This requirement will involve non-trivial development efforts, so let's have a discussion first.

In general, when a user creates a sink, which may contain dozens of columns, someone needs to create the matching schema in the sink system. We currently leave schema creation to the user to keep our system simple. The problem does not go away just because we don't handle it: the user has to write the CREATE TABLE statement by hand, mapping each RisingWave data type to the sink's data type. This is inconvenient and can cause data errors if the user doesn't strictly follow our type mapping rules. As an enhancement, I propose letting RisingWave automatically create the downstream schema when a new sink is defined.

On the other hand, the downstream may already have a predefined schema, e.g., when multiple sinks stream into one table. In this case, auto creation should be optional:

create sink s1 from t with (
  auto.schema.create = 'false'
)

I propose that auto.schema.create should be false by default, meaning that we don't do additional steps unless necessary.
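
For illustration, here is a minimal sketch of what auto creation could look like for a Postgres-style sink. The mapping table `RW_TO_PG`, the function `create_table_ddl`, and the generated DDL are assumptions made up for this example; they are not RisingWave's actual type mapping rules or implementation.

```python
# Hypothetical RisingWave -> PostgreSQL type mapping, for illustration only.
RW_TO_PG = {
    "boolean": "boolean",
    "smallint": "smallint",
    "int": "integer",
    "bigint": "bigint",
    "real": "real",
    "double": "double precision",
    "varchar": "varchar",
    "bytea": "bytea",
    "date": "date",
    "timestamp": "timestamp",
    "timestamptz": "timestamptz",
}


def create_table_ddl(table, columns, options):
    """Return a CREATE TABLE statement, or None when auto creation is off.

    `columns` is a list of (column_name, risingwave_type) pairs and
    `options` is the sink's WITH clause as a dict of strings.
    """
    # The option defaults to 'false', as proposed in this issue.
    if options.get("auto.schema.create", "false").lower() != "true":
        return None
    defs = []
    for name, rw_type in columns:
        if rw_type not in RW_TO_PG:
            # Refuse to guess when there is no agreed default mapping.
            raise ValueError(f"no default mapping for type {rw_type!r}")
        defs.append(f'"{name}" {RW_TO_PG[rw_type]}')
    cols = ", ".join(defs)
    return f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})'
```

With the option left at its default, the function returns `None` and the sink is created against whatever schema already exists downstream.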

Ultimately, I expect the cloud portal should provide this option to users as well. cc @wyhyhyhyh

Describe the solution you'd like

[ ] auto.register.schemas for Kafka Protobuf/Avro sink https://github.com/risingwavelabs/risingwave/issues/13139
[x] #14254
[ ] auto.register.schemas for Postgres sink
[ ] auto.register.schemas for MySQL sink
[ ] auto.register.schemas for Clickhouse sink
[ ] auto.register.schemas for Doris sink

Describe alternatives you've considered

No response

Additional context

No response

fuyufjh commented 7 months ago

Basically +1 for this.

Note that this question should be discussed separately for each sink. Some cases actually don't need any discussion.

Kafka sink: It will automatically create missing topics, which is the default behavior of Kafka producer. Perhaps other MQs have similar behavior, I guess.
Cassandra/Redis: They don't have the concept of a "Table".

While,

For relational databases such as PG, MySQL, ClickHouse, Doris, etc., +1 for providing auto_schema_create and false by default
For data lakes such as Iceberg & Delta Lake, I am afraid the mapping of schema and data types might be a problem because their data model is not natively relational. We may discuss them later.

neverchanje commented 7 months ago

Thanks Eric,

> Kafka sink: It will automatically create missing topics, which is the default behavior of Kafka producer. Perhaps other MQs have similar behavior, I guess.

For now, our Kafka sink does not ensure that the topic is eventually created. Auto topic creation happens on the Kafka side, via the auto.create.topics.enable broker option rather than through AdminClient. If the Kafka server disables this option, create sink still succeeds without creating the topic.

On the other hand, the sink also needs to register the topic schema if a schema registry is provided.
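
To make the difference concrete, here is a toy model of the two behaviors. `FakeBroker` and `create_sink` are hypothetical names invented for this sketch; it uses no real Kafka client and greatly simplifies what a broker does when a producer writes to an unknown topic.

```python
class FakeBroker:
    """In-memory stand-in for a Kafka cluster (hypothetical, illustration only)."""

    def __init__(self, auto_create_topics_enable):
        self.auto_create_topics_enable = auto_create_topics_enable
        self.topics = set()

    def produce(self, topic):
        # Broker-side auto creation only happens when the broker option is on.
        if topic not in self.topics:
            if not self.auto_create_topics_enable:
                raise RuntimeError(f"unknown topic: {topic}")
            self.topics.add(topic)

    def admin_create_topic(self, topic):
        # Explicit creation, as an admin client would request it.
        self.topics.add(topic)


def create_sink(broker, topic, auto_create=False):
    """Model CREATE SINK; return whether the topic exists afterwards."""
    if auto_create and topic not in broker.topics:
        broker.admin_create_topic(topic)
    # Without auto_create the sink is still "created", but nothing
    # guarantees that the topic is there.
    return topic in broker.topics
```

In this model, `create_sink(broker, "orders")` against a broker with auto creation disabled returns `False` even though the sink definition was accepted, which mirrors the gap described above.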

@tabVersion @xiangjinwu Your opinion is important as well! Please comment, and I'll create a tracking issue if there is no objection to this requirement. Our final agreement will include the following aspects:

We'll provide an auto.schema.create option for all sinks. The concrete behaviors may vary for different systems. Let's specifically discuss in their respective issues.
The option value is false by default.

xiangjinwu commented 6 months ago

It is not enough to just add one option. Many auxiliary options may eventually be necessary, and we need a plan for supporting them gradually without putting too much burden on backward compatibility. The auxiliary options I have in mind mostly deal with the misalignment between RisingWave data types and sink data types when more than one candidate mapping is possible. For example:

When there is no date type in the downstream (e.g. JSON), do we represent it as a string or a number?
When there is no int64 type in the downstream (again JSON), do we represent it as a number (lossy) or a string (proto-json)?
When varchar can map to multiple types in the downstream (e.g. MySQL tinytext/text/mediumtext/longtext), which one do we use?
When decimal has a fixed scale in the downstream (e.g. Avro), what (precision, scale) value do we use?
When struct requires a name in the downstream (e.g. Avro / proto), what name do we use?
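
The int64 case can be demonstrated with a short sketch. Python's own `json` module keeps integers exact, so the snippet uses `float()` to emulate a consumer that parses JSON numbers as IEEE 754 doubles (as JavaScript does); the field name `id` is just an example.

```python
import json

# 2**53 + 1 is the smallest positive integer a 64-bit float cannot represent.
big = 2**53 + 1

as_number = json.dumps({"id": big})       # int64 encoded as a JSON number
as_string = json.dumps({"id": str(big)})  # int64 encoded as a JSON string

# Emulate a double-based JSON parser by passing the number through float().
lossy = int(float(json.loads(as_number)["id"]))
# The string encoding survives the round trip unchanged.
exact = int(json.loads(as_string)["id"])
```

Here `lossy` differs from `big` while `exact` equals it, which is why the choice between the two encodings cannot be made silently.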

xiangjinwu commented 6 months ago

Will focus on writing schema registry for avro first, for the subset of data types that have an unarguable default mapping.

[ ] schema registry avro for some types
- boolean -> ["null", "boolean"]
- smallint -> ["null", "int"] (no 16-bit signed integer in avro)
- int -> ["null", "int"]
- bigint -> ["null", "long"]
- real -> ["null", "float"]
- double -> ["null", "double"]
- varchar -> ["null", "string"]
- bytea -> ["null", "bytes"]
- timestamptz -> ["null", {"type": "long", "logicalType": "timestamp-micros"}] (not millis)
- timestamp -> ["null", {"type": "long", "logicalType": "local-timestamp-micros"}] (not millis)
- date -> ["null", {"type": "int", "logicalType": "date"}]
- time -> ["null", {"type": "long", "logicalType": "time-micros"}] (not millis)
- interval -> ["null", {"type": "fixed", "size": 12, "logicalType": "duration"}] (not string as in debezium)
- T[] -> ["null", {"type": "array", "items": T}]
[ ] schema registry avro for remaining types
- decimal: need hint on precision, scale and optionally size
- unreleased version of avro introduces a new big-decimal type AVRO-3779
- jsonb: ["null", "string"] or recursive. The former is more debezium compatible while the latter is similar to anything defined in io.confluent.connect.avro.AvroData.
- struct: RisingWave struct is anonymous but avro requires a name.
- serial / rw_int256
[ ] schema registry for protobuf
[ ] schema registry for json schema
[ ] non-MQ sinks

We may still require manual creation if the user does not like the default we chose. This helps reduce the number of options needed.
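
As a rough sketch of the first checklist item, the default mappings listed above can be turned into an Avro record schema as follows. The function names `avro_type` and `avro_record` and the record name are hypothetical. Array elements are mapped recursively to their own nullable union, which is one reading of `"items": T` and is an assumption. `interval` is left out because an Avro `fixed` type also needs a name, which the list above does not specify.

```python
import json

# Default mappings taken from the list above (without the "null" branch).
PRIMITIVES = {
    "boolean": "boolean",
    "smallint": "int",  # no 16-bit signed integer in Avro
    "int": "int",
    "bigint": "long",
    "real": "float",
    "double": "double",
    "varchar": "string",
    "bytea": "bytes",
    "timestamptz": {"type": "long", "logicalType": "timestamp-micros"},
    "timestamp": {"type": "long", "logicalType": "local-timestamp-micros"},
    "date": {"type": "int", "logicalType": "date"},
    "time": {"type": "long", "logicalType": "time-micros"},
}


def avro_type(rw_type):
    """Map a RisingWave type name to a nullable Avro type."""
    if rw_type.endswith("[]"):
        # Assumption: elements are mapped recursively, nullable union included.
        inner = {"type": "array", "items": avro_type(rw_type[:-2])}
    elif rw_type in PRIMITIVES:
        inner = PRIMITIVES[rw_type]
    else:
        # decimal, jsonb, struct, etc. have no unarguable default.
        raise ValueError(f"no default Avro mapping for {rw_type!r}")
    return ["null", inner]


def avro_record(name, columns):
    """Build an Avro record schema from (column_name, risingwave_type) pairs."""
    fields = [{"name": col, "type": avro_type(t)} for col, t in columns]
    return {"type": "record", "name": name, "fields": fields}


schema = avro_record("SinkRow", [("id", "bigint"), ("tags", "varchar[]")])
schema_json = json.dumps(schema)
```

Types without an agreed default raise an error here, which corresponds to falling back to manual creation.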

neverchanje commented 3 months ago

This is a discussion issue and not suitable for tracking tasks. As per the offline discussion with xiangjinwu, we decided to call the option auto.create instead of auto.schema.create to avoid a naming conflict with PG's 'schema'.

Let's move to https://github.com/risingwavelabs/risingwave/issues/13139 for further task tracking.
