Closed neverchanje closed 3 months ago
Basically +1 for this.
Note that this question should be discussed separately for each sink. Some cases don't actually need any discussion.
While, `auto_schema_create` and `false` by default

Thanks Eric,
Kafka sink: It will automatically create missing topics, which is the default behavior of Kafka producer. Perhaps other MQs have similar behavior, I guess.
For now, our Kafka sink does not guarantee that the topic will eventually be created. Auto topic creation happens on the Kafka side, via the `auto.create.topics.enable` option rather than `AdminClient`. If the Kafka server disables this option, the create sink statement will still succeed without creating the topic.
On the other hand, the sink also needs to register the topic schema if a schema registry is provided.
@tabVersion @xiangjinwu Your opinions are important as well! Please comment, and I'll create a tracking issue if there is no objection to this requirement. Our final agreement will include the following aspects:
- `auto.schema.create` option for all sinks. The concrete behaviors may vary for different systems. Let's discuss them specifically in their respective issues.

It is not enough to just add one option. Many auxiliary options may eventually be necessary, and we need a plan for supporting them gradually without putting too much burden on backward compatibility. The auxiliary options I have in mind mostly deal with the misalignment between RisingWave data types and sink data types, when more than one candidate mapping is possible. For example:
- `date` type in downstream (e.g. json): do we represent it as a string or a number?
- `int64` type in downstream (again json): do we represent it as a number (lossy) or a string (proto-json)?
- `varchar` can map to multiple types in downstream (e.g. MySQL `tinytext`/`text`/`mediumtext`/`longtext`): which option do we use?
- `decimal` is fixed scale in downstream (e.g. avro): what `(precision, scale)` value do we use?
- `struct` requires a name in downstream (e.g. avro / proto): what name do we use?

Will focus on writing schema registry for avro first, for the subset of data types that have an unarguable default mapping.
- `["null", "boolean"]`
- `["null", "int"]` (no 16-bit signed integer in avro)
- `["null", "int"]`
- `["null", "long"]`
- `["null", "float"]`
- `["null", "double"]`
- `["null", "string"]`
- `["null", "bytes"]`
- `["null", {"type": "long", "logicalType": "timestamp-micros"}]` (not millis)
- `["null", {"type": "long", "logicalType": "local-timestamp-micros"}]` (not millis)
- `["null", {"type": "int", "logicalType": "date"}]`
- `["null", {"type": "long", "logicalType": "time-micros"}]` (not millis)
- `["null", {"type": "fixed", "size": 12, "logicalType": "duration"}]` (not string as in debezium)
- `["null", {"type": "array", "items": T}]`
- `big-decimal` type AVRO-3779
- `["null", "string"]` or recursive. The former is more debezium compatible while the latter is similar to anything defined in `io.confluent.connect.avro.AvroData`.

We may still require manual creation if the user does not like the default we chose. This helps reduce the number of options needed.
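To make the "default mapping, otherwise manual creation" idea concrete, here is a minimal Python sketch that builds a nullable Avro record schema from a column list. The RisingWave type names used as dictionary keys are my assumption (the list above only gives the Avro side), and `avro_record_schema` is a hypothetical helper, not an existing RisingWave function. Types without an agreed default, such as `decimal`, are rejected so the user falls back to manual creation.

```python
import json

# Keys are ASSUMED RisingWave type names; values are the Avro types listed above.
DEFAULT_AVRO_TYPES = {
    "boolean": "boolean",
    "smallint": "int",  # Avro has no 16-bit signed integer
    "integer": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "varchar": "string",
    "bytea": "bytes",
    "timestamptz": {"type": "long", "logicalType": "timestamp-micros"},
    "timestamp": {"type": "long", "logicalType": "local-timestamp-micros"},
    "date": {"type": "int", "logicalType": "date"},
    "time": {"type": "long", "logicalType": "time-micros"},
}


def avro_record_schema(record_name, columns):
    """Build an Avro record schema where every field is a ["null", T] union.

    `columns` is a list of (column_name, risingwave_type) pairs.
    Raises ValueError for types that have no default mapping in the table above.
    """
    fields = []
    for col_name, rw_type in columns:
        if rw_type not in DEFAULT_AVRO_TYPES:
            raise ValueError(
                f"no default Avro mapping for {rw_type!r}; create the schema manually"
            )
        fields.append({"name": col_name, "type": ["null", DEFAULT_AVRO_TYPES[rw_type]]})
    return {"type": "record", "name": record_name, "fields": fields}


if __name__ == "__main__":
    schema = avro_record_schema("orders", [("id", "bigint"), ("created_at", "timestamptz")])
    print(json.dumps(schema, indent=2))
```

The sketch leaves out `duration`, arrays, and nested records on purpose, since those need a name or a recursive mapping.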
This is a discussion issue and not suitable for tracking tasks. As per the offline discussion with xiangjinwu, we decided to call the option `auto.create` instead of `auto.schema.create` to prevent a naming conflict with PG's 'schema'.
Let's move to https://github.com/risingwavelabs/risingwave/issues/13139 for further task tracking.
Is your feature request related to a problem? Please describe.
This requirement will involve non-trivial development efforts, so let's have a discussion first.
In general, when users create a sink, which may contain dozens of columns, someone needs to create the schema in the sink system. Schema creation is currently left to the users to keep our system simple. The problem remains, even though it is not our concern: the user has to manually write a create table statement, mapping each RisingWave data type to the sink's data type. This is inconvenient, and it causes data faults if the user doesn't strictly follow our type mapping rules. As an enhancement, I propose to let RisingWave automatically create the downstream schema when a new sink is defined.
On the other hand, the downstream may have a predefined schema, e.g., when multiple sinks stream into one table. In this case, auto creation should be optional:
I propose that `auto.schema.create` should be false by default, meaning that we don't take additional steps unless necessary.

Ultimately, I expect the cloud portal to provide this option to users as well. cc @wyhyhyhyh
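As a rough illustration of the proposal, the Python sketch below derives a downstream `CREATE TABLE` statement from the sink's columns, and only when the option is explicitly enabled. Everything here is hypothetical: `plan_sink_ddl` is not an existing function, the MySQL type choices are placeholders rather than RisingWave's actual mapping rules, and identifier quoting is omitted for brevity.

```python
# Placeholder RisingWave -> MySQL type choices, for illustration only.
HYPOTHETICAL_MYSQL_TYPES = {
    "integer": "INT",
    "bigint": "BIGINT",
    "varchar": "TEXT",  # tinytext/mediumtext/longtext would be equally valid
    "date": "DATE",
}


def plan_sink_ddl(table, columns, options):
    """Return the CREATE TABLE statement to run downstream, or None.

    `columns` is a list of (column_name, risingwave_type) pairs and
    `options` is the sink's option map with string values.
    """
    # Proposed default is false: do nothing extra unless the user opts in.
    if options.get("auto.schema.create", "false").lower() != "true":
        return None
    column_defs = ", ".join(
        f"{name} {HYPOTHETICAL_MYSQL_TYPES[rw_type]}" for name, rw_type in columns
    )
    # IF NOT EXISTS keeps a predefined table (e.g. shared by several sinks) untouched.
    return f"CREATE TABLE IF NOT EXISTS {table} ({column_defs})"


if __name__ == "__main__":
    cols = [("id", "bigint"), ("name", "varchar")]
    print(plan_sink_ddl("orders", cols, {}))
    print(plan_sink_ddl("orders", cols, {"auto.schema.create": "true"}))
```

With the option unset the function returns `None`, which matches the "false by default" behavior proposed above.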
Describe the solution you'd like
- `auto.register.schemas` for Kafka Protobuf/Avro sink https://github.com/risingwavelabs/risingwave/issues/13139
- `auto.register.schemas` for Postgres sink
- `auto.register.schemas` for MySQL sink
- `auto.register.schemas` for Clickhouse sink
- `auto.register.schemas` for Doris sink

Describe alternatives you've considered
No response
Additional context
No response