risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.

feat(sink): `auto.create` for automatically creating sink tables/topics #13139

Open xiangjinwu opened 10 months ago

xiangjinwu commented 10 months ago

Is your feature request related to a problem? Please describe.

https://docs.confluent.io/platform/current/schema-registry/schema_registry_onprem_tutorial.html#auto-schema-registration

This is very convenient in development environments, but in production environments we recommend that client applications do not automatically register new schemas. Best practice is to register schemas outside of the client application to control when schemas are registered with Schema Registry and how they evolve.

Even for a simple table such as create table t (k int, val varchar); the Avro definition could be:

{
  "type": "record",
  "name": "t",
  "fields": [
    {"name": "k", "type": ["null", "int"]},
    {"name": "val", "type": ["null", "string"]}
  ]
}

Inputting this manually is cumbersome and error-prone.
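
To illustrate how mechanical this mapping is, here is a minimal Python sketch that derives the Avro record above from a column list. The `SQL_TO_AVRO` table and the `avro_schema_for` helper are hypothetical names used only for illustration; the type mapping is an assumption and not necessarily what RisingWave would generate.

```python
import json

# Hypothetical mapping from a few SQL types to Avro primitive types.
# This is an assumption for illustration, not RisingWave's actual mapping.
SQL_TO_AVRO = {
    "int": "int",
    "bigint": "long",
    "varchar": "string",
    "boolean": "boolean",
    "double precision": "double",
}


def avro_schema_for(table_name, columns):
    """Build an Avro record schema in which every column is a nullable union."""
    fields = [
        {"name": name, "type": ["null", SQL_TO_AVRO[sql_type.lower()]]}
        for name, sql_type in columns
    ]
    return {"type": "record", "name": table_name, "fields": fields}


# Columns of: create table t (k int, val varchar);
schema = avro_schema_for("t", [("k", "int"), ("val", "varchar")])
print(json.dumps(schema, indent=2))
```

Every column becomes a `["null", <type>]` union because the table columns are nullable; an unsupported SQL type raises `KeyError` rather than being silently guessed.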

Task items

Describe the solution you'd like

https://github.com/risingwavelabs/risingwave/blob/main/e2e_test/sink/kafka/avro.slt

create sink sink0 from into_kafka with (
  connector = 'kafka',
  topic = 'test-rw-sink-upsert-avro',
  properties.bootstrap.server = 'message_queue:29092',
  primary_key = 'int32_field,string_field')
format upsert encode avro (
  schema.registry = 'http://message_queue:8081',
  auto.register.schemas = true);

Describe alternatives you've considered

Additional context

Limitations of an auto-generated definition:

neverchanje commented 10 months ago

Once we can show the sink schema via a SQL command, we can implement a Python script that auto-registers the schema with the schema registry. It doesn't have to be implemented in the kernel unless we see similar requirements.

hzxa21 commented 10 months ago

Once we can show the sink schema via a SQL command, we can implement a Python script that auto-registers the schema with the schema registry. It doesn't have to be implemented in the kernel unless we see similar requirements.

Hmm... Where will the Python script run, and who will run it? If it is run by the DB admin rather than the SQL user, and it is not run in SQL, I think it is still a development burden.

neverchanje commented 9 months ago

It can be anyone who has the privileges to read table columns and to register schemas with Schema Registry.
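
For reference, such an external script could be quite small. The sketch below only builds the HTTP request for the Schema Registry REST endpoint `POST /subjects/{subject}/versions`. It assumes the default `<topic>-value` subject naming and reuses the registry URL and topic from the sink example above. `build_register_request` is a hypothetical helper name, and the schema is the one for table `t`.

```python
import json
import urllib.request


def build_register_request(registry_url, topic, avro_schema):
    """Build (but do not send) a request that registers a value schema."""
    # Assumption: the default subject naming strategy, "<topic>-value".
    subject = f"{topic}-value"
    # The registry expects the Avro schema as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(avro_schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{registry_url.rstrip('/')}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


schema = {
    "type": "record",
    "name": "t",
    "fields": [
        {"name": "k", "type": ["null", "int"]},
        {"name": "val", "type": ["null", "string"]},
    ],
}
req = build_register_request(
    "http://message_queue:8081", "test-rw-sink-upsert-avro", schema
)
# Sending is left out so the sketch runs without a live registry:
# urllib.request.urlopen(req)
```

Whoever runs this needs network access to the registry and permission to write to the subject, which is the operational burden raised above.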

neverchanje commented 9 months ago

As discussed previously with Patrick, we agree that auto-registering schemas is a valid requirement. Almost all commercial Extract & Load systems, such as Fivetran, create downstream tables automatically. https://github.com/risingwavelabs/risingwave/issues/13718

For the user-facing option name, let's align on auto.create. We'll support this option for other sinks as well in the future.