risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

[Avro] Support Union types in Avro Schemas with Schema registry #16273

Closed tollercode closed 4 months ago

tollercode commented 7 months ago

Is your feature request related to a problem? Please describe.

When decoding messages that use a union type within the Avro schema, RW fails because currently only one type is supported per field. This forces schemas to be simplified, e.g. to use only string types, which undermines the ability to use strongly typed schemas.

Describe the solution you'd like

ksql added support for union types inside Avro schemas by creating a 'struct' for the field that can potentially hold each of the different types. E.g. a union type of [null, boolean, double, string] becomes STRUCT<BOOLEAN BOOLEAN, DOUBLE DOUBLE, STRING VARCHAR(STRING)>.

In addition, a wildcard struct operator was introduced to access the struct without knowing the exact fields inside.

See: https://www.confluent.io/blog/announcing-ksqldb-0-27-1/#multi-schema-protobuf-avro-topics

This solution would also work for 'oneOf' types in Protobuf and JSON schemas.
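The union-to-struct mapping described above can be sketched in a few lines of Python. This is only an illustration of the naming rule visible in the ksql example (field name = upper-cased branch name); `KSQL_TYPES` and `union_to_struct` are hypothetical names, and only the three primitive types mentioned in this issue are covered.

```python
# Hypothetical mapping from Avro primitive names to ksql-style column types.
# Only the types that appear in this issue are listed.
KSQL_TYPES = {
    "boolean": "BOOLEAN",
    "double": "DOUBLE",
    "string": "VARCHAR(STRING)",
}


def union_to_struct(branches):
    """Render an Avro union of primitive type names as a STRUCT type string.

    The "null" branch does not get a field of its own; it is dropped here.
    """
    non_null = [b for b in branches if b != "null"]
    fields = [f"{b.upper()} {KSQL_TYPES[b]}" for b in non_null]
    return "STRUCT<" + ", ".join(fields) + ">"


print(union_to_struct(["null", "boolean", "double", "string"]))
# STRUCT<BOOLEAN BOOLEAN, DOUBLE DOUBLE, STRING VARCHAR(STRING)>
```

Each non-null branch becomes one struct field, so the column keeps a distinct type per branch instead of collapsing everything into one.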

Describe alternatives you've considered

A possible alternative is to simply cast union types to strings. This is probably easier to implement, but it would again loosen the strong typing approach.
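To make the trade-off of the string-cast alternative concrete, here is a minimal Python sketch. `cast_union_to_string` is a hypothetical helper, not anything RisingWave or ksql provides, and the lower-case rendering of booleans is an arbitrary choice.

```python
def cast_union_to_string(value):
    """Render any union value as text; None stays None (i.e. SQL NULL)."""
    if value is None:
        return None
    if isinstance(value, bool):
        # Arbitrary choice for this sketch: lower-case boolean literals.
        return "true" if value else "false"
    return str(value)


values = [True, 1.0, "1.0", None]
print([cast_union_to_string(v) for v in values])
# ['true', '1.0', '1.0', None]
```

After the cast, the double `1.0` and the string `"1.0"` can no longer be told apart, which is the loss of typing described above.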

Additional context

No response

tabVersion commented 7 months ago

It may take some time to reach a consensus on whether it is the desired behavior for RisingWave.

tollercode commented 7 months ago

Just as a reference:

KSQL automatically creates the table schema below from the following Avro schema:

{
    "namespace": "com.***.***.mqtt",
    "name": "als.DataMessage",
    "type": "record",
    "fields": [
        {
            "name": "metrics",
            "type": {
                "type": "array",
                "items": {
                    "name": "als_data_metric",
                    "type": "record",
                    "fields": [
                        {
                            "name": "id",
                            "type": "string"
                        },
                        {
                            "name": "name",
                            "type": "string"
                        },
                        {
                            "name": "norm_name",
                            "type": [
                                "null",
                                "string"
                            ],
                            "default": null
                        },
                        {
                            "name": "uom",
                            "type": [
                                "null",
                                "string"
                            ],
                            "default": null
                        },
                        {
                            "name": "data",
                            "type": {
                                "type": "array",
                                "items": {
                                    "name": "dataItem",
                                    "type": "record",
                                    "fields": [
                                        {
                                            "name": "ts",
                                            "type": "string",
                                            "doc": "Timestamp of the metric."
                                        },
                                        {
                                            "name": "value",
                                            "type": [
                                                "null",
                                                "boolean",
                                                "double",
                                                "string"
                                            ],
                                            "doc": "Value of the metric."
                                        }
                                    ]
                                }
                            },
                            "doc": "The data message"
                        }
                    ],
                    "doc": "A metric object"
                }
            },
            "doc": "A list of metrics."
        }
    ]
}

KSQL Table

METRICS | ARRAY<STRUCT<ID VARCHAR(STRING), NAME VARCHAR(STRING), NORM_NAME VARCHAR(STRING), UOM VARCHAR(STRING), DATA ARRAY<STRUCT<TS VARCHAR(STRING), VALUE STRUCT<BOOLEAN BOOLEAN, DOUBLE DOUBLE, STRING VARCHAR(STRING)>>>>>
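As a rough illustration of what a decoded `value` looks like under this struct representation, the Python sketch below places a decoded Avro value into the field matching its type and leaves the other fields null. `decode_union_value` is a hypothetical helper; returning `None` for a null value (rather than a struct of nulls) is an assumption, not confirmed ksql or RisingWave behavior.

```python
def decode_union_value(value):
    """Map one decoded Avro union value onto a struct with one field per branch.

    Assumption: a null union value becomes a NULL struct (None).
    """
    if value is None:
        return None
    row = {"BOOLEAN": None, "DOUBLE": None, "STRING": None}
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        row["BOOLEAN"] = value
    elif isinstance(value, (int, float)):
        row["DOUBLE"] = float(value)
    elif isinstance(value, str):
        row["STRING"] = value
    else:
        raise TypeError(f"unsupported union value: {value!r}")
    return row


print(decode_union_value(21.5))
# {'BOOLEAN': None, 'DOUBLE': 21.5, 'STRING': None}
```

At most one field is non-null per row, so a query can pick the branch it cares about by field name.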

tollercode commented 6 months ago

Any updates on this? This is the only reason blocking us from migrating away from KSQL

fuyufjh commented 5 months ago

Also, IIUC, a union with null in an Avro schema basically means the field is optional (i.e. nullable); see the reference below.

Reference: https://stackoverflow.com/questions/29299610/is-it-possible-to-have-an-optional-field-in-an-avro-schema-i-e-the-field-does

Additionally, as a result, union might be more frequently used in Avro compared with Protobuf's oneof.
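The distinction between an "optional" union and a genuine multi-type union can be sketched as a small check in Python. `as_optional` is a hypothetical helper name; it treats a union of "null" plus exactly one other branch as a nullable field of that branch's type.

```python
def as_optional(avro_type):
    """Return the inner type if avro_type is ["null", T] (in either order), else None."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        # Exactly one non-null branch, and "null" was actually present.
        if len(non_null) == 1 and len(non_null) < len(avro_type):
            return non_null[0]
    return None


print(as_optional(["null", "string"]))                       # string
print(as_optional(["null", "boolean", "double", "string"]))  # None
```

In the schema posted earlier in this thread, `norm_name` and `uom` fall into the first case, while `value` is a genuine multi-type union.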

fuyufjh commented 5 months ago

as a result, union might be more frequently used in Avro compared with Protobuf's oneof.

This might be wrong. Posting the comments from @xiangjinwu here:

Disagree with this. It was controversial (2015, 2018) until Confluent used it to represent multiple schemas in a single topic in 2020, which was then added to ksql in 2022 (2 years later). My experience here may be limited though.

It is already mid-2024 and I am not against supporting it. But the design space is still quite open compared to other more common data types. I will try to list the details I am concerned about before the end of next week.

xiangjinwu commented 5 months ago

Similar to map (#13387), we considered several ways to support it:

However, the interface and semantics of a native union are not as universal as those of map across databases or programming languages. To avoid committing to a premature design, we will not do it right now. Out of the workarounds:

So we will follow the original struct design without a tag for the initial version. To elaborate on its abilities and restrictions: