risingwavelabs / risingwave


Support logicalType used in Avro Union type #17616

Open xxchan opened 3 months ago

xxchan commented 3 months ago

Split off from the discussion in https://github.com/risingwavelabs/risingwave/pull/17485#discussion_r1668012859

The problem with logicalType in a Union

First of all, in Avro

There are two main questions:

  1. An invalid schema (a union containing both a logical type AND its underlying physical type): ["int", {"type":"int", "logicalType": "date"}].
    • If the SDK is correct, such a schema should never be created. (https://issues.apache.org/jira/browse/AVRO-2380) But is it possible that the user is stuck with a buggy Avro writer because they cannot control the producer?
    • Note that even if such a schema is allowed, the logicalType variant can never be constructed.
    • Therefore, we might want to be tolerant and allow it. This point makes some sense (like allowing out-of-range JSON numbers?), but the Rust Avro lib currently rejects it.
    • Java rejects it (correctly), Rust allows it.
  2. Which field name to use for logical types (the logical type name or the physical type name).
    • e.g., struct<"double" double, "date" date> vs struct<"double" double, "int" date>
    • Either can work, as we always have the index of the field when we get a Union value.
    • The intuitive and user-friendly choice is to use the logical type name.
    • One possible counter-example: decimal (the physical type is a named type)
      • [{"type":"fixed","name":"Decimal128","size":16,"logicalType":"decimal","precision":38,"scale":2}, {"type":"fixed","name":"Decimal256","size":32,"logicalType":"decimal","precision":50,"scale":2}]
      • If the union contains only one decimal, the name decimal is still fine; but if it contains two, as in the example above (which is a valid schema), decimal can no longer identify a single variant.
      • Java allows it (correctly), Rust rejects it.
    • If we allow the schema in question 1, we might also need to account for it here.

Therefore, we decided to ban logical types in unions for now. If you have such a use case, we'd like to hear what you think!
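
To make the naming question concrete, here is a minimal Python sketch that derives candidate struct field names for a union under both strategies and reports collisions. The helpers `branch_name`, `field_names`, and `duplicates` are hypothetical names invented for illustration; they are not part of RisingWave or any Avro library, and the sketch only handles the simple branch shapes shown in this issue.

```python
from collections import Counter

# Avro named types are identified by their "name" rather than their "type".
NAMED_TYPES = {"record", "enum", "fixed"}


def branch_name(branch, prefer_logical):
    """Return a candidate struct field name for one union branch (hypothetical helper)."""
    if isinstance(branch, str):  # primitive given as a bare string, e.g. "int"
        return branch
    if prefer_logical and "logicalType" in branch:
        return branch["logicalType"]
    if branch["type"] in NAMED_TYPES:
        return branch["name"]
    return branch["type"]


def field_names(union, prefer_logical):
    """Candidate field names for every branch of a union schema."""
    return [branch_name(b, prefer_logical) for b in union]


def duplicates(names):
    """Names that occur more than once, i.e. ambiguous field names."""
    return sorted(n for n, c in Counter(names).items() if c > 1)


simple = ["double", {"type": "int", "logicalType": "date"}]
decimals = [
    {"type": "fixed", "name": "Decimal128", "size": 16,
     "logicalType": "decimal", "precision": 38, "scale": 2},
    {"type": "fixed", "name": "Decimal256", "size": 32,
     "logicalType": "decimal", "precision": 50, "scale": 2},
]
invalid = ["int", {"type": "int", "logicalType": "date"}]

print(field_names(simple, prefer_logical=True))     # ['double', 'date']
print(field_names(simple, prefer_logical=False))    # ['double', 'int']
print(duplicates(field_names(decimals, True)))      # ['decimal']
print(duplicates(field_names(invalid, False)))      # ['int']
```

Under these assumptions, logical names collide for the two-decimal union, while physical names collide for the schema from question 1, so neither strategy is collision-free on its own.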

Example of what a logicalType looks like

https://avro.apache.org/docs/1.11.1/specification/_print/#logical-types

schema: ["null", {"type":"string","logicalType":"uuid"}]

data: {"string": "67e55044-10b1-426f-9247-bb680e5fe0c8"} (The data is encoded exactly as it would be for the physical type)
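
As a rough illustration of why the data is tagged "string" rather than "uuid", the sketch below wraps a union value the way Avro's JSON encoding does: null is written as a JSON null, and any other branch becomes a single-key object keyed by the type's name (the user-specified name for record, enum, and fixed). `encode_union_json` is a hypothetical helper written for this example, not an Avro library function, and it does not encode the value itself.

```python
import json

NAMED_TYPES = {"record", "enum", "fixed"}


def encode_union_json(union_schema, index, value):
    """Wrap `value` for the union branch at `index` (hypothetical helper).

    The logicalType attribute is ignored on purpose: the tag is derived
    from the underlying physical type.
    """
    branch = union_schema[index]
    if branch == "null":
        return None
    if isinstance(branch, str):
        tag = branch
    elif branch["type"] in NAMED_TYPES:
        tag = branch["name"]
    else:
        tag = branch["type"]
    return {tag: value}


schema = ["null", {"type": "string", "logicalType": "uuid"}]
encoded = encode_union_json(schema, 1, "67e55044-10b1-426f-9247-bb680e5fe0c8")
print(json.dumps(encoded))  # {"string": "67e55044-10b1-426f-9247-bb680e5fe0c8"}
```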
