infinyon / fluvio

Lean and mean distributed stream processing system written in rust and web assembly. Alternative to Kafka + Flink in one.
https://www.fluvio.io/
Apache License 2.0

[Feature] Schema management functionality on streams for InfinyOn Cloud and Fluvio Topics #3267

Open drc-infinyon opened 1 year ago

drc-infinyon commented 1 year ago

Related RFC: https://github.com/infinyon/fluvio/issues/3081

Summary

This product brief describes the need for schema management functionality. Members of our developer community have asked whether we support functionality similar to the Kafka Schema Registry. This document describes the problem space and the functionality needed to serve InfinyOn customers.

Opportunity

Schema is an essential input to implementing and maintaining data contracts and data quality. The majority of the data world operates on defined schemas and data models. The ability to implement a schema on topics will enable many different features, including time-window-based aggregation and materialized views, which rely on a tabular structure.

Target audience

Schema management will be relevant for InfinyOn Cloud developers as well as analysts who implement a schema configuration in their data flows.

Customer Insights

Among our current user feedback, we have an IoT company who described their need for schemas.

They receive data from sensors that are made and deployed by different vendors. The sensors send similar payloads, but with differences in the attribute names and in the metric systems of their dimensions. These differences need to be reconciled during cleanup. Below is 5 minutes of the customer describing the use case.
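The reconciliation step described above can be sketched as a per-vendor mapping onto one canonical shape. This is only an illustration: the vendor names, field names, and the Fahrenheit-to-Celsius conversion are hypothetical and do not come from the customer's actual payloads.

```python
# Hypothetical per-vendor mappings: vendor field -> (canonical name, converter).
VENDOR_MAPPINGS = {
    "vendor_a": {
        "temp_f": ("temperature_c", lambda f: (f - 32) * 5 / 9),
        "dev": ("device_id", str),
    },
    "vendor_b": {
        "temperatureC": ("temperature_c", float),
        "deviceId": ("device_id", str),
    },
}


def normalize(vendor, payload):
    """Rename attributes and convert units so every vendor yields the same shape."""
    mapping = VENDOR_MAPPINGS[vendor]
    record = {}
    for field, value in payload.items():
        if field not in mapping:
            continue  # drop attributes the canonical shape does not know about
        name, convert = mapping[field]
        record[name] = convert(value)
    return record
```

With a schema attached to the topic, a mapping like this could live in one place instead of being repeated in every consumer.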

Another consumption pattern was shared by a SaaS company developing usage-based billing, which receives consumption data from its users and provides them with billing and invoicing capabilities.

Experience

Currently, users may have a wide range of experiences with the schema given that schema is handled differently in different systems like databases or streaming tools like Kafka.

As we consider what schema management would look like for the InfinyOn Cloud user, we need to be informed by the data sources, the payload, and the consumption patterns.

For instance, if we are looking at semi-structured data from web pages, RSS feeds, or clickstreams, we would expect XML or JSON inputs. As we consider the consumption patterns and the serialization and deserialization requirements, we have come across customers and prospects who use Avro or Protobuf for serialization, with the data stored in a flavour of Parquet like Hudi or Iceberg, or in other optimized columnar formats like Arrow.

The schema provides the ability to model semi-structured data in a tabular model, which makes it possible to perform aggregations, create derived columns, and model the data for analytical workflows.
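As a rough illustration of what a tabular model buys, the sketch below flattens nested JSON-like records into single-level rows and then aggregates over a column. The event shape and the dotted column naming are assumptions made for the example, not part of the proposal.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Turn nested dicts into a single-level row with dotted column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


def sum_by(rows, group_column, value_column):
    """Sum value_column for each distinct value of group_column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_column]] += row[value_column]
    return dict(totals)


# Hypothetical usage events, similar in spirit to the billing use case.
events = [
    {"user": {"id": "u1"}, "usage": {"bytes": 100}},
    {"user": {"id": "u2"}, "usage": {"bytes": 50}},
    {"user": {"id": "u1"}, "usage": {"bytes": 25}},
]
rows = [flatten(event) for event in events]
totals = sum_by(rows, "user.id", "usage.bytes")
```

Once every record has the same columns, aggregations and derived columns become simple operations over rows rather than per-payload parsing logic.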

For InfinyOn customers, we need to enable schema management on the data collected from the edge, both to generate alerts on schema changes or payload issues at the source and to support dynamic computation in smart modules based on attribute values.
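One way to think about alerting on schema change is to compare the shape of an incoming payload with the shape that is expected. The sketch below does this for top-level attributes only; the function names and the use of Python type names as the schema vocabulary are assumptions for illustration.

```python
def infer_schema(payload):
    """Map each top-level attribute to the name of its Python type."""
    return {name: type(value).__name__ for name, value in payload.items()}


def schema_changes(expected, payload):
    """List the differences between the expected schema and one payload."""
    observed = infer_schema(payload)
    changes = []
    for name in sorted(expected.keys() - observed.keys()):
        changes.append(f"missing attribute: {name}")
    for name in sorted(observed.keys() - expected.keys()):
        changes.append(f"unexpected attribute: {name}")
    for name in sorted(expected.keys() & observed.keys()):
        if expected[name] != observed[name]:
            changes.append(
                f"type changed: {name} {expected[name]} -> {observed[name]}"
            )
    return changes


# Hypothetical expected shape for a sensor reading.
expected = {"sensor_id": "str", "reading": "float"}
```

An empty list means the payload matches; a non-empty list is what an alert would carry.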

Acceptance Criteria

Competitive Insights

  1. Confluent Schema Registry: https://docs.confluent.io/platform/current/schema-registry/index.html
  2. Slalom schema registry introduction: https://medium.com/slalom-technology/introduction-to-schema-registry-in-kafka-915ccf06b902
  3. Confluent Schema Registry 101, Avro, JSON: https://youtu.be/ovIsHhIrie8

Interface

Configuration

Schema configuration example applied to topic:

*schema-config.yaml*
meta:
  name: column-schema-1
  version: 1.0.0 # semver expected
  # schema-provider names a smart module conforming to a smart module schema interface
  schema-provider: infinyon/avro-schema@0.1.0 # alternatives include column, protobuf, parquet, arrow

# spec is a user-defined custom specification string; the schema does not parse the spec,
# it is passed to the schema smart module as an opaque string
spec: |
    - name: fruit_id
      key: true
      type: integer
    - name: fruit_name
      type: string
    - name: fruit_color
      type: string
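To make the role of the spec concrete, here is a minimal sketch of how a column-style schema provider might check a record against the spec above. It assumes the spec string has already been parsed from YAML into a list of dicts, that only key columns are required, and that `integer` and `string` map to Python's `int` and `str`. None of this is defined by the proposal.

```python
# Column spec from schema-config.yaml, assumed to be already parsed from YAML.
COLUMNS = [
    {"name": "fruit_id", "key": True, "type": "integer"},
    {"name": "fruit_name", "type": "string"},
    {"name": "fruit_color", "type": "string"},
]

# Assumed mapping from spec type names to Python types.
TYPES = {"integer": int, "string": str}


def validate(record, columns=COLUMNS):
    """Return a list of problems found in record; an empty list means valid."""
    errors = []
    for column in columns:
        name = column["name"]
        if name not in record:
            # Only key columns are treated as required in this sketch.
            if column.get("key"):
                errors.append(f"missing key column: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPES[column["type"]]):
            errors.append(f"{name}: expected {column['type']}")
    return errors
```

For example, `validate({"fruit_name": 3})` reports both the missing key column and the wrong type for `fruit_name`.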

CLI

CLI Commands concept

fluvio schema create

fluvio schema list

fluvio schema describe SCHEMA_NAME[@VERSION]

fluvio schema apply SCHEMA_NAME TOPIC_NAME

fluvio schema remove SCHEMA_NAME TOPIC_NAME

fluvio schema delete

fluvio schema disable SCHEMA_NAME@VERSION

fluvio schema create --config schema-config.yaml
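The `SCHEMA_NAME[@VERSION]` argument in the commands above implies a small parsing rule. A possible reading, sketched here with a hypothetical helper, is that the version is optional and, when present, follows a single `@`:

```python
def parse_schema_ref(ref):
    """Split 'name' or 'name@version' into (name, version or None)."""
    name, sep, version = ref.partition("@")
    # Reject an empty name, or an '@' that is not followed by a version.
    if not name or (sep and not version):
        raise ValueError(f"invalid schema reference: {ref!r}")
    return name, (version or None)
```

Under this reading, `column-schema-1` refers to the schema without pinning a version, while `column-schema-1@1.0.0` pins one. How an unpinned reference resolves (for example, to the latest version) is left open by the brief.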
ajhunyady commented 1 year ago

@drc-infinyon, as per our conversation, the schema should be applied at the topic level. Do you have the notes or a picture from the whiteboard session?

fluvio topic create <name> --config <config with schema definition>
fluvio topic apply <name> --config <config with schema definition>