nodestream-proj / nodestream

A Declarative framework for Building, Maintaining, and Analyzing Graph Data
https://nodestream-proj.github.io/docs/
Apache License 2.0

[REQUEST] Record Schema Inference and Enforcement #37

Open zprobst opened 1 year ago

zprobst commented 1 year ago

Is your feature request related to a problem? Please describe.

When you aggregate data from many data sources contributed by many teams, it's possible for a schema to change underneath you. When this happens and you run with a TTL system, you may only notice the change once data starts to expire.

Describe the solution you'd like

Basic API

Introduce a Filter type that is capable of inferring and enforcing the record schema.

filters:
  - implementation: nodestream.filters:SchemaEnforcement
    arguments:
      mode: "ENFORCE" # One of ENFORCE, WARN, INFER
      storage:
        location: s3
        bucket: my-awesome-s3-schema-bucket
        key: schemas/pipelines/my-schema-for-this-cool-pipeline.json
      inference: # only used when mode is INFER
        sample_size: 10000
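
To make the three modes concrete, here is a rough sketch of what inference and enforcement could look like on plain dictionaries. This is not nodestream's actual API: `infer_schema`, `find_violations`, and `should_drop` are hypothetical names, the JSON-friendly schema shape is an assumption, and so is the reading that ENFORCE drops offending records while WARN only logs them. Loading and saving the schema from the `storage` location is left out.

```python
import itertools
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger(__name__)

# Hypothetical schema shape: field name -> {"types": [...], "required": bool}.
Schema = Dict[str, Dict[str, Any]]


def infer_schema(records: Iterable[Dict[str, Any]], sample_size: int = 10000) -> Schema:
    """Infer field types and required-ness from the first `sample_size` records."""
    types: Dict[str, set] = {}
    counts: Dict[str, int] = {}
    total = 0
    for record in itertools.islice(records, sample_size):
        total += 1
        for field, value in record.items():
            types.setdefault(field, set()).add(type(value).__name__)
            counts[field] = counts.get(field, 0) + 1
    # A field is treated as required only if every sampled record had it.
    return {
        field: {"types": sorted(types[field]), "required": counts[field] == total}
        for field in types
    }


def find_violations(record: Dict[str, Any], schema: Schema) -> List[str]:
    """List the ways a record deviates from the schema (empty if it conforms)."""
    problems = []
    for field, spec in schema.items():
        if field not in record:
            if spec["required"]:
                problems.append(f"missing field: {field}")
        elif type(record[field]).__name__ not in spec["types"]:
            problems.append(
                f"unexpected type for {field}: {type(record[field]).__name__}"
            )
    for field in record:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


def should_drop(record: Dict[str, Any], schema: Schema, mode: str = "ENFORCE") -> bool:
    """Assumed semantics: ENFORCE drops bad records, WARN logs and keeps them."""
    problems = find_violations(record, schema)
    if not problems:
        return False
    if mode == "ENFORCE":
        return True
    if mode == "WARN":
        logger.warning("Schema violations in record: %s", "; ".join(problems))
    return False


sample = [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "team": "x"}]
schema = infer_schema(sample, sample_size=10000)
```

Because the inferred schema is built from plain dicts, lists, and booleans, it could be serialized with `json.dumps` and written to the configured `key`; an actual Filter would wrap `should_drop` in whatever interface nodestream filters expose.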
Implementation Details

Describe alternatives you've considered

N/A

Additional context

N/A

angelosantos4 commented 2 months ago

After 1.0. This is a non-breaking change.