tabular-io / iceberg-kafka-connect

SMT for json parsing #214

Closed tabmatfournier closed 3 months ago

tabmatfournier commented 3 months ago

A best-effort attempt at turning unstructured JSON into structs.

If the JSON is extremely inconsistent but you still need to get it into Iceberg, set transform.json.root:true. This creates a single struct with one field named "payload" whose value is a Map<String, String>.
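To make the root mode concrete, here is a rough Python sketch of the resulting shape. It is not the SMT's code: the function name `to_payload_struct` is made up for illustration, and how non-string values are stringified (here via `json.dumps`) is an assumption.

```python
import json


def to_payload_struct(raw_json: str) -> dict:
    """Wrap a JSON object in a single 'payload' field of string -> string."""
    doc = json.loads(raw_json)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the root")
    payload = {}
    for key, value in doc.items():
        # Assumption: strings pass through, everything else is re-serialized as JSON text.
        payload[key] = value if isinstance(value, str) else json.dumps(value)
    return {"payload": payload}


record = to_payload_struct('{"id": 1, "tags": ["a", "b"], "user": {"name": "x"}}')
print(record)
```

Whatever the keys look like, the schema stays at one column, and the query engine parses the string values as needed.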

The default configuration (transform.json.root:false) creates a struct in which all first-level primitives (int, long, string, etc.) become typed fields. Nested objects become Map<String, String> fields to be parsed by the query engine. Arrays of primitives are typed properly, including nested arrays of primitives. Arrays of mixed types are converted to arrays of strings.

Empty nodes, empty arrays, and empty objects are stripped from the struct/schema.
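The default-mode rules above can be sketched in Python roughly as follows. This is an illustration of the described behavior only, not the SMT's code: `flatten_record` and its helpers are hypothetical names, "empty node" is assumed to mean JSON null, and stringification via `json.dumps` is an assumption.

```python
import json


def _is_empty(value):
    # Assumption: "empty node" means null; empty arrays and objects are also dropped.
    return value is None or value == [] or value == {}


def _stringify(value):
    return value if isinstance(value, str) else json.dumps(value)


def _convert_array(items):
    """Keep arrays of a single primitive type (or nested arrays of them); stringify the rest."""
    if all(isinstance(i, list) for i in items):
        return [_convert_array(i) for i in items]
    kinds = {type(i) for i in items}
    if len(kinds) == 1 and isinstance(items[0], (bool, int, float, str)):
        return items
    # Mixed types: fall back to an array of strings.
    return [_stringify(i) for i in items]


def flatten_record(raw_json: str) -> dict:
    doc = json.loads(raw_json)
    out = {}
    for key, value in doc.items():
        if _is_empty(value):
            continue  # stripped from the struct/schema
        if isinstance(value, dict):
            # Nested object -> map of string to string, left for the query engine to parse.
            out[key] = {k: _stringify(v) for k, v in value.items()}
        elif isinstance(value, list):
            out[key] = _convert_array(value)
        else:
            out[key] = value  # first-level primitive keeps its type
    return out


raw = ('{"id": 7, "user": {"age": 3, "tags": ["x"]}, "scores": [1, 2], '
       '"grid": [[1], [2, 3]], "mixed": [1, "b", true], "empty": {}, "none": null}')
print(flatten_record(raw))
```

Note that this sketch treats an array mixing ints and floats as mixed; the real SMT may well widen numeric types instead.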

Without this, the JSON schema inference in the connector infers nested objects as Structs. Inconsistent keys can then cause an explosion of schema evolutions and potentially hundreds to thousands of columns, depending on the JSON. This SMT avoids that by processing the JSON and defining a schema in which nested objects are Maps.
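A toy Python illustration of the column blow-up, assuming every distinct nested key becomes its own struct field under inference. The helper names and the sample records are invented for the example and do not come from the connector.

```python
def struct_columns(records):
    """Leaf columns if every nested object key is inferred as a struct field."""
    cols = set()

    def walk(prefix, obj):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(path, value)
            else:
                cols.add(path)

    for record in records:
        walk("", record)
    return cols


def map_columns(records):
    """Columns if nested objects are kept as a single map-typed field."""
    return {key for record in records for key in record}


# 100 records, each with a different key inside "attrs".
records = [{"id": i, "attrs": {f"k{i}": i}} for i in range(100)]
print(len(struct_columns(records)), len(map_columns(records)))
```

With struct inference each new key in "attrs" forces another schema evolution, while the map-based schema stays at two columns.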

randypitcherii commented 3 months ago

Hey folks -- no rush, but wanted to ask if this was close. Thank you!