spotify / magnolify

A collection of Magnolia add-on modules
https://spotify.github.io/magnolify
Apache License 2.0

Parquet TODO #276

Open nevillelyh opened 3 years ago

nevillelyh commented 3 years ago

Turns out the new 3-level list is more complex.

With the default 2-level list, myField: List[T] is written as:

required group myField (LIST) {
  repeated T array;
}

But the Avro counterpart is still a plain array: "name": "myField", "type": "array", "items": T

With the 3-level list, the Parquet schema becomes:

required group myField (LIST) {
  repeated group list {
    required T element;
  }
}

And the Avro value becomes an array of single-field records: [{"element": t1}, {"element": t2}, ...]
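To make the difference concrete, here is a minimal Python sketch of the two layouts and of the extra wrapping that the 3-level form introduces on the Avro side. The helper names (`list_schema`, `to_three_level_records`, `from_three_level_records`) are hypothetical and exist only for illustration; they are not part of magnolify or parquet-avro, and the sketch only reproduces the shapes shown above.

```python
def list_schema(name: str, element_type: str, three_level: bool) -> str:
    """Render the Parquet schema text for a required list field (illustrative helper)."""
    if three_level:
        # 3-level layout: outer LIST group, repeated "list" group, "element" field.
        return (
            f"required group {name} (LIST) {{\n"
            f"  repeated group list {{\n"
            f"    required {element_type} element;\n"
            f"  }}\n"
            f"}}"
        )
    # 2-level layout: the repeated field sits directly inside the LIST group.
    return (
        f"required group {name} (LIST) {{\n"
        f"  repeated {element_type} array;\n"
        f"}}"
    )


def to_three_level_records(values: list) -> list:
    """Wrap each value the way it surfaces on the Avro side with the 3-level layout."""
    return [{"element": v} for v in values]


def from_three_level_records(records: list) -> list:
    """Undo the wrapping to get back the plain list of values."""
    return [r["element"] for r in records]


if __name__ == "__main__":
    print(list_schema("myField", "T", three_level=False))
    print(list_schema("myField", "T", three_level=True))
    print(to_three_level_records(["t1", "t2"]))
```

With the 2-level layout no wrapping is needed, so a converter would only have to apply `to_three_level_records` and `from_three_level_records` when the 3-level layout is in use.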

WIP in https://github.com/spotify/magnolify/tree/neville/pq-avro

nevillelyh commented 3 years ago

More on Avro array mapping. The following Avro fields

{"name": "field1", "type": {"type": "array", "items": "string"}, "default": []}  // required array field that defaults to empty array
{"name": "field2", "type": ["null", {"type": "array", "items": "string"}], "default": null}  // nullable array field that defaults to null

map to:

required group field1 (LIST) {
  repeated binary array (STRING);
}
optional group field2 (LIST) {
  repeated binary array (STRING);
}
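The mapping above can be sketched as a small function, assuming the 2-level layout and array fields only. The function name `avro_array_field_to_parquet` and the `PRIMITIVES` table are hypothetical; the table covers just a few Avro primitives, and the Avro "default" is ignored because it does not appear in the Parquet schema shown above.

```python
import json

# Avro primitive -> (Parquet physical type, logical annotation or None).
# Only a few common primitives are listed; this is not an exhaustive table.
PRIMITIVES = {
    "string": ("binary", "STRING"),
    "bytes": ("binary", None),
    "int": ("int32", None),
    "long": ("int64", None),
    "float": ("float", None),
    "double": ("double", None),
    "boolean": ("boolean", None),
}


def avro_array_field_to_parquet(field: dict) -> str:
    """Render the 2-level Parquet schema text for an Avro array field (illustrative)."""
    avro_type = field["type"]
    repetition = "required"
    if isinstance(avro_type, list):
        # A union is only handled when it is exactly "null" plus one other branch.
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1 or len(non_null) == len(avro_type):
            raise ValueError("only ['null', <array>] unions are supported")
        avro_type = non_null[0]
        repetition = "optional"
    if not isinstance(avro_type, dict) or avro_type.get("type") != "array":
        raise ValueError("expected an Avro array type")
    physical, annotation = PRIMITIVES[avro_type["items"]]
    suffix = f" ({annotation})" if annotation else ""
    return (
        f"{repetition} group {field['name']} (LIST) {{\n"
        f"  repeated {physical} array{suffix};\n"
        f"}}"
    )


if __name__ == "__main__":
    field1 = json.loads(
        '{"name": "field1", "type": {"type": "array", "items": "string"}, "default": []}'
    )
    field2 = json.loads(
        '{"name": "field2", "type": ["null", {"type": "array", "items": "string"}], "default": null}'
    )
    print(avro_array_field_to_parquet(field1))
    print(avro_array_field_to_parquet(field2))
```

The only difference between the two outputs is the repetition of the outer group: the nullable union becomes optional, while the array contents are rendered identically.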