Hi, this is a great, useful library.
I tested this library's compatibility with various 'schema evolution' scenarios: I changed the protobuf definition (added fields, renamed fields, changed optional->repeated, etc.), passed the messages to the ProtoParquetWriter, and tried to read them back using ProtoParquetRDD. I found that many 'legal' protobuf evolution rules, like field renames or type changes, were not compatible with this library.
Now, I don't want to conflate this with parquet's own schema evolution rules or spark's 'schema merging' capabilities (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), but overall it seems technically feasible to have a 'loose' mode where fields that exist in parquet and have a compatible protobuf type hydrate the protobuf, and everything else is skipped, instead of the current behavior where any schema mismatch makes the process fail hard and fast.
Any thoughts on this?
Examples:
Adding new Enum:
Caused by: org.apache.parquet.io.InvalidRecordException: Illegal enum value
Adding new field:
org.apache.parquet.schema.IncompatibleSchemaModificationException: Cant find "timeout" Scheme mismatch
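To make the 'loose' mode idea a bit more concrete, here is a rough sketch of the behavior I have in mind. It is plain Python over dicts, not this library's API or the protobuf API: `hydrate_loose`, `TARGET_SCHEMA`, `TARGET_ENUMS` and the field names other than "timeout" are all made up for illustration, and the "UNKNOWN" enum fallback is just one possible policy. Fields with a compatible type are copied, missing or incompatible fields fall back to a default, and extra stored fields are ignored.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical target schema: field name -> (expected Python type, default).
# "timeout" stands in for a field that was added after the data was written.
TARGET_SCHEMA: Dict[str, Tuple[type, Any]] = {
    "name": (str, ""),
    "retries": (int, 0),
    "timeout": (int, 0),
}

# Hypothetical enum fields: field name -> (known values, fallback value).
TARGET_ENUMS: Dict[str, Tuple[Set[str], str]] = {
    "status": ({"ACTIVE", "DISABLED"}, "UNKNOWN"),
}


def hydrate_loose(
    record: Dict[str, Any],
    schema: Dict[str, Tuple[type, Any]] = TARGET_SCHEMA,
    enums: Dict[str, Tuple[Set[str], str]] = TARGET_ENUMS,
) -> Dict[str, Any]:
    """Build a message dict from a stored record without failing on mismatches."""
    message: Dict[str, Any] = {}
    for field, (expected_type, default) in schema.items():
        value = record.get(field)
        # Copy the stored value only when its type is compatible;
        # a missing or incompatible field gets the default instead of an error.
        message[field] = value if isinstance(value, expected_type) else default
    for field, (known, fallback) in enums.items():
        value = record.get(field)
        # An enum value the reader does not know maps to the fallback.
        message[field] = value if value in known else fallback
    # Stored fields that are absent from the target schema are simply ignored.
    return message


# A record written with a different schema version: "retries" has the wrong
# type, "status" has an unknown value, "timeout" is missing, "legacy_flag" is extra.
stored = {"name": "job-1", "retries": "3", "status": "PAUSED", "legacy_flag": True}
result = hydrate_loose(stored)
```

With something like this, both failures above would turn into defaults rather than exceptions. Whether silently defaulting is acceptable is exactly why I'd want it to be an opt-in mode rather than the default behavior.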