saurfang / sparksql-protobuf

Read SparkSQL parquet file as RDD[Protobuf]
http://spark-packages.org/package/saurfang/sparksql-protobuf
Apache License 2.0

"Loose" Schema Mode #4

Open michaelmoss opened 7 years ago

michaelmoss commented 7 years ago

Hi, this is a great, useful library.

I tested this library's compatibility with various schema-evolution scenarios: I changed the protobuf definition (added fields, renamed fields, changed optional -> repeated, etc.), passed it to the ProtoParquetWriter, and then tried to read the data back using ProtoParquetRDD. I found that many 'legal' protobuf evolution rules, such as field renames or type changes, are not compatible with this library.
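For reference, the round trip I'm describing looks roughly like this (a minimal sketch; `Person`/`PersonV2` are hypothetical generated protobuf classes, `sc` is an existing SparkContext, and the path is made up — the writer is parquet-protobuf's `ProtoParquetWriter` and the reader is this library's `ProtoParquetRDD`):

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.proto.ProtoParquetWriter
import com.github.saurfang.parquet.proto.spark.ProtoParquetRDD

// Write a few records using the *old* protobuf definition.
val writer = new ProtoParquetWriter[Person](new Path("/tmp/people.parquet"), classOf[Person])
writer.write(Person.newBuilder().setName("alice").build())
writer.close()

// Read them back with the *evolved* definition (e.g. PersonV2 renames a
// field). This is the step where the exceptions below are thrown.
val rdd = new ProtoParquetRDD(sc, "/tmp/people.parquet", classOf[PersonV2])
rdd.collect()
```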

Now, I don't want to conflate this with Parquet's own schema evolution rules, or with Spark's schema-merging capabilities (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging). But overall it seems technically feasible to have a 'loose' mode in which any field that exists in the Parquet file and has a compatible protobuf type hydrates the protobuf, while mismatched fields are skipped instead of making the whole read fail hard and fast.
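One possible shape for such a loose mode, using parquet-mr's schema API, would be to project the requested (protobuf-derived) schema onto only the fields actually present in the file, so missing or renamed fields are dropped from the read schema rather than triggering an exception. A rough, untested sketch (`looseProjection` is a hypothetical helper, not part of this library; nested groups would need recursive handling):

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.schema.{MessageType, Types}

// Keep only the requested fields that also exist in the file's schema,
// so the read never asks Parquet for a column it cannot find.
def looseProjection(fileSchema: MessageType, requested: MessageType): MessageType = {
  val builder = Types.buildMessage()
  requested.getFields.asScala
    .filter(field => fileSchema.containsField(field.getName))
    .foreach(field => builder.addField(field))
  builder.named(requested.getName)
}
```

Fields absent from the projection would then have to be left at their protobuf defaults when hydrating the message, which matches how protobuf itself treats unknown or missing fields.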

Any thoughts on this?

Examples:

- Adding a new enum value: `Caused by: org.apache.parquet.io.InvalidRecordException: Illegal enum value`
- Adding a new field: `org.apache.parquet.schema.IncompatibleSchemaModificationException: Cant find "timeout" Scheme mismatch`

IceMan81 commented 7 years ago

@michaelmoss I'm looking to use this library - did you use this with proto2 or proto3?

michaelmoss commented 7 years ago

@IceMan81, proto2 (protobuf 2.6) - I think the parquet project may have only recently added support for proto3 itself.