Closed moulimukherjee closed 5 years ago
can we upload a file that shows the bug and show that this fixes it?
We may be able to do it in a job with a test using the parquet-tuple support to write a compatible schema, then read it back using scrooge and see that it doesn't fail.
It's been a while since I looked at this, but, generally, you can't use the parquet-thrift integration to read a parquet file unless it was written with parquet-thrift. That metadata that's missing is important, it's how we reconcile schema evolution. When that metadata is present, parquet can support all the schema evolutions that thrift does (renaming enums, changing optionality of fields, renaming fields, etc).
Safely reading parquet data unfortunately always involves 3 schemas:
1) The write-time schema (what schema was used to write the data)
2) The read-time schema (what schema is being used to read the data)
3) The projected read-time schema (for column projections, which needs to internally turn into a projected write-time schema)
I think if you don't find the write-time schema, and just assume it's the read-time schema, then it'll work so long as you never evolve your thrift schema, including changing field names or enum values.
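To make the evolution caveat above concrete, here is a minimal, hypothetical sketch (not parquet-mr API; all names are made up): thrift reconciles fields by id, so a field rename is only safe to read through if you still know the writer's schema. If you assume the read-time schema is the write-time schema, a renamed field silently comes back empty.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaEvolutionDemo {
    public static void main(String[] args) {
        // Write-time schema: field id -> name the writer used.
        Map<Integer, String> writeSchema = new HashMap<>();
        writeSchema.put(1, "user_name");

        // Read-time schema: field 1 has since been renamed.
        Map<Integer, String> readSchema = new HashMap<>();
        readSchema.put(1, "username");

        // Data on disk is keyed by the writer's column name.
        Map<String, String> row = new HashMap<>();
        row.put("user_name", "alice");

        // With the write-time schema we can map id 1 -> "user_name".
        String withMetadata = row.get(writeSchema.get(1));
        // Without it, we look up the read-time name "username" and get nothing.
        String withoutMetadata = row.get(readSchema.get(1));

        System.out.println(withMetadata);     // alice
        System.out.println(withoutMetadata);  // null
    }
}
```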
@johnynek Ack, I'll add that. Can you point me to code which does something similar?
@isnotinvain In our use cases, the schema at read time is exactly the same; the data is just written by a spark job. If the schema has changed, it's acceptable for the read to fail. If there's no change, though, falling back to the read-time schema would address that.
Added tests. Test failures without the fix in ScroogeReadSupport https://travis-ci.org/twitter/scalding/builds/464009398?utm_source=github_status&utm_medium=notification
@moulimukherjee nice work!
@dieu @ttim any concerns?
Might be a good idea to log that no thrift metadata was found and we're falling back on the read time schema. I don't love the idea of sort of "best effort attempts" to read the data, it leads to a lot of subtle bugs when the schema changes (and it almost always winds up changing). It's not guaranteed to fail in those cases, sometimes you just get surprising results (like fields that are populated are returned as None, due to field renames) or things like that.
Maybe we could put this behavior behind a flag / setting. I don't want to block you from getting something that works, but I do think we should worry about the correctness. Strictly speaking, assuming the read schema is OK to use in place of the write schema is not correct.
@isnotinvain are you sure this isn't just a bug in the Scrooge one? It throws an NPE now if the metadata isn't present. The code is designed, if I'm reading it right, to do the fallback for the java one. So I think the java one already falls back.
@ianoc looks like you're right, I haven't looked at this stuff in a long time.
Matching that behavior seems ok with me. It does mean you lose all the schema evolution support (quietly), which is not the best, but I guess we've already decided that's OK elsewhere.
As an unrelated aside, it'd be a good idea to get the writer to write the metadata, that way you get the schema evolution support.
Actually, do we know why the null case isn't being handled? This extends ThriftReadSupport, which seems to handle it in the link I provided above ^
I'm not on a laptop to check for sure, but I think the problem is that a bunch of that is TBase-specific. It checks whether the class is a subtype.
We almost never want to rely entirely on field numbers for backward compatibility, since things like Spark SQL/Presto or similar won't follow that. So I don't think it's too big of an issue for most users as a result.
Big sources of this come up if you write your parquet from Presto/Spark/other parquet writers: you can't read it simply via the thrift sources if you specify that as your schema.
Yup, ThriftMetaData.fromThriftClass is TBase-specific:
https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftMetaData.java#L108
That would return null in scrooge's case.
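A minimal sketch of why that happens (hypothetical, heavily simplified from parquet-mr's ThriftMetaData.fromThriftClass; the stand-in types below are made up): the helper only knows how to produce a descriptor for TBase subclasses, and scrooge-generated classes don't extend TBase, so it returns null and the caller later NPEs.

```java
public class FromThriftClassDemo {
    // Stand-in for org.apache.thrift.TBase.
    interface TBase {}
    // A java-thrift-generated class implements TBase...
    static class JavaThriftRecord implements TBase {}
    // ...but a scrooge-generated class does not.
    static class ScroogeRecord {}

    static String fromThriftClass(Class<?> clazz) {
        if (TBase.class.isAssignableFrom(clazz)) {
            return "descriptor for " + clazz.getSimpleName();
        }
        return null; // scrooge case: the caller dereferences this and NPEs
    }

    public static void main(String[] args) {
        System.out.println(fromThriftClass(JavaThriftRecord.class));
        System.out.println(fromThriftClass(ScroogeRecord.class));
    }
}
```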
OK, sounds good to me!
Get the descriptor from the thrift class if the metadata is not present.
At Stripe we have both spark and scalding jobs. When using scalding to read parquet written by a spark job, it throws a NullPointerException, since spark does not write the thrift metadata. This PR addresses that.
java.lang.NullPointerException
  at org.apache.parquet.hadoop.thrift.ThriftReadSupport.prepareForRead(ThriftReadSupport.java:246)
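The shape of the fix can be sketched like this (hypothetical simplification; the real change lives in ScroogeReadSupport's prepareForRead path, and the types and method below are made up for illustration): when the file footer carries no thrift metadata, derive the descriptor from the read-time thrift class instead of dereferencing the null metadata.

```java
public class PrepareForReadSketch {
    // Stand-in for the metadata parsed from the parquet file footer.
    static class ThriftMetaData {
        final String descriptor;
        ThriftMetaData(String descriptor) { this.descriptor = descriptor; }
    }

    // fileMetadata is null when the file was not written by parquet-thrift
    // (e.g. written by a spark job). Previously this path NPE'd.
    static String resolveDescriptor(ThriftMetaData fileMetadata, String readTimeDescriptor) {
        if (fileMetadata == null) {
            return readTimeDescriptor; // fall back to the read-time schema
        }
        return fileMetadata.descriptor;
    }

    public static void main(String[] args) {
        // Simulate a spark-written file: no thrift metadata in the footer.
        System.out.println(resolveDescriptor(null, "struct User { 1: string id }"));
    }
}
```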
r? @ianoc @johnynek