twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Fix NullPointerException when reading parquet using scrooge thrift that was not written with metadata #1888

Closed: moulimukherjee closed this 5 years ago

moulimukherjee commented 5 years ago

Getting the descriptor from the thriftClass if metadata is not present.

At Stripe we have both Spark and Scalding jobs. When using Scalding to read parquet written by a Spark job, it throws a NullPointerException, since Spark does not write the thrift metadata. This PR addresses that.

java.lang.NullPointerException at org.apache.parquet.hadoop.thrift.ThriftReadSupport.prepareForRead(ThriftReadSupport.java:246)
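For context, here is a minimal sketch of the idea, not the actual diff: it assumes parquet-scrooge's ScroogeStructConverter can build the descriptor from the scrooge read class, and the helper name descriptorFor is made up for illustration.

```scala
import com.twitter.scrooge.ThriftStruct
import org.apache.parquet.scrooge.ScroogeStructConverter
import org.apache.parquet.thrift.ThriftMetaData
import org.apache.parquet.thrift.struct.ThriftType.StructType

// Prefer the write-time descriptor stored in the parquet footer; if the file
// was written without thrift metadata (e.g. by Spark), fall back to deriving
// the descriptor from the scrooge class we are reading with.
def descriptorFor(
    extraMetaData: java.util.Map[String, String],
    thriftClass: Class[_ <: ThriftStruct]): StructType =
  Option(ThriftMetaData.fromExtraMetaData(extraMetaData)) match {
    case Some(md) => md.getDescriptor // file was written by parquet-thrift
    case None =>
      // No metadata: assume the read-time schema matches the write-time one.
      new ScroogeStructConverter().convert(thriftClass)
  }
```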

r? @ianoc @johnynek

johnynek commented 5 years ago

Can we upload a file that shows the bug, and show that this fixes it?

We may be able to do it in a test job: use the parquet tuple support to write a compatible schema, then read it back using scrooge and see that it doesn't fail.
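Roughly along these lines, perhaps. This is only a sketch: TestStruct is a placeholder for a scrooge-generated struct, the write half via the tuple support is elided, and the exact source constructors may differ.

```scala
import com.twitter.scalding._
import com.twitter.scalding.parquet.scrooge.FixedPathParquetScrooge

// Read-back half of the proposed round-trip test. The write half would use
// the parquet tuple support to produce a file with a compatible schema but
// no thrift metadata, mimicking a non-thrift writer like Spark.
// `TestStruct` is a hypothetical scrooge-generated struct.
class ReadBackWithScrooge(args: Args) extends Job(args) {
  TypedPipe
    .from(new FixedPathParquetScrooge[TestStruct](args("input")))
    .map(_.toString) // without the fix, this read NPEs in prepareForRead
    .write(TypedTsv[String](args("output")))
}
```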

isnotinvain commented 5 years ago

It's been a while since I looked at this, but generally you can't use the parquet-thrift integration to read a parquet file unless it was written with parquet-thrift. The metadata that's missing is important: it's how we reconcile schema evolution. When that metadata is present, parquet can support all the schema evolutions that thrift does (renaming enums, changing optionality of fields, renaming fields, etc.).

Safely reading parquet data unfortunately always involves 3 schemas:

1) The write-time schema (what schema was used to write the data)
2) The read-time schema (what schema is being used to read the data)
3) The projected read-time schema (for column projections, which needs to internally turn into a projected write-time schema)

I think if you don't find the write-time schema, and just assume it's the read-time schema, then it'll work so long as you never evolve your thrift schema, including changing field names or enum values.
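To make the failure mode concrete, here is an illustrative field rename (hypothetical structs, not from this PR):

```scala
// Writer used v1 of a struct; reader uses v2, where the field kept thrift
// id 1 but was renamed:
//
//   v1: struct User { 1: optional string user_name }
//   v2: struct User { 1: optional string username }
//
// With write-time metadata present, parquet-thrift can match the fields by
// thrift id, so the rename is safe. If we instead assume the read-time
// schema, the reader looks for a parquet column named "username", the file
// only has "user_name", and a value that was written comes back as None,
// silently, rather than failing loudly.
```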

moulimukherjee commented 5 years ago

@johnynek Ack, I'll add that. Can you point me to code which does something similar?

@isnotinvain In our use cases the read-time schema is exactly the same; the data just happens to be written by a Spark job. If the schema has changed, it's acceptable for the read to fail. If there's no change, though, falling back to the read-time schema would address this.

moulimukherjee commented 5 years ago

Added tests. Test failures without the fix in ScroogeReadSupport: https://travis-ci.org/twitter/scalding/builds/464009398?utm_source=github_status&utm_medium=notification

johnynek commented 5 years ago

@moulimukherjee nice work!

@dieu @ttim any concerns?

isnotinvain commented 5 years ago

Might be a good idea to log that no thrift metadata was found and that we're falling back on the read-time schema. I don't love the idea of these sort of "best effort" attempts to read the data; it leads to a lot of subtle bugs when the schema changes (and it almost always winds up changing). It's not guaranteed to fail in those cases; sometimes you just get surprising results, like populated fields coming back as None due to field renames.

Maybe we could put this behavior behind a flag / setting. I don't want to block you from getting something that works, but I do think we should worry about correctness. Strictly speaking, assuming the read schema is OK to use in place of the write schema is not correct.
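Something like this sketch, say; the config key and object name are invented for illustration, not an existing scalding or parquet property:

```scala
import org.apache.hadoop.conf.Configuration
import org.slf4j.LoggerFactory

object ScroogeReadFallback {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical key (not an existing property): opt in to the fallback.
  val AssumeReadSchemaKey = "parquet.scrooge.assume-read-schema-on-missing-metadata"

  def shouldFallBack(conf: Configuration): Boolean = {
    val enabled = conf.getBoolean(AssumeReadSchemaKey, false)
    if (enabled)
      log.warn(
        "No thrift metadata found in parquet footer; assuming the read-time " +
          "schema matches the write-time schema. Schema evolution (renames, " +
          "optionality changes, etc.) will NOT be handled for this read.")
    enabled
  }
}
```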

ianoc commented 5 years ago

@isnotinvain are you sure this isn't just a bug in the scrooge one? It NPEs now if the metadata isn't present. The code is designed, if I'm reading it right, to do the fallback for the Java one. So I think the Java one already falls back.

isnotinvain commented 5 years ago

@ianoc looks like you're right, I haven't looked at this stuff in a long time.

https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java#L239

Matching that behavior seems OK to me. It does mean you (quietly) lose all the schema evolution support, which is not the best, but I guess we've already decided that's OK elsewhere.

As an unrelated aside, it'd be a good idea to get the writer to write the metadata, that way you get the schema evolution support.

isnotinvain commented 5 years ago

Actually, do we know why the null case isn't being handled? This extends ThriftReadSupport, which seems to handle it in the link I provided above.

ianoc commented 5 years ago

I'm not on a laptop to check for sure, but I think the problem is that a bunch of that is TBase-specific: it checks whether the class is a subtype of TBase.

We almost never want to rely solely on field numbers for backward compatibility, since things like Spark SQL/Presto or similar won't follow that. So I don't think it's too big of an issue for most users as a result.

The big sources of this come up when you write your parquet from Presto/Spark/another parquet writer: you can't simply read it via the thrift sources if you specify thrift as your schema.

moulimukherjee commented 5 years ago

Yup, ThriftMetaData.fromThriftClass is TBase-specific: https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftMetaData.java#L108

It returns null in scrooge's case.
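For anyone following along, a rough paraphrase of that check (not the verbatim parquet-mr code):

```scala
import org.apache.thrift.TBase

// Only real thrift (TBase) classes get a descriptor derived from the class;
// scrooge-generated structs don't extend TBase, so the metadata comes back
// null and the caller eventually dereferences it: the NPE in this issue.
def isTBase(thriftClass: Class[_]): Boolean =
  classOf[TBase[_, _]].isAssignableFrom(thriftClass)
```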

isnotinvain commented 5 years ago

OK, sounds good to me!