twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Make projected parquet collection schema forward compatible with the given file schema #1921

Open mickjermsurawong-stripe opened 5 years ago

CLAassistant commented 5 years ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: mickjermsurawong-stripe
:x: joshrosen-stripe
You have signed the CLA already but the status is still pending? Let us recheck it.

mickjermsurawong-stripe commented 5 years ago

cc @moulimukherjee

mickjermsurawong-stripe commented 5 years ago

Hi @johnynek, We ran into a problem where a Scalding job fails to read parquet files produced by a Spark job. The cause is that the projected schema generated from the Thrift class uses a different list format than the one in our Spark job's output.

Just wanted to ask for a quick look at whether this is the right strategy, and whether this is something we would like to push upstream as well.
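For context, the incompatibility here is between the two LIST representations defined by the parquet-format spec. A sketch of the mismatch, using a hypothetical `Document` schema with a `tags` list field (field and group names are illustrative, not taken from the PR):

```
// Legacy 2-level list, as produced by a Thrift-derived projection schema:
message Document {
  required group tags (LIST) {
    repeated binary tags_tuple (UTF8);
  }
}

// Standard 3-level list, as written by Spark's parquet writer by default:
message Document {
  required group tags (LIST) {
    repeated group list {
      required binary element (UTF8);
    }
  }
}
```

A reader that projects with the 2-level form against a file written in the 3-level form (or vice versa) will not match columns as intended, which is why the projected schema needs to be made forward compatible with the file schema.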

johnynek commented 5 years ago

Also, you need to bump the .travis.yml JDK setting to openjdk8, since oraclejdk8 I think no longer works on Travis.
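The suggested change would look something like this in .travis.yml (the exact surrounding keys depend on the project's existing config):

```yaml
# .travis.yml: switch the JDK from Oracle JDK 8 to OpenJDK 8
jdk: openjdk8
```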

mickjermsurawong-stripe commented 5 years ago

Thanks @johnynek. Ah, I haven't done any tests internally on actual data yet; just unit tests for now. Will follow up with that.

I can port all of these to Scala, and will make the resolver stateless; yup, the object instantiation isn't necessary.

mickjermsurawong-stripe commented 5 years ago

hi @isnotinvain or @julienledem, I've addressed the first round of general feedback, and added support for map schema compat as well. Would appreciate your review here.

mickjermsurawong-stripe commented 5 years ago

hi @isnotinvain, thank you for the note. I will go through my colleagues' comments and improve the docs as you suggested. The test class is ParquetCollectionFormatForwardCompatibilityTests.scala; not sure if that's what you are looking for, but it demonstrates the format conversion.

mickjermsurawong-stripe commented 5 years ago

@joshrosen-stripe, @tchow-stripe Thanks for the really helpful review. I've simplified many parts of the code and addressed the PR feedback. The remaining issue is figuring out how to do end-to-end testing here. @isnotinvain this is ready for another round of review.

mickjermsurawong-stripe commented 5 years ago

Thanks @xton, addressed the feedback. @isnotinvain could you help take a look at this please?

mickjermsurawong-stripe commented 4 years ago

hi @isnotinvain, I'd like to follow up on this if you have more feedback here please.

mickjermsurawong-stripe commented 4 years ago

Hi @isnotinvain, another bump on this please.

Updates from our side: we have actually implemented this internally (via subclassing ParquetScroogeScheme), and it does cover our production use cases of reading Spark job output schemas.

Without this patch, a Scalding job reading Spark job output can either fail hard or silently read empty collections.