Conceptual pull request

markncooper commented 7 years ago

This pull request shows off various hacks I made to Spark to correctly write data into BigQuery using Avro as the import format. I had to change the Avro format so that:

Objects of TimestampType are encoded as micros (to match BigQuery)
Objects of DateType are encoded as strings (this is a lesser-used type that wasn't previously supported)
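The two encodings listed above can be sketched in a few lines. This is only an illustration in plain Python, not the Scala code in this pull request. The helper names `timestamp_to_micros` and `date_to_string` are hypothetical, and treating naive timestamps as UTC is an assumption.

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_micros(ts: datetime) -> int:
    """Encode a timestamp as microseconds since the Unix epoch."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    delta = ts - EPOCH
    # Integer arithmetic avoids the rounding error of float seconds.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def date_to_string(d: date) -> str:
    """Encode a date as an ISO 8601 string (YYYY-MM-DD)."""
    return d.isoformat()
```

For example, `timestamp_to_micros(datetime(1970, 1, 1, 0, 0, 1))` gives `1000000`, and `date_to_string(date(2017, 3, 9))` gives `"2017-03-09"`.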

I also pull in code from Appsflyer to build the JSON schema that is used to tell BigQuery how to handle some of the types that are not natively supported by Avro.
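As a rough picture of what such a JSON schema looks like, here is a minimal sketch in plain Python. It is not the Appsflyer code. The type mapping is an assumed subset, the type names are written as strings for illustration, and `bigquery_schema` is a hypothetical helper. The `name`/`type`/`mode` keys follow BigQuery's JSON schema format.

```python
import json

# Assumed, partial mapping from Spark SQL type names to BigQuery column types.
SPARK_TO_BIGQUERY = {
    "StringType": "STRING",
    "IntegerType": "INTEGER",
    "LongType": "INTEGER",
    "DoubleType": "FLOAT",
    "BooleanType": "BOOLEAN",
    "TimestampType": "TIMESTAMP",
    "DateType": "DATE",
}


def bigquery_schema(fields):
    """Build a BigQuery schema from (name, spark_type_name, nullable) tuples."""
    return [
        {
            "name": name,
            "type": SPARK_TO_BIGQUERY[spark_type],
            "mode": "NULLABLE" if nullable else "REQUIRED",
        }
        for name, spark_type, nullable in fields
    ]


# Hypothetical columns: a required timestamp and an optional date.
schema_json = json.dumps(
    bigquery_schema([
        ("created_at", "TimestampType", False),
        ("birthday", "DateType", True),
    ])
)
```

With a schema like this, a long column holding microseconds can be declared as TIMESTAMP and a string column holding dates can be declared as DATE, which Avro's own types would not convey.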

richwhitjr commented 7 years ago

It's a shame that we have to copy spark-avro completely to get this to work. Would it be reasonable to upstream those two changes to the original project and pull it in as a dependency? The microsecond change in particular, though, seems like it would break a lot of existing code if done in that project. That being said, looking at the current structure of the Avro project, I don't see a good way to extend it so that only the conversion of those two types changes.

Another option, which also isn't great, is to capture the Row prior to write and convert those two fields to microseconds or a string. The schema for BigQuery could be generated prior to this conversion. This doesn't feel like the correct solution either, though.
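To make the ordering in this alternative concrete, here is a minimal sketch in plain Python, with dicts standing in for Spark Rows. `plan_write` is a hypothetical helper, the type names are illustrative strings, and timestamps are encoded as microseconds to match the encoding described in the pull request. The point is only that the original types are recorded first, so the BigQuery schema can still say TIMESTAMP and DATE after the values become integers and strings.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _convert(spark_type, value):
    """Rewrite only timestamp and date values; pass everything else through."""
    if value is None:
        return None
    if spark_type == "TimestampType":
        if value.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            value = value.replace(tzinfo=timezone.utc)
        delta = value - EPOCH
        return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    if spark_type == "DateType":
        return value.isoformat()
    return value


def plan_write(fields, rows):
    """fields: (name, spark_type_name) pairs; rows: list of dicts."""
    # Step 1: record the original types before any value is rewritten.
    original_types = dict(fields)
    # Step 2: rewrite the two problem types in every row.
    converted = [
        {name: _convert(original_types[name], value) for name, value in row.items()}
        for row in rows
    ]
    return original_types, converted
```

The returned `original_types` would feed schema generation, while `converted` is what gets written with an unmodified Avro writer.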

markncooper commented 7 years ago

Right, I agree it's sort of a tricky situation. I'll take a look and see if I can come up with a better route - maybe I could push some flags upstream into the Avro library.

spotify / spark-bigquery

Conceptual pull request #35