Closed MafcoCinco closed 7 years ago
Anyone had a chance to look at this?
@chrisbetz any thoughts on this?
Sorry, I'm really ashamed of not answering. Hard times with many projects at my day job. :( Sorry!!!
I decided to merge the "other" Parquet-Support-Merge Request (https://github.com/gorillalabs/sparkling/pull/58). Please don't be offended, maybe you need some of that?
Cheers, Chris :)
No worries @chrisbetz! I'm just happy that Parquet has been added as a serialization format! Thanks for taking a look.
Phhuuuhhh. Glad to hear that. Happy hacking :)
Added basic support for the `SparkSession` API. The main thing I needed was support for Parquet serialization and running SQL queries against Parquet-serialized data. I had some trouble getting the map and reduce functions to work on `DataFrame`, but that was a secondary concern for me at this point. I would like to go back and add that support at some point for completeness.

As a workaround, I included functions to convert back and forth between a `DataFrame` and an RDD of native Clojure types (needed to convert from a `Row` object to a sequence or map). This allowed me to use the existing `sparkling.core` functions on the data, though I'm sure there is added overhead from performing the conversion.

Please have a look. I would love to hear any and all suggestions on improving the interface or implementation. I'm fairly new to Clojure and sure that some of this is not completely idiomatic. This is also the first time I have used the Java interop, so I'm sure there is room for improvement there as well.
I will be using this in a hack-week project at work next week, so we can see how it works at scale. So far it seems to work well. One thing I see as a potential source of user error is creating the schema when converting back from an RDD to a `DataFrame`. Spark is pretty picky about the typing (I got an error that it could not serialize an `int` as `LongType`), which was annoying. There might be some room for improvement there.
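For anyone hitting the same error, here is a sketch of the pitfall using Spark's Java schema API directly via interop. `LongType` expects a Java `long`, so a Clojure int must be coerced explicitly before building the `Row` (the schema and field name here are illustrative):

```clojure
(import '[org.apache.spark.sql.types DataTypes]
        '[org.apache.spark.sql RowFactory])

;; A one-column schema declaring the "id" field as LongType.
(def schema
  (DataTypes/createStructType
    [(DataTypes/createStructField "id" DataTypes/LongType false)]))

;; Fails when Spark validates the row against the schema: the value is
;; boxed as an Integer, but the schema says LongType.
;; (RowFactory/create (object-array [(int 42)]))

;; Works: coerce to long explicitly before constructing the Row.
(RowFactory/create (object-array [(long 42)]))
```

A conversion helper could guard against this by coercing numeric values according to the declared field types before constructing each `Row`, rather than trusting whatever boxed type the Clojure data happens to carry.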