gorillalabs / sparkling

A Clojure library for Apache Spark: fast, fully-featured, and developer-friendly
https://gorillalabs.github.io/sparkling/
Eclipse Public License 1.0

Added basic support for SparkSession API #57

Closed MafcoCinco closed 7 years ago

MafcoCinco commented 7 years ago

Added basic support for the SparkSession API. The main thing I needed was support for Parquet serialization and for running SQL queries against Parquet-serialized data. I had some trouble getting the map and reduce functions to work on DataFrame, but that was a secondary concern for me at this point. I would like to go back and add that support at some point for completeness.

As a workaround, I included functions to convert back and forth between a DataFrame and an RDD with native Clojure types (each Row object needs to be converted to a sequence or map). This allowed me to use the existing sparkling.core functions on the data, but I'm sure the conversion adds some overhead.
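The Row-to-map conversion described above can be sketched without any Spark dependency. The block below is only an illustration of the idea, not the code from this pull request: `RowConversion`, `toMap`, and `toValues` are hypothetical names, and a plain `List<Object>` stands in for a Spark Row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a positional list of values stands in for a Spark Row.
final class RowConversion {
    private RowConversion() {}

    /** Pair each field name with the value at the same position, keeping field order. */
    static Map<String, Object> toMap(List<String> fieldNames, List<Object> values) {
        if (fieldNames.size() != values.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.size() + " values but got " + values.size());
        }
        Map<String, Object> result = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            result.put(fieldNames.get(i), values.get(i));
        }
        return result;
    }

    /** Rebuild the positional values from a map, in the order the field names dictate. */
    static List<Object> toValues(List<String> fieldNames, Map<String, Object> record) {
        List<Object> values = new ArrayList<>(fieldNames.size());
        for (String name : fieldNames) {
            if (!record.containsKey(name)) {
                throw new IllegalArgumentException("Missing field: " + name);
            }
            values.add(record.get(name));
        }
        return values;
    }
}
```

Every record passes through an extra allocation in each direction, which is one plausible source of the overhead mentioned above.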

Please have a look. I would love to hear any and all suggestions on improving the interface or implementation. I'm fairly new to Clojure, and I'm sure some of this is not completely idiomatic. This is also the first time I have used Java interop, so I'm sure there is room for improvement there as well.

I will be using this in a hack week project at work next week, so we can see how it works at scale. So far it seems to work well. One potential source of user error I see is creating the schema to convert back from an RDD to a DataFrame. Spark seems to be pretty picky about the typing (I got an error that it could not serialize an int as LongType), which was annoying. There might be some room for improvement there.
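The int versus LongType mismatch comes down to boxed JVM types: a `java.lang.Integer` is not a `java.lang.Long`, even when the number fits. Clojure integer literals are Longs, but values coming back through Java interop are often Integers. One way to guard against this is to coerce each value to the class the declared column type expects before building rows. The sketch below assumes that approach and is not part of this pull request: `ColumnType`, `SchemaCoercion`, `coerce`, and `coerceRow` are hypothetical stand-ins, not Spark classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a column type; not Spark's DataType classes.
enum ColumnType { INT, LONG, DOUBLE, STRING }

final class SchemaCoercion {
    private SchemaCoercion() {}

    /** Coerce one value to the boxed Java class the declared column type expects. */
    static Object coerce(Object value, ColumnType type) {
        if (value == null) {
            return null; // nullability is left to the schema
        }
        switch (type) {
            case INT:
                // Narrowing throws ArithmeticException instead of silently overflowing.
                return Math.toIntExact(asNumber(value).longValue());
            case LONG:
                return asNumber(value).longValue();
            case DOUBLE:
                return asNumber(value).doubleValue();
            case STRING:
                return value.toString();
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    /** Coerce a whole positional row against a schema of the same length. */
    static List<Object> coerceRow(List<Object> values, List<ColumnType> schema) {
        if (values.size() != schema.size()) {
            throw new IllegalArgumentException(
                "Row has " + values.size() + " values but schema has " + schema.size());
        }
        List<Object> coerced = new ArrayList<>(values.size());
        for (int i = 0; i < values.size(); i++) {
            coerced.add(coerce(values.get(i), schema.get(i)));
        }
        return coerced;
    }

    private static Number asNumber(Object value) {
        if (value instanceof Number) {
            return (Number) value;
        }
        throw new IllegalArgumentException(
            "Expected a number but got " + value.getClass().getName());
    }
}
```

With a step like this, an Integer paired with a LONG column is widened up front rather than rejected at serialization time, and a value that cannot be coerced fails early with a message naming the offending class.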

MafcoCinco commented 7 years ago

Anyone had a chance to look at this?

MafcoCinco commented 7 years ago

@chrisbetz any thoughts on this?

chrisbetz commented 7 years ago

Sorry, I'm really ashamed of not answering. Hard times with many projects at my day job. :( Sorry!!!

I decided to merge the "other" Parquet-Support-Merge Request (https://github.com/gorillalabs/sparkling/pull/58). Please don't be offended, maybe you need some of that?

Cheers, Chris :)

MafcoCinco commented 7 years ago

No worries @chrisbetz! I'm just happy that Parquet has been added as a serialization format! Thanks for taking a look.

chrisbetz commented 7 years ago

Phhuuuhhh. Glad to hear that. Happy hacking :)