jcboyd / spark-demo

Demo use case for Apache Spark

spark-demo

Apache Spark is a cluster computing framework that runs atop distributed storage such as HDFS (among others) and offers substantial performance improvements over Hadoop MapReduce. To install Spark, first ensure Hadoop is installed on your system; a demo for configuring Hadoop on OS X is given at https://github.com/jcboyd/hadoop-demo. Spark can be installed with

$ brew install spark

which places it in /usr/local/Cellar/apache-spark. Programming in Spark centers on data structures known as Resilient Distributed Datasets (RDDs), which are partitioned across the nodes of the cluster. Spark is written primarily in Scala and Java and comes with shell interfaces for Scala, Python, and R. For example, launch the Scala shell with
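As a minimal sketch of the RDD workflow (assuming a running spark-shell, where sc is the preconfigured SparkContext; the example data is invented for illustration):

```scala
// In the spark-shell, sc is a preconfigured SparkContext
val numbers = sc.parallelize(1 to 10)  // distribute a local collection as an RDD
val squares = numbers.map(n => n * n)  // transformations such as map are lazy
squares.collect()                      // actions trigger computation; returns Array(1, 4, 9, ..., 100)
```

Transformations build up a lineage of operations without executing them; only an action such as collect(), count(), or saveAsTextFile() forces evaluation across the cluster.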

$ /usr/local/Cellar/apache-spark/1.6.0/bin/spark-shell

The functional language Scala lends itself well to Spark. A MapReduce-style word count can be expressed succinctly with

val textFile = sc.textFile("file:///usr/local/Cellar/apache-spark/1.6.0/README.md")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
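The same logic can be sketched on plain Scala collections for intuition, without a Spark installation (the sample lines below are invented for illustration): flatMap and map behave identically, and reduceByKey corresponds to grouping by key and summing.

```scala
// Plain-Scala analogue of the RDD word count above
val lines = List("to be or", "not to be")
val wordCounts = lines
  .flatMap(line => line.split(" "))               // split each line into words
  .map(word => (word, 1))                         // pair each word with a count of 1
  .groupBy { case (word, _) => word }             // group pairs by word (reduceByKey's grouping step)
  .map { case (word, pairs) =>                    // sum the counts within each group
    (word, pairs.map { case (_, n) => n }.sum)
  }
// wordCounts: Map(to -> 2, be -> 2, or -> 1, not -> 1)
```

The key difference is that the RDD version distributes this computation across the cluster and evaluates lazily, whereas the collection version runs eagerly in a single JVM.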

A number of data processing libraries are provided with the distribution, notably MLlib, a machine learning library. A linear regression demo using this library is given in LinearRegression.scala. This can be compiled with the Scala build tool sbt

$ brew install sbt
$ sbt package
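For sbt package to work, the project needs a build.sbt at its root. A minimal sketch, with name, version, and Scala version inferred from the jar name in the spark-submit command below (the exact contents of the repository's build file may differ):

```scala
// build.sbt -- hypothetical, consistent with target/scala-2.10/spark-demo_2.10-1.0.jar
name := "spark-demo"

version := "1.0"

scalaVersion := "2.10.6"

// "provided" keeps Spark out of the packaged jar, since the cluster supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.6.0" % "provided"
```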

The model can then be run with the command

$ /usr/local/Cellar/apache-spark/1.6.0/bin/spark-submit --class "LinearRegression" --master local[4] target/scala-2.10/spark-demo_2.10-1.0.jar