thedataincubator / spark-structured-streaming

http://shop.oreilly.com/product/0636920057482.do

Spark Structured Streaming

A short course on the new, experimental Structured Streaming features in Apache Spark, by The Data Incubator and O'Reilly Strata. The accompanying videos can be purchased on the O'Reilly website: http://shop.oreilly.com/product/0636920057482.do

Installation

To run this tutorial, you need Apache Spark and Jupyter. Install them as follows:

  1. Download and install Apache Spark 2.0.0 by following the instructions on the Apache Spark website. You may need to install Hadoop first.
  2. Install Jupyter:
    pip install jupyter
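
Before moving on, it can help to confirm that both prerequisites are visible from your shell. The following is a minimal sketch, not part of the course materials: the function name `check_prerequisites` is hypothetical, and it assumes that `SPARK_HOME` points at the unpacked Spark distribution (with a `bin/spark-submit` script inside it) and that `pip` placed a `jupyter` executable on your `PATH`.

```python
import os
import shutil


def check_prerequisites(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: SPARK_HOME points at the root of the Spark distribution.
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        problems.append("no bin/spark-submit under SPARK_HOME=" + spark_home)

    # Assumption: `pip install jupyter` put a `jupyter` executable on PATH.
    if shutil.which("jupyter") is None:
        problems.append("jupyter was not found on PATH; try: pip install jupyter")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

If the script prints nothing, both checks passed. It does not verify the Spark version, so confirm separately that you installed 2.0.0.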

Optional

To run the interactive code cells, create an Apache Toree kernel (this assumes Toree itself is already installed and that SPARK_HOME is set):

jupyter toree install --spark_opts='--master=local[2] --executor-memory 4g --driver-memory 4g' \
    --kernel_name=apache_toree --interpreters=PySpark,SparkR,Scala,SQL --spark_home=$SPARK_HOME

Otherwise, you can copy and paste the cells into a Spark shell, which you can start by running:

make spark-shell

Starting the Course

To start the course, run

make notebook

and open the Overview.ipynb notebook. The notebook server normally listens on port 9000; if that port is already in use, it may start on a higher port number instead.
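
If you are unsure which port the notebook ended up on, the terminal output of the server is the authoritative source. As a rough illustration of the fallback behaviour, the sketch below probes upward from 9000 for the first port that can be bound. It only approximates the idea and is not Jupyter's actual implementation; the function name `first_free_port` and the `retries` limit of 50 are assumptions made for this example.

```python
import socket


def first_free_port(start=9000, retries=50, host="127.0.0.1"):
    """Return the first port in [start, start + retries] that can be bound on host."""
    last = min(start + retries, 65535)  # never probe past the valid port range
    for port in range(start, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    raise RuntimeError("no free port found in the probed range")


if __name__ == "__main__":
    print("First free port at or above 9000:", first_free_port())
```

If this prints 9000, nothing else was bound to that port when the check ran. A higher number suggests the notebook may have moved to a higher port as well.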

If you want to experiment with Spark directly, you can also run:

make spark-shell

Credits: The Spark project template is based on https://github.com/nfo/spark-project-template