Welcome to the open repository, documentation and materials for the Code4Lib 2018 Spark in the Dark 101 Workshop!
This is an introductory session on Apache Spark, a framework for large-scale data processing. We will introduce high-level concepts around Spark, including how Spark execution works and its relationship to other technologies for working with Big Data. Following this introduction to the theory and background, we will walk workshop participants through hands-on usage of spark-shell, Zeppelin notebooks, and Spark SQL for processing library data. The workshop will wrap up with use cases and demos for leveraging Spark within cultural heritage institutions and information organizations, connecting the building blocks learned to current projects in the real world.
This is a registration-only workshop, held as part of Code4Lib 2018 in Washington, D.C. For registration information or other conference-level logistical questions, please check the Code4Lib 2018 Registration page. If you have questions about the workshop specifically, you can contact the workshop leaders using the information below.
We ask that all participants come to the workshop ready to dive in by reviewing the information in this document. If you want a sneak peek of the workshop's contents, feel free to also watch this repository (you'll be notified of updates). To watch this repository, sign in to GitHub, go to this repository's home URL, and click the Watch button.
Time | Topic | Leader(s) |
---|---|---|
9-9:10 AM | Workshop Introduction, Logistics, Goals (10 minutes) | Christina |
9:10-9:25 AM | Spark Theory: Optimal use cases for Spark (15 minutes) | Audrey |
9:25-9:40 AM | Spark Theory: Spark Architecture (sparkitecture?) (15 minutes) | Michael |
9:40-9:55 AM | Spark Theory: RDD vs. DataFrame APIs (15 minutes) | Scott |
9:55-10:15 AM | Spark Practice: Env/setup (20 minutes) | Mark & Justin |
10:15-10:30 AM | break (15 minutes) | n/a |
10:30-10:50 AM | Spark Practice: Working with spark-shell (20 minutes) | Christina & Audrey |
10:50-11:10 AM | Spark Practice: Working with Zeppelin (20 minutes) | Christina & Audrey |
11:10-11:30 AM | Spark Practice: Interacting with Real World Data (20 minutes) | Whole Group |
11:30-Noon | Examples & Wrap-Up (30 minutes) | Whole Group |
If you have questions or concerns before or after the workshop, particularly about workshop preparation or installation issues, please open an issue on this GitHub repository: https://github.com/spark4lib/code4lib2018/issues (you will need to log in or create a free GitHub account). Using issues lets multiple workshop leaders respond as they are able, and lets other participants learn from the answers, since we're sure the same questions will come up multiple times.
During the workshop, we will indicate the best ways to get help or raise a question or comment. However, this workshop is intended to be informal, so feel free to speak up or indicate you have a question at any time.
To keep this workshop a safe and inclusive space, we ask that you review and follow the Code4Lib 2018 Code of Conduct and the Recurse Center Social Rules (aka Hacker School Rules).
We request that all participants:
We will be sending out an email with the specific Docker image information before Monday.
If you have any issues with the above, please contact us ASAP using the communication methods detailed above.
Pull this repository (`git pull origin master`) or download it, and make sure you have the latest copy. Then pull and run the workshop Docker image from the repository directory:
docker pull mbdpla/sparkworkshop:latest
docker run -p 8080:8080 -v $PWD:/code4lib2018 -e ZEPPELIN_NOTEBOOK_DIR='/code4lib2018/notebooks' mbdpla/sparkworkshop:latest
These commands should download and start our Zeppelin Docker image on your machine. To check that it is running, open a web browser and go to http://localhost:8080. You should see the Zeppelin Notebook homepage with 2 notebooks loaded. Zeppelin will save notebooks directly to the directory holding your copy of this repository, so be aware of that!
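If you would rather check from a terminal than a browser, the following is a minimal sketch of the same check in Python, using only the standard library. The function name `zeppelin_is_up` and the 5-second timeout are our own illustrative choices, not part of the workshop materials. It only confirms that some HTTP server answers on the port, not that the notebooks loaded.

```python
import http.client
import urllib.error
import urllib.request


def zeppelin_is_up(url="http://localhost:8080", timeout=5.0):
    """Return True if an HTTP server answers at `url` with a 2xx/3xx status.

    Hypothetical helper: the default URL matches the port published by the
    `docker run -p 8080:8080 ...` command; the timeout is an assumption.
    """
    # Skip any system-wide HTTP proxy, since the target is a local container.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
```

Calling `zeppelin_is_up()` a short while after `docker run` should return `True` once Zeppelin has finished starting. A `False` result right after launch may just mean the container is still booting.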
We recommend re-pulling both this repository and the Docker image Monday evening, if possible, to make sure you have the latest version of the materials.
On the day of the workshop, we will also bring thumb drives with our workshop Docker image on them.
Either way, you’ll need to pull the data and notebooks from this GitHub repository.