nichollsbr / cmsc611-project



Data from: (Just get the file from the shared drive)

Download Spark 2.1.0, pre-built for Hadoop 2.7 and later. Move the downloaded .tgz file to some directory (like ~/spark/), then run:

tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz

Then put the bin directory inside the unpacked Spark directory on your PATH. On macOS, this can be done by putting the following line in ~/.bash_profile. On Ubuntu 16.04, put the following line at the end of ~/.bashrc:

export PATH=$PATH:/path/to/spark/directory/bin

You'll know this is done right if, after restarting your command line, you can tab-complete the spark-submit command. Alternatively, type spark-submit into your terminal and make sure it prints out the long list of options.
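The PATH check can also be scripted. This is a minimal sketch, assuming a POSIX-compatible shell; the helper name on_path is made up for illustration.

```shell
# Hypothetical helper: succeeds if the given command resolves on PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

if on_path spark-submit; then
    echo "spark-submit found at: $(command -v spark-submit)"
else
    echo "spark-submit is not on PATH; check the export line in your shell profile"
fi
```

If the second message prints, open a new terminal (or source the profile file) and try again before changing anything else.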

This might be helpful for Windows users:

You'll need to download the file from the shared directory. Put it in the data/ directory, cd into data (i.e. be in the data directory), and run the script.

cd data

You'll also want to run:

mkdir /tmp/spark-dir/

We can change the directory if that's an issue on Windows.
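If the location does need to change, the directory step could be written so the path is easy to override. This is only a sketch: SCRATCH_DIR is a hypothetical variable name, and the jobs themselves would still have to be pointed at whatever path is chosen.

```shell
# SCRATCH_DIR is a hypothetical name; the default matches the path used above.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/spark-dir}"

# -p creates missing parent directories and does not fail if the directory already exists.
mkdir -p "$SCRATCH_DIR"
echo "Using scratch directory: $SCRATCH_DIR"
```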

You'll also need:


First, switch to the data directory after downloading the file from the shared directory and putting it in data/.

Then switch back to the cmsc611-project directory and make the spark-stats directory.

mkdir spark-stats

To build the relevant jar, run Maven (the mvn command) using the script or run:

mvn -DskipTests=true clean install

in the parent directory (cmsc611-project). This should create a target directory in each child directory; for instance, basic-rdd/target/ should exist. Note that this target directory will NOT be there if the mvn clean install command did not work. If Maven becomes too much of a pain, we can just commit the jar to the git repo.
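A quick way to confirm the build worked is to look for a jar under a module's target directory. The sketch below assumes Maven's usual behavior of writing the packaged jar into target/; the helper name has_built_jar is made up for illustration.

```shell
# Hypothetical helper: succeeds if <module>/target/ exists and contains at least one .jar file.
has_built_jar() {
    module_dir="$1"
    [ -d "$module_dir/target" ] || return 1
    # ls exits non-zero when the glob matches nothing.
    ls "$module_dir"/target/*.jar >/dev/null 2>&1
}

# Run from the parent directory (cmsc611-project).
if has_built_jar basic-rdd; then
    echo "basic-rdd: jar found"
else
    echo "basic-rdd: no jar under target/; re-run mvn clean install and check its output"
fi
```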

To run a job locally on your machine, take a look at the relevant file in the analytic folder. For instance, basic-rdd/

Here's some info about how the spark-submit command works:

To run all of the jobs, run in the main directory.

You'll see that it will set up each spark app name so that it matches the output file for the metrics.
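For orientation, the general shape of a local spark-submit call with an explicit app name can be sketched as below. --class, --master, and --name are standard spark-submit options, but the class name, jar path, and app name are placeholders, not this project's actual values, and the project's scripts may set the name differently. The command is only printed here, not run.

```shell
# Placeholders for illustration only; substitute the real values from the run script.
APP_NAME="example-app"                # hypothetical app name
MAIN_CLASS="com.example.ExampleJob"   # hypothetical main class
APP_JAR="path/to/your-job.jar"        # hypothetical jar path

# Print the command instead of running it (dry run).
# local[*] runs Spark on this machine using all available cores.
print_submit_cmd() {
    echo spark-submit \
        --class "$MAIN_CLASS" \
        --master "local[*]" \
        --name "$APP_NAME" \
        "$APP_JAR"
}

print_submit_cmd
```

Keeping the app name in one variable makes it easy to reuse the same string when naming the metrics output file.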

The following pages are helpful to understand the metrics configuration:

Note that the problem with flight recorder mode is that it doesn't seem to stop the SparkSession. We should explore putting the metrics in the runner, or modifying the code for the sparkMeasure job.

The authors' pages:

Spark UI for Active Job


If it's not there, that's because the port is already taken, probably by a job that was improperly closed. See the Spark UI output.

To Start History Server

You'll need to create /tmp/spark-events. Need to double-check what to do for Windows. For some reason, my Mac deleted this directory at one point, not sure why. Then go to $SPARK_HOME/sbin/
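The history server setup might look like the sketch below. It assumes the start-history-server.sh script that ships in Spark's sbin directory and Spark's default event log location of /tmp/spark-events. Jobs only show up in the history server if they were run with spark.eventLog.enabled=true. EVENT_DIR is a variable name made up for illustration.

```shell
# EVENT_DIR is a hypothetical name; /tmp/spark-events is Spark's default event log directory.
EVENT_DIR="/tmp/spark-events"
mkdir -p "$EVENT_DIR"

# Only try to start the server if SPARK_HOME is set and the script is present.
if [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/sbin/start-history-server.sh" ]; then
    "$SPARK_HOME/sbin/start-history-server.sh"
else
    echo "SPARK_HOME is not set or the history server script was not found"
fi
```

By default the history server UI listens on port 18080.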


Useful Stuff