Data from: https://www.kaggle.com/noaa/noaa-global-surface-summary-of-the-day (Just get the file from the shared drive)
Download spark 2.1.0 Pre-Built for Hadoop 2.7 and later from http://spark.apache.org/downloads.html Move the download .tgz file to some directory (like ~/spark/) then run: tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz
Then put the bin directory inside of the unpackaged spark directory on your PATH. In mac, this can be done by putting the following line in ~/.bash_profile In ubuntu 16.04, put the following line at the end of ~/.bashrc
export PATH=$PATH:/path/to/spark/directory/bin
You'll know this is done right, after restarting your command line, you can tab complete the spark-submit command. Or just type spark-submit into your command line terminal and make sure it prints out the long list of options.
This might be helpful for Windows users: https://www.ics.uci.edu/~shantas/Install_Spark_on_Windows10.pdf https://dzone.com/articles/working-on-apache-spark-on-windows
You'll need to download the gsod_csv.zip from the shared directory. Put that in the data/ directory, cd data (i.e. be in the data directory), and run the unpack_noaa.sh script.
cd data
./unpack_noaa.sh
You also want to
mkdir /tmp/spark-dir/
We can change the directory if that's an issue for windows.
You'll also need:
export M2_HOME=/path/to/maven #(probably /opt/apache-maven-3.6.0)
export PATH=$M2_HOME/bin:$PATH
First, switch to data directory unpack_noaa.sh after downloading gsod_csv.zip from the shared directory and putting it in data.
Then switch back to the cmsc611-project directory and make the spark-stats directory.
mkdir spark-stats
To build the relevant jar, run Maven (the mvn command) using the build.sh script or run:
mvn -DskipTests=true clean install
in the parent directory (cmsc611-project). If maven becomes too much of a pain, we can just commit the jar to the git repo. This should create a target directory in the children directories. For instance, basic-rdd/target/ should exist. Note this target directory will NOT be there if the mvn clean install command did not work.
To run the job locally on your machine, take a look at the relevant run.sh file in the analytic folder. For instance, basic-rdd/run.sh. Here's some info about how the spark-submit command works: https://spark.apache.org/docs/2.1.0/submitting-applications.html#launching-applications-with-spark-submit
To run all of the jobs, run run_all.sh in the main directory.
You'll see that it will set up each spark app name so that it matches the output file for the metrics.
The following pages are helpful to understand the metrics configuration: https://github.com/LucaCanali/sparkMeasure/blob/master/docs/Flight_recorder_mode.md
Note the problem with the flight recorder mode is that it doesn't seem to stop the sparksession. We should explore putting the metrics in the runner, or modifying the code for the sparkMeasure job.
The authors pages:
https://externaltable.blogspot.com/2017/03/on-measuring-apache-spark-workload.html
If it's not there, that's because the port is already taken, probably by a job improperly closed. See spark UI output.
You'll need to create /tmp/spark-events. Need to double check what to do for Windows. For some reason, my mac deleted this at one point, not sure why.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-history-server.html go to $SPARK_HOME/sbin/start-history-server.sh
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-history-server.html