ian-whitestone / pyspark-vs-dask

[WIP] Comparing pyspark and dask for speed, memory/CPU usage, and ease of use

Update user_data.sh #4

Closed · ian-whitestone closed this 5 years ago

ian-whitestone commented 5 years ago

Automate setup of dask/spark


# Dask Environment
conda create -n dask python=3.6 -y -q
conda activate dask
conda install dask -y
conda install s3fs -c conda-forge -y # dependency for reading S3 files
conda install fastavro -y # for reading avro files with dask
conda deactivate
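
## Optional sanity check (a sketch, not part of the original setup): confirm the env resolved
conda activate dask
python -c "import dask, s3fs, fastavro; print(dask.__version__)"
conda deactivate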

# Spark Environment

## Install Java
sudo apt-get update
sudo apt-get install default-jre -y
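
## Quick check that the JRE landed; JAVA_HOME is derived from this binary in the bashrc section below
java -version
readlink -f /usr/bin/java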

## Spark Conda Env
conda create -n spark python=3.6 -y -q
conda activate spark
conda install -y conda=4.3.30 # pin conda itself to a known-good version
conda install -y pypandoc=1.4 py4j=0.10.7 # py4j pin matches the py4j-0.10.7 shipped with spark 2.3.1
conda install -y pandas # in order to convert spark dataframes back to pandas
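
## Optional check (a sketch) that the pinned packages resolved; printing the pandas version is just illustrative
python -c "import py4j, pypandoc, pandas; print(pandas.__version__)"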

## Download Spark, Hadoop, and the spark-avro jar
wget http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar # referenced in SPARK_DIST_CLASSPATH below
wget https://www.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.9.1/hadoop-2.9.1.tar.gz
tar -xvzf hadoop-2.9.1.tar.gz && rm -f hadoop-2.9.1.tar.gz
tar -xvzf spark-2.3.1-bin-without-hadoop.tgz && rm -f spark-2.3.1-bin-without-hadoop.tgz
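
## Sanity check (assumes the downloads ran from the home directory): both archives
## should unpack into the paths the bashrc exports below expect
ls ~/hadoop-2.9.1/bin/hadoop ~/spark-2.3.1-bin-without-hadoop/bin/spark-submit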

## Install pyspark
cd spark-2.3.1-bin-without-hadoop/python
python setup.py install
conda deactivate
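
## Optional smoke test of the pyspark install; should print 2.3.1
conda activate spark
python -c "import pyspark; print(pyspark.__version__)"
conda deactivate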

## Update bashrc
echo "#### SPARK CONFIGURATIONS ####" >> ~/.bashrc
echo "## JAVA_HOME" >> ~/.bashrc
echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' >> ~/.bashrc
echo "## add hadoop to path" >> ~/.bashrc
echo 'export PATH=/home/ubuntu/hadoop-2.9.1/bin:$PATH' >> ~/.bashrc
echo "## add spark to path" >> ~/.bashrc
echo 'export PATH=/home/ubuntu/spark-2.3.1-bin-without-hadoop/sbin:/home/ubuntu/spark-2.3.1-bin-without-hadoop/bin:$PATH' >> ~/.bashrc
echo "## spark distribution classpath" >> ~/.bashrc
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath):/home/ubuntu/hadoop-2.9.1/share/hadoop/tools/lib/*:/home/ubuntu/spark-avro_2.11-4.0.0.jar' >> ~/.bashrc
echo "## python path fix for conda environments" >> ~/.bashrc
echo 'export PYTHONPATH=/home/ubuntu/spark-2.3.1-bin-without-hadoop/python/lib/py4j-0.10.7-src.zip:/home/ubuntu/spark-2.3.1-bin-without-hadoop/python' >> ~/.bashrc
echo "#### END SPARK CONFIGURATIONS ####" >> ~/.bashrc