mesosphere / spark-build

Used to build the mesosphere/spark docker image and the DC/OS Spark package

Cannot specify Python version when launching PySpark jobs #73

Open mooperd opened 7 years ago

mooperd commented 7 years ago

Whilst trying to use python3 as the PySpark driver, I have found that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON seem to be ignored when launching jobs using the dcos spark tool.

According to the Spark documentation, SPARK_HOME/conf/spark-env.sh can be used to set various variables when launching Spark jobs on Mesos: http://spark.apache.org/docs/latest/configuration.html#environment-variables

I have copied ~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh.template to ~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh

and added the lines:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

I have also tried the following in the shell before running jobs:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

Putting these directly in the spark-submit shell script also does not work, which brings me to the conclusion that these environment variables are being stripped out somewhere. I don't see any errors anywhere.

I'm testing the python version with:

import sys

# sc is the SparkContext already created by the job
version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
mgummelt commented 7 years ago

~/.dcos/spark/dist contains your local distribution of Spark, but it has no effect on the driver or the executors, which all run inside the Docker image in the cluster. You'd have to modify spark-env.sh in the Docker image.
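
A minimal sketch of that approach, assuming a custom image layered on top of mesosphere/spark (the base image tag, the conf path inside the image, and the myorg/spark:python3 name are assumptions, not taken from this thread):

# Sketch: bake the interpreter choice into a custom image.
cat > Dockerfile <<'EOF'
FROM mesosphere/spark:2.0.0
# Append the interpreter settings to spark-env.sh inside the image
# (the /opt/spark/dist path is an assumption about the image layout).
RUN echo 'export PYSPARK_PYTHON=python3' >> /opt/spark/dist/conf/spark-env.sh && \
    echo 'export PYSPARK_DRIVER_PYTHON=python3' >> /opt/spark/dist/conf/spark-env.sh
EOF
docker build -t myorg/spark:python3 .
docker push myorg/spark:python3

The cluster would then need to be pointed at that custom image, for example via the package's Docker image setting or spark.mesos.executor.docker.image.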

mooperd commented 7 years ago

@mgummelt This seems like a somewhat backwards way to use Spark. Typically one should have control over these variables when starting jobs.

I think this means that Spark applications are not easy to port to DC/OS.

mgummelt commented 7 years ago

Can you give me an example of how you would set these outside of DC/OS?

When submitting in cluster mode, I'm not aware of any other system (YARN, Standalone) that forwards those environment variables along to the driver.
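
For comparison, on YARN the interpreter is normally pinned through Spark configuration rather than the submitter's shell environment. A rough cluster-mode sketch (the application file is a placeholder):

# Sketch: set the Python interpreter for the YARN application master (driver)
# and for the executors via Spark conf instead of exported variables.
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=python3 \
  test.py
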

mooperd commented 7 years ago

It is common to switch your version of Python using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and I use this often. It's most common when testing between Python 2 and Python 3; however, in one specific case I have seen three different versions of Anaconda Python installed on a Hadoop cluster, with different dependencies and custom modules set up.

The Spark documentation also says that these variables should be controllable when using spark-submit - http://spark.apache.org/docs/latest/configuration.html#environment-variables - but it's a bit confusing, as YARN seems to have about three hundred submission modes.

From the Cloudera blog comes the following passage:

In the best possible world, you have a good relationship with your local sysadmin and they are able and willing to set up a virtualenv or install the Anaconda distribution of Python on every node of your cluster, with your required dependencies. If you are a data scientist responsible for administering your own cluster, you may need to get creative about setting up your required Python environment on your cluster. If you have sysadmin or devops support for your cluster, use it!

http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

Test on AWS EMR

test.py

import sys
from pyspark import SparkContext

sc = SparkContext()
version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
sc.stop()
exit()

Run without VARS set

[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:47:16 INFO __main__: Python Version: 2.7.10 (default, Jul 20 2016, 20:53:27)
</snip>

Run with VARS set

[hadoop@ip-10-141-1-236 ]$ export PYSPARK_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ export PYSPARK_DRIVER_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:49:47 INFO __main__: Python Version: 3.4.3 (default, Jul 20 2016, 21:31:36) 
</snip>
mooperd commented 7 years ago

@mgummelt - could we reopen this issue?

jstremme commented 4 years ago

I came across this recently when using AWS EMR and was able to get set up with a Python 3.6.8 driver to match the version of my worker nodes with the following steps after SSHing into the master node:

Update package manager

sudo yum update

Install Anaconda - you may need to close and reopen your shell after this

wget https://repo.continuum.io/archive/Anaconda3-2019.10-Linux-x86_64.sh
sh Anaconda3-2019.10-Linux-x86_64.sh

Create virtual environment

conda create -n py368 python=3.6.8
source activate py368

Install Python packages

pip install --user jupyter
pip install --user ipython
pip install --user ipykernel
pip install --user numpy
pip install --user pandas
pip install --user matplotlib
pip install --user scikit-learn

Create notebook kernel

python -m ipykernel install --user --name py368 --display-name "Python 3.6.8"

Pull repo

sudo yum install git
git clone https://github.com/jstremme/DATA512-Research.git

PySpark configuration

export PYTHONPATH="/home/hadoop/.local/lib/python3.6/site-packages:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=/home/hadoop/.local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser'
export PYSPARK_PYTHON=/usr/bin/python3
echo $PYTHONPATH
echo $PYSPARK_DRIVER_PYTHON
echo $PYSPARK_DRIVER_PYTHON_OPTS
pyspark
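
Since the driver starts Jupyter with --no-browser, the notebook is usually reached through an SSH tunnel from the workstation. A sketch (the key path and master hostname are placeholders, and 8888 is assumed as Jupyter's default port; this step is not part of the original list above):

# Sketch: forward the notebook port from the EMR master node to the local machine.
ssh -i ~/mykey.pem -N -L 8888:localhost:8888 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com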