locationtech-labs / geopyspark

GeoTrellis for PySpark

Unable To spark-submit GeoPySpark Scripts #630

Closed: jamesmcclain closed this issue 5 years ago

jamesmcclain commented 6 years ago

This issue was originally reported in the geodocker repository, but I have migrated it here because it appears to be a deeper issue than just configuration.

There is evidently a difference in when and how jars are loaded when a GeoPySpark Python script is spark-submitted versus when one is run in Jupyter (where code appears to be piped through a spark-submitted pyspark-shell).
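For reference, a minimal reproduction script might look like the sketch below. This is hypothetical: the actual /home/hadoop/test.py is not shown in this issue, and the GeoTiff URI is only a placeholder.

```python
# Hypothetical stand-in for /home/hadoop/test.py.  Any call that reaches the
# GeoTrellis backend classes should be enough to surface the missing-jar error.
import geopyspark as gps
from pyspark import SparkContext

# geopyspark_conf() builds a SparkConf; when the JVM is launched by this
# process (e.g. plain `python test.py` or a Jupyter kernel), the backend
# assembly jar is placed on the classpath at JVM startup.
conf = gps.geopyspark_conf(appName="geopyspark-submit-test")
sc = SparkContext(conf=conf)

# Touch the Scala side: reading a GeoTiff goes through the geotrellis-backend
# assembly.  The URI here is only a placeholder.
layer = gps.geotiff.get(layer_type=gps.LayerType.SPATIAL,
                        uri="file:///tmp/example.tif")
print("created RasterLayer:", layer)
```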

A simple test script can succeed when it is run like this:

PYSPARK_PYTHON=/usr/bin/python3.4 PYSPARK_DRIVER_PYTHON=/usr/bin/python3.4 spark-submit --jars /opt/jars/geotrellis-backend-assembly-0.3.1.jar /home/hadoop/test.py

but not like this:

PYSPARK_PYTHON=/usr/bin/python3.4 PYSPARK_DRIVER_PYTHON=/usr/bin/python3.4 spark-submit /home/hadoop/test.py

Log output from the latter case contains the following line:

18/02/05 22:00:37 INFO SparkContext: Added JAR /opt/jars/geotrellis-backend-assembly-0.3.1.jar at spark://172.31.26.186:45578/jars/geotrellis-backend-assembly-0.3.1.jar with timestamp 1517868037194

indicating that the required jar is being registered with the SparkContext, but evidently not at the right time and/or in the right way for its classes to actually be usable.
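One way to observe the difference directly might be the diagnostic sketch below. This is not from the original report; it simply asks the driver JVM, through the py4j gateway, whether it can resolve a well-known GeoTrellis class.

```python
# Diagnostic sketch (not from the original report): ask the driver JVM whether
# a GeoTrellis class is resolvable.  With the jar on the driver classpath at
# JVM startup (e.g. passed via --jars at submit time) this should succeed; if
# the jar is only registered with the running SparkContext later, Class.forName
# raises ClassNotFoundException, which py4j surfaces as a Py4JJavaError.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
tile_cls = sc._jvm.java.lang.Class.forName("geotrellis.raster.Tile")
print(tile_cls.getName())
```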