locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

In RF Notebook docker image, utils.create_rf_spark_session fails with `import rtree` #452

Closed. vpipkt closed this issue 4 years ago.

vpipkt commented 4 years ago

As originally reported by @mjgolebiewski in the s22s/rasterframes-notebook:0.8.5 image (cbc6ce228c8e), the following code results in a runtime error:

import geopandas
from pyrasterframes.utils import create_rf_spark_session
spark = create_rf_spark_session()

Implicitly, importing geopandas defines some version-checking utilities that import rtree, which at versions 0.9.0 and 0.9.1 modified PATH with a literal semicolon instead of os.pathsep.
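
For illustration, here is a minimal sketch of the failure mode (not rtree's actual code; the appended directory matches the env dump further down):

import os

# os.pathsep is ':' on Linux/macOS and ';' on Windows. rtree 0.9.0/0.9.1
# effectively did the equivalent of this on every platform:
os.environ['PATH'] = os.environ['PATH'] + ';' + '/opt/conda/lib'

# On Linux the last PATH entry is now the bogus '/bin;/opt/conda/lib',
# so /bin drops out of the search path and bash can no longer be found:
print(';' in os.environ['PATH'])  # True

# The portable spelling would be:
# os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/opt/conda/lib'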

Error details:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-d7e73cfb9fde> in <module>
     28 
     29 
---> 30 spark = create_rf_spark_session()
     31 
     32 

/opt/conda/lib/python3.7/site-packages/pyrasterframes/utils.py in create_rf_spark_session(master, **kwargs)
     93              .config('spark.jars', jar_path)
     94              .withKryoSerialization()
---> 95              .config(conf=conf)  # user can override the defaults
     96              .getOrCreate())
     97 

/usr/local/spark/python/pyspark/sql/session.py in getOrCreate(self)
    171                     for key, value in self._options.items():
    172                         sparkConf.set(key, value)
--> 173                     sc = SparkContext.getOrCreate(sparkConf)
    174                     # This SparkContext may be an existing one.
    175                     for key, value in self._options.items():

/usr/local/spark/python/pyspark/context.py in getOrCreate(cls, conf)
    365         with SparkContext._lock:
    366             if SparkContext._active_spark_context is None:
--> 367                 SparkContext(conf=conf or SparkConf())
    368             return SparkContext._active_spark_context
    369 

/usr/local/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    131                     " note this option will be removed in Spark 3.0")
    132 
--> 133         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    314         with SparkContext._lock:
    315             if not SparkContext._gateway:
--> 316                 SparkContext._gateway = gateway or launch_gateway(conf)
    317                 SparkContext._jvm = SparkContext._gateway.jvm
    318 

/usr/local/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
     44     :return: a JVM gateway
     45     """
---> 46     return _launch_gateway(conf)
     47 
     48 

/usr/local/spark/python/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

And in the stdout of the notebook I see: /usr/bin/env: ‘bash’: No such file or directory. This is a subtle point, and I think an important one.

https://gitter.im/locationtech/rasterframes?at=5e2596be5b81ab262e5ae82b

vpipkt commented 4 years ago

Reproducing this in the notebook may center on the Python subprocess.Popen call at pyspark/java_gateway.py:98.

Here's what I extracted from the debugger. Note that env['_PYSPARK_DRIVER_CONN_INFO_PATH'] is created at run time and then, I think, removed by the spark-submit call.

The only reference to bash here is env['SHELL'] = '/bin/bash'; there is no explicit reference to /usr/bin/env in the arguments. The actual spark-submit script has the shebang line #!/usr/bin/env bash... but I'm not sure how the geopandas import causes this to happen.

from subprocess import Popen, PIPE
import signal

command = ['/usr/local/spark/./bin/spark-submit', '--conf', 'spark.master=local[*]', '--conf', 'spark.app.name=RasterFrames', '--conf', 'spark.jars=/opt/conda/lib/python3.7/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.8.5.jar', '--conf', 'spark.serializer=org.apache.spark.serializer.KryoSerializer', '--conf', 'spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator', '--conf', 'spark.kryoserializer.buffer.max=500m', 'pyspark-shell']

env = {'LC_ALL': 'en_US.UTF-8', 'LD_LIBRARY_PATH': ':/opt/conda/lib', 
       'APACHE_SPARK_REMOTE_PATH': 'spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz', 
       'LANG': 'en_US.UTF-8', 'HOSTNAME': '37a98020404f', 'NB_UID': '1000', 
       'CONDA_DIR': '/opt/conda', 'CONDA_VERSION': '4.7.12', 'PWD': '/home/jovyan', 
       'HOME': '/home/jovyan', 'MINICONDA_MD5': '1c945f2b3335c7b2b15130b1b2dc5cf4', 
       'DEBIAN_FRONTEND': 'noninteractive', 'SPARK_HOME': '/usr/local/spark', 
       'NB_USER': 'jovyan', 'HADOOP_VERSION': '2.7', 
       'APACHE_SPARK_FILENAME': 'spark-2.4.4-bin-hadoop2.7.tgz', 'SHELL': '/bin/bash', 
       'SPARK_OPTS': '--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info', 
       'APACHE_SPARK_VERSION': '2.4.4', 'SHLVL': '0', 'LANGUAGE': 'en_US.UTF-8', 
       'PYTHONPATH': '/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip', 'RF_LIB_LOC': '/usr/local/rasterframes', 
       'APACHE_SPARK_CHECKSUM': '2E3A5C853B9F28C7D4525C0ADCB0D971B73AD47D5CCE138C85335B9F53A6519540D3923CB0B5CEE41E386E49AE8A409A51AB7194BA11A254E037A848D0C4A9E5', 
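       # Note the literal ';' near the end of the PATH value below
       # ('/bin;/opt/conda/lib'): this is the corruption being investigated.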
       'XDG_CACHE_HOME': '/home/jovyan/.cache/', 'NB_GID': '100', 'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin;/opt/conda/lib', 
       'MINICONDA_VERSION': '4.7.10', 'KERNEL_LAUNCH_TIMEOUT': '40', 'JPY_PARENT_PID': '7', 'TERM': 'xterm-color', 'CLICOLOR': '1', 'PAGER': 'cat', 'GIT_PAGER': 'cat', 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline', 
       '_PYSPARK_DRIVER_CONN_INFO_PATH': '/tmp/tmpeomq27my/tmpf81aik85'}

def preexec_func():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)

vpipkt commented 4 years ago

In a notebook, I did this:

[screenshot: two runs of the Popen command; the second returns exit code 127]

In the second case, which exits with return code 127, I see the message /usr/bin/env: ‘bash’: No such file or directory in stderr.
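
Reconstructing the check as a sketch (the exact cells are in the screenshot above; 127 is the shell's exit code for "command not found"):

import subprocess

# Before importing geopandas, bash resolves normally:
print(subprocess.call(['/usr/bin/env', 'bash', '-c', 'true']))   # 0

import geopandas  # transitively imports rtree, which mangles os.environ['PATH']

# The child process inherits the corrupted PATH, so env can't find bash:
print(subprocess.call(['/usr/bin/env', 'bash', '-c', 'true']))   # 127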

metasim commented 4 years ago

Is geopandas stomping on env?

vpipkt commented 4 years ago

Yes... it seems to be something about conda?

[screenshot]

Will try to figure out more...

metasim commented 4 years ago

Semicolon is the Windows path separator...

vpipkt commented 4 years ago

I have narrowed it down to this:

$ python -c "import os; print(';' in os.environ['PATH']); import rtree; print(';' in os.environ['PATH']);"
False
True

So basically it is something to do with rtree: either the specific version (0.9.1) or the way it's packaged with conda.

vpipkt commented 4 years ago

Here is the upstream issue:

https://github.com/Toblerity/rtree/issues/126

Fixed by this PR: https://github.com/Toblerity/rtree/pull/125

Merged Dec 3, 2019. The fix should be available from version 0.9.2 onward.

vpipkt commented 4 years ago

Working on a fix for this that raises the container's minimum rtree version.
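
For reference, a quick runtime sanity check one could run in the image to confirm the upgrade took effect. This is a sketch, not the actual container change, and it assumes rtree exposes __version__:

import rtree

# The PATH fix (Toblerity/rtree#125) shipped in 0.9.2, so require at least that.
version = tuple(int(p) for p in rtree.__version__.split('.')[:3])
assert version >= (0, 9, 2), 'rtree < 0.9.2 corrupts PATH on non-Windows platforms'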