sequenceiq / docker-spark

Apache License 2.0
765 stars · 282 forks

Using pyspark in standalone scripts #41

Open · kynan opened 8 years ago

kynan commented 8 years ago

How can I import pyspark to create a SparkContext in a standalone script?

Running

PYTHONPATH=/usr/local/spark/python python -c 'import pyspark'

fails:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/usr/local/spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

And indeed, py4j only exists as a zip file under $SPARK_HOME; it is not actually "installed".

kynan commented 8 years ago

My current workaround is manually unzipping py4j inside the container:

(
cd $SPARK_HOME/python
unzip lib/py4j-0.8.2.1-src.zip
)
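The unzip step can be avoided entirely: CPython's zipimport machinery imports packages straight from a zip archive on sys.path, which is the same mechanism that lets you append the bundled py4j zip to PYTHONPATH instead of extracting it. A minimal stdlib sketch (demomod is a made-up module name used purely for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing a package, mimicking the layout of
# Spark's bundled py4j src zip.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "demomod-src.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demomod/__init__.py", "ANSWER = 42\n")

# A zip on sys.path is treated like a directory, so nothing
# needs to be extracted before importing from it.
sys.path.insert(0, zip_path)
import demomod

print(demomod.ANSWER)  # -> 42
```

The same idea applied to this container would be adding both $SPARK_HOME/python and the py4j zip under $SPARK_HOME/python/lib to PYTHONPATH before running the script.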
maxgrenderjones commented 7 years ago

Try using findspark.