rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
http://bit.ly/agile_data_science
MIT License

ImportError: No module named 'pyspark' #78

Closed sonhmai closed 5 years ago

sonhmai commented 6 years ago

I got ImportError: No module named 'pyspark' when running python ch02/pyspark_mongodb.py.

The other examples in chapter 2 ran fine.

emmajeanmaher commented 6 years ago

pip install pyspark --user worked for us.

rjurney commented 5 years ago

All pyspark examples are intended to be run inside the pyspark shell. Sounds like you ran them in python?
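A quick way to tell which situation you are in is to ask the interpreter whether it can see the pyspark package at all. The snippet below is a minimal sketch using only the standard library; the helper name module_available is made up for illustration and is not part of the book's code.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the named top-level module."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_available("pyspark"):
        print("pyspark is importable from", sys.executable)
    else:
        print("pyspark is not importable from", sys.executable)
        print("Run the example inside the pyspark shell, or try: pip install pyspark --user")
```

If this reports that pyspark is not importable, the script was most likely started with plain python or ipython rather than inside the pyspark shell.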

arnaudbouffard commented 5 years ago

@rjurney You didn't get an answer here, but I think I fell into the same trap, though further into the book. In the Processing Streams with PySpark Streaming section, the sentence

Now, in iPython, the following code will initialize a PySpark StreamingContext. You can follow along in ch02/pyspark_streaming.py.

got me opening a new ipython console at the root of the EC2 filesystem and... getting a No module named 'pyspark' error.

From your answer to this issue, I understand that the code instead needs to be run inside the PySpark session that's opened with

pyspark --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0

This really isn't obvious, as it is the first time the book uses "iPython" to refer to the Spark console (the Figure 2-14 legend says "iPython PySpark console", but that is easy to miss).

rjurney commented 5 years ago

@arnaudbouffard Thanks. It looks like I should load that in all pyspark sessions. Ideally all scripts would run in straight Python; currently, however, the intention is for all work to occur in the new Jupyter notebooks for each chapter, for example ch02/Agile_Tools.ipynb.
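For readers who want to try the straight-Python route anyway, one common approach is to pass the same --packages option through the PYSPARK_SUBMIT_ARGS environment variable before pyspark starts its JVM. The following is only a sketch under several assumptions: pyspark is importable (for example after pip install pyspark --user), the installed Spark version is compatible with the package coordinates quoted above, and the helper build_submit_args is a hypothetical name, not something from the repository.

```python
import os

# Package coordinates copied from the pyspark command quoted earlier in this thread
KAFKA_PACKAGE = "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0"


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value requesting the given packages (hypothetical helper)."""
    # The trailing "pyspark-shell" token is expected when pyspark is launched from plain Python
    return "--packages {} pyspark-shell".format(",".join(packages))


# This must be set before the first SparkContext is created in the process
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([KAFKA_PACKAGE])

# With pyspark installed, a script could then continue along these lines:
# from pyspark import SparkContext
# sc = SparkContext(appName="example")  # "example" is a placeholder app name
```

This has not been checked against the book's EC2 or Vagrant images, so treat it as a starting point rather than a supported setup.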