rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
http://bit.ly/agile_data_science
MIT License

Issues running on EC2 #71

Closed: peopzen closed this issue 5 years ago

peopzen commented 6 years ago
  1. Registered an AWS account, created a user, generated an access_id and secret_key, and chose a region for the "aws configure" command; the EC2 instance installation passed. It took time to get everything in place.

  2. Ran ec2_create_tunnel.sh to create the connection; http://localhost:8080 works instantly.

  3. SSHed to the instance and started jupyter-notebook; http://localhost:8888 works fine, but http://localhost:5000 is still not working (see the first sketch after this list).

  4. The pyspark command works and the test code passed.

  5. mongo agile_data_science works and the test code passed.

  6. Running ./ch02/pyspark_mongodb.py failed, saying no pyspark was found. "pip list" didn't show pyspark, so I ran "pip install pyspark" and reran; now it says "sc not found". After adding "from pyspark import SparkContext" and "sc = SparkContext()", it generates messy errors. Frustrated... (see the second sketch after this list).

  7. Tried "curl -XPUT 'localhost:9200/...."; it reports "Connection refused", and the elasticsearch command is not found (see the third sketch after this list).
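
On step 3: the SSH tunnel itself is fine, but port 5000 only answers once one of the book's Flask web apps is running on the instance; until then the tunnel connects to nothing and looks "not working". A minimal sketch of what listens on that port, assuming a Flask app like the book's (the file name here is hypothetical):

```python
# flask_check.py -- hypothetical stand-in for one of the book's web apps.
# Until something like this is started on the instance, requests through
# the tunnel to http://localhost:5000 will be refused.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Flask is listening on port 5000"

if __name__ == "__main__":
    # Flask's default port is 5000, matching the tunnel from step 2
    app.run(host="0.0.0.0", port=5000)
```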
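
On step 6: the chapter scripts assume an interactive pyspark console, where `sc` is already defined; run under plain python they fail exactly as described. A minimal sketch of running one standalone instead (e.g., under spark-submit), assuming default Spark configuration:

```python
# Standalone version of the first lines of ch02/pyspark_mongodb.py.
# In the pyspark shell, `sc` already exists; in a plain python or
# spark-submit run, it must be created explicitly.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pyspark_mongodb")
sc = SparkContext(conf=conf)

# Relative paths are resolved against the launch directory, not the script.
csv_lines = sc.textFile("data/example.csv")
print(csv_lines.count())
```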
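
On step 7: "Connection refused" from curl means nothing is listening on port 9200, i.e. the Elasticsearch service is not running on the instance yet. A quick connectivity check, a sketch assuming the `requests` package is installed:

```python
# es_check.py -- verify Elasticsearch is up before retrying the curl -XPUT.
import requests

try:
    resp = requests.get("http://localhost:9200", timeout=5)
    print(resp.status_code, resp.json())  # cluster info JSON when ES is up
except requests.ConnectionError:
    print("Elasticsearch is not running; start the service and retry")
```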

rjurney commented 6 years ago

@peopzen Thanks for reporting this. Let me try to reproduce this...

sewald101 commented 6 years ago

Inserting these lines before the pyspark import in ~/Agile . . . /lib/pymongo_spark.py got me past the import error @peopzen described at step 6: "import findspark" followed by "findspark.init()".
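
A sketch of that workaround as it would sit at the top of lib/pymongo_spark.py (placement per the description above):

```python
# Top of lib/pymongo_spark.py, before any pyspark import.
import findspark
findspark.init()  # finds SPARK_HOME and puts pyspark on sys.path

import pyspark  # now importable even outside a pyspark-launched process
```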

But now it's as if pyspark is not "activated" in pyspark_mongodb.py. I get this error:

    Traceback (most recent call last):
      File "pyspark_mongodb.py", line 10, in <module>
        csv_lines = sc.textFile("data/example.csv")
    NameError: name 'sc' is not defined

It's as if the script never launched a pyspark session, leaving the 'sc' object undefined.

Solved: my mistake. I wasn't running the script in pyspark. I needed to do this in the ssh terminal:

    spark-submit pyspark_mongodb.py

Then, though, it couldn't find example.csv, so I had to change that line in pyspark_mongodb.py to the proper path:

    csv_lines = sc.textFile("../data/example.csv")
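
For context on why the path change works (a sketch, assuming the script is launched from inside ch02/): `sc.textFile()` resolves relative paths against the driver's working directory, not the script's location.

```python
# Launched as: spark-submit pyspark_mongodb.py   (from inside ch02/)
# Working directory:      Agile_Data_Code_2/ch02
# Data actually lives in: Agile_Data_Code_2/data
csv_lines = sc.textFile("../data/example.csv")  # one level up from ch02/

# Equivalently, launching from the repository root would keep the original
# path: spark-submit ch02/pyspark_mongodb.py with "data/example.csv"
```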

peopzen commented 6 years ago

Running this project on EC2 is not cheap. Following the setup process, I ran an EC2 instance for 141 hours, and it charged me $64 last month. I thought I had just set up a free account for 12 months, so how can AWS charge me? The project setup also takes 20 GB of disk space. That is still fine, just under the 30 GB free tier, but I wonder why such a simple project (a couple of applications and some downloaded data) needs 20 GB.

rjurney commented 5 years ago

@peopzen You shouldn't leave servers running when you aren't using them, as that runs up charges. We need 20GB of disk because otherwise we run out of space while using Spark.
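
For reference, a back-of-the-envelope check on that bill (a sketch from the figures reported above; the instance type is an assumption, so check whatever script launched the instance, e.g. ec2.sh, for the exact type):

```python
# Rough effective hourly rate implied by the reported charges.
hours = 141
bill_usd = 64.0
print(f"${bill_usd / hours:.3f}/hour")  # ~$0.454/hour

# The 12-month free tier covers only t2.micro-class instances (roughly a
# cent an hour on demand), so a ~$0.45/hour rate implies the instance this
# project launches is a larger, non-free-tier type.
```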

rjurney commented 5 years ago

@sewald101 Glad you figured it out.