huilisabrina / covid-19-simul

Efficient simulation and graphical modeling of Covid-19 spread

Quick start guide for using GraphFrames on AWS #2

Open huilisabrina opened 4 years ago

huilisabrina commented 4 years ago

Hi @smwu @intekhab8 @beancamille ,

As you already know, after countless rounds of trial and error, I finally got GraphFrames to work on AWS(!) The main issue was that there are quite a few different versions of GF. I had to test a few of them before I settled on the one that works for our purposes.

If you log on to your AWS instance and install Spark (infrastructure guide I9), running the following command to start Spark should work:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

So far I have only been working with this package on a single-node local Spark. We'll have to test whether this still works once we move to EMR later on. I haven't found the best solution yet, but maybe we can just add this line to our Python script before spark-submit:

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = ("--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 pyspark-shell")
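Here is a minimal sketch of that idea, in case we end up needing more than one package. The helper name `set_submit_args` and its `env` parameter are hypothetical, not part of PySpark. As far as I understand, the variable has to be set before the `SparkContext` is created, since that is when the JVM gets launched, and I have not tried this on EMR.

```python
import os

# Maven coordinate quoted earlier in this thread.
GRAPHFRAMES_PACKAGE = "graphframes:graphframes:0.6.0-spark2.3-s_2.11"


def set_submit_args(packages, env=None):
    """Hypothetical helper: store a --packages argument string in PYSPARK_SUBMIT_ARGS."""
    if env is None:
        env = os.environ
    # --packages takes a comma-separated list of Maven coordinates.
    args = "--packages {} pyspark-shell".format(",".join(packages))
    env["PYSPARK_SUBMIT_ARGS"] = args
    return args


# Call this before creating the SparkContext.
submit_args = set_submit_args([GRAPHFRAMES_PACKAGE])
```

If the script is launched through `spark-submit` itself, passing `--packages` on the command line is probably the simpler route.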

Here are some packages that you could import to play around with GraphFrames. Also included are the common SQL + Spark packages. Note that we may want to import Spark functions by explicit name, or under an alias, to avoid conflicts with Python's built-in functions (e.g. sum()).

# import libraries
from graphframes import *
from graphframes import graphframe as GF
from graphframes.lib import AggregateMessages as AM
from graphframes.examples import Graphs

# SQL + Spark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, lit, udf, when, concat, collect_list
from pyspark.sql.functions import sum as fsum
from pyspark.sql.types import *
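To illustrate the naming point, here is a small sketch of why the alias helps. The function `aggregate_counts` and the numbers are made up for illustration; the Spark import sits inside the function only so that the snippet can be read and run without a Spark session.

```python
import builtins


def aggregate_counts(df, group_col, value_col):
    """Sum value_col within each group of group_col using Spark's column-wise sum."""
    # Imported under an alias, so the plain name `sum` is left untouched.
    from pyspark.sql.functions import sum as fsum
    return df.groupBy(group_col).agg(fsum(value_col).alias("total"))


# Python's built-in sum still works on ordinary iterables.
assert sum is builtins.sum
daily_cases = [3, 5, 8]  # made-up numbers
total_cases = sum(daily_cases)
```

Had we written `from pyspark.sql.functions import *` instead, the bare name `sum` would refer to Spark's column function and would no longer add up a plain Python list.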

Also, we want to suppress the excessive printing of log information:

conf = SparkConf().setAppName('cluster_run')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sql_context = SQLContext(sc)
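Once the context is up, a quick smoke test could look like the sketch below. The toy vertices and edges are invented for illustration, and `build_toy_graph` and `edges_are_consistent` are hypothetical helper names. GraphFrames expects an `id` column on the vertex DataFrame and `src` / `dst` columns on the edge DataFrame.

```python
# Toy data (made up for illustration): three people and who was in contact with whom.
VERTICES = [("a", "Alice"), ("b", "Bob"), ("c", "Carol")]
EDGES = [("a", "b", "contact"), ("b", "c", "contact")]


def edges_are_consistent(vertices, edges):
    """Check that every edge endpoint appears among the vertex ids."""
    ids = {row[0] for row in vertices}
    return all(src in ids and dst in ids for src, dst, _ in edges)


def build_toy_graph(sql_context):
    """Build a GraphFrame from the toy data.

    Needs Spark to have been started with the graphframes package loaded.
    """
    # Imported lazily so the plain-Python parts above work without Spark.
    from graphframes import GraphFrame
    v = sql_context.createDataFrame(VERTICES, ["id", "name"])
    e = sql_context.createDataFrame(EDGES, ["src", "dst", "relationship"])
    return GraphFrame(v, e)
```

With the `sql_context` defined above, `g = build_toy_graph(sql_context)` followed by `g.inDegrees.show()` should print the in-degree of each vertex if the package loaded correctly.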

Lastly, I literally went through this entire guide in half an hour. It's very basic, and I don't think we need all of it for our project. That said, it's definitely useful for getting a sense of what the language is like. Let me know if you have any questions.

Best, Hui