harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0
534 stars · 149 forks

How to run the NYC Taxicab analysis notebook with Azure DataBricks #236

Open · gadgetman4u opened this issue 5 years ago

gadgetman4u commented 5 years ago

I would like to run the NYC Taxicab analysis notebook on Azure Databricks, but the data is hosted in S3. How do I get the data into Azure? Should I save it to Azure Data Lake Store and then mount that to Databricks?

Thanks.

gadgetman4u commented 5 years ago

I have already saved the neighborhoods.geojson file to Azure Data Lake Store and passed its path to dbutils.fs.mount. How do I load the neighborhoods and trips as in the code here?
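For reference, mounting an Azure Data Lake Store (Gen1) account to DBFS follows the pattern below. This is a sketch based on the standard Databricks mount procedure, not code from the notebook; the service-principal values and the store name are placeholders you would replace with your own, and the exact config keys depend on your Databricks runtime version:

```scala
// Hypothetical credentials — substitute your own Azure AD service principal.
val configs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id"                  -> "<application-id>",
  "dfs.adls.oauth2.credential"                 -> "<client-secret>",
  "dfs.adls.oauth2.refresh.url" ->
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)

// Mount the store under the path the notebook's load() calls expect.
dbutils.fs.mount(
  source      = "adl://<your-store-name>.azuredatalakestore.net/",
  mountPoint  = "/mnt/nyctaxicabanalysis",
  extraConfigs = configs
)
```

Once mounted, paths such as `/mnt/nyctaxicabanalysis/trips/*` in the snippets below resolve against the Data Lake Store.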

```scala
val trips = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("comment", "V")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .load("/mnt/nyctaxicabanalysis/trips/*")
  .withColumn("point", point($"pickup_longitude", $"pickup_latitude"))
  .cache()
```

```scala
val neighborhoods = sqlContext.read
  .format("magellan")
  .option("type", "geojson")
  .load("/mnt/nyctaxicabanalysis/neighborhoods/")
  .select($"polygon", $"metadata"("neighborhood").as("neighborhood"))
  .cache()
```

Thanks.

gadgetman4u commented 5 years ago

Does anybody know how I can upload the data into Azure so I can extract the neighborhoods and trips?
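One common route is to stage the files locally (or on a VM) and copy them across with the AWS CLI and AzCopy. This is a sketch, not a tested recipe: the bucket, account, and filesystem names are placeholders, both CLIs must be installed and authenticated, and AzCopy v10 targets Blob Storage / ADLS Gen2 endpoints (for ADLS Gen1 you would use the Azure Portal or `az dls fs upload` instead):

```shell
# 1. Pull the trip data down from the source S3 bucket
#    (requires configured AWS credentials; bucket name is hypothetical).
aws s3 cp "s3://<source-bucket>/nyctaxi/trips/" ./trips/ --recursive

# 2. Push it into Azure storage with AzCopy
#    (account and filesystem names are hypothetical).
azcopy copy "./trips/" \
  "https://<account>.dfs.core.windows.net/<filesystem>/trips/" --recursive
```

After the copy, mount the target container in Databricks and the `load("/mnt/...")` paths in the snippets above will work unchanged.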

guiferviz commented 5 years ago

I found this today; it may be interesting for you (I'm not the author): https://lamastex.github.io/scalable-data-science/sds/2/2/db/032_NYtaxisInMagellan.html. It only works for me on the Databricks runtime with Spark 2.1.1.