locationtech-labs / geopyspark

GeoTrellis for PySpark
Other
179 stars 59 forks source link

WIP: GeoMesa Support and GeoPySpark Refactoring #650

Open aheyne opened 6 years ago

aheyne commented 6 years ago

This is a WIP PR to drive comments and discussion about the proposed merge of geomesa_pyspark module of GeoMesa into GeoPySpark and the subsequent promotion of GeoPySpark from just GeoTrellis/VectorPipe Python bindings to a project that brings general Geo* support to Python/PySpark.

Major thing of note and discussion points.

A beta release of this is available here. The GeoMesa pre-release that includes the complementary code is here with my working branch here. The changes in GeoMesa are to bind the pyUDT function in AbstractGeometryUDT to geopyspark.GeometryUDT. This fixes the Non issue we've been seeing.

I've included this in the README but for demonstration and reference purposes I'll include here a code sample of how to pull features from GeoMesa. This uses the YARN packaging mentioned above.

import geopyspark as gps
from pyspark import SparkContext
from pyspark.sql import SQLContext

conf = gps.geopyspark_conf(appName="test", master="yarn")
sc = SparkContext(conf=conf)
spark = SQLContext(sc)

params = { "hbase.catalog": "catalog" }
feature = "gdelt"
df = spark.read\
    .format("geomesa")\
    .options(**params)\
    .option("geomesa.feature", feature)\
    .load()

df.createOrReplaceTempView("gdelt")

spark.sql("select * from gdelt where st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom) limit 10").show()
+---------+--------------------+-------------------+--------------------+
|eventCode|          actor1Name|                dtg|                geom|
+---------+--------------------+-------------------+--------------------+
|      042|              TAIWAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      043|             VATICAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      161|SAO TOME AND PRIN...|2017-01-01 00:00:00|POINT (6.73333 0....|
|      042|              TAIWAN|2017-01-01 00:00:00|POINT (6.73333 0....|
|      043|              POLICE|2017-01-01 00:00:00|POINT (5.4851 5.4...|
|      160|CONSTITUTIONAL COURT|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      173|              PRISON|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      173|   EQUATORIAL GUINEA|2017-01-01 00:00:00|POINT (8.78333 3.75)|
|      042|            CAMEROON|2017-01-01 00:00:00|POINT (9.241 4.1527)|
|      051|             NIGERIA|2017-01-01 00:00:00|POINT (6.08333 4.75)|
+---------+--------------------+-------------------+--------------------+
jbouffard commented 6 years ago

@aheyne Is this ready for review? It's marked as WIP, so I wasn't sure if there's anything else you'd like to add to this PR.

aheyne commented 6 years ago

@jbouffard I'm planning on squashing and putting up a better PR if we decide to move forward on this. Interest is unclear to me and there are some outstanding questions; mainly the four discussion points above. There is also the more fundamental discrepancy in API level we're utilizing in the two projects (GeoPySpark being more RDD focused and GeoMesa_PySpark being Dataframe focused).

At the very least I think this is a good starting point to get a Python environment with both GeoMesa and GeoTrellis playing nice together. If we're okay with that and want to review/merge it so we can continue this line of development that'd be awesome.

jbouffard commented 6 years ago

@aheyne Ah, I see. At least on our end, I know there's definite interest in seeing GeoMesa_PySpark integrated into GPS. I've requested some time in these next few weeks to focus on this integration. I think that would be a good time to discuss your 4 points and the API discrepancies.

I'm okay with having this PR be the initial starting point for this integration. We could just mark it as experimental until a later point. What are your opinions @echeipesh @jpolchlo?