syoummer / SpatialSpark

Big Spatial Data Processing using Spark
http://simin.me/projects/spatialspark/
Apache License 2.0

Return more than just the key pairs #11

Open jomach opened 8 years ago

jomach commented 8 years ago

Hi,

It would be nice if the join could return some kind of custom IDs. Otherwise we have to move all the data around again whenever we want, for example, to append some columns.

Example:

val matchedPairs = BroadcastSpatialJoin(sc, leftGeometryById, rightGeometryById, SpatialOperator.Intersects, 0.0)

After this call I need to join all the data again. The join should take this into account.

syoummer commented 8 years ago

Based on several of my use cases, broadcasting everything instead of just the ids degrades overall performance. From my observation, returning matched pairs instead of whole datasets can significantly improve performance, because I/O overhead dominates. As a result, relational joins are required to produce the final results. However, compared with spatial joins, relational joins are relatively efficient.

@velvia also pointed this out and suggested leveraging some features of the DataFrame APIs. I have already implemented some of this using the DataFrame APIs, and hopefully the next update will return real data instead of matched ids.
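To make the "matched pairs first, relational joins afterwards" flow concrete, here is a minimal sketch. It uses plain Scala collections as stand-ins for RDDs so that it runs without a Spark dependency; on real RDDs the two map lookups would be keyed `join` calls. `LeftRow`, `RightRow`, `RejoinSketch`, and the sample values are hypothetical and not part of SpatialSpark.

```scala
// Hypothetical attribute records; names and fields are illustrative only.
final case class LeftRow(id: Long, name: String)
final case class RightRow(id: Long, county: String)

object RejoinSketch {
  // Stand-in for the (leftId, rightId) pairs returned by a spatial join.
  val matchedPairs: Seq[(Long, Long)] = Seq((1L, 10L), (2L, 10L), (3L, 20L))

  val leftRows: Seq[LeftRow] =
    Seq(LeftRow(1L, "a"), LeftRow(2L, "b"), LeftRow(3L, "c"))
  val rightRows: Seq[RightRow] =
    Seq(RightRow(10L, "Kings"), RightRow(20L, "Queens"))

  // Index both sides by id; with RDDs these would be key-value RDDs.
  val leftById: Map[Long, LeftRow] = leftRows.map(r => r.id -> r).toMap
  val rightById: Map[Long, RightRow] = rightRows.map(r => r.id -> r).toMap

  // One relational lookup per side for every matched pair.
  // Pairs whose id is missing on either side are dropped.
  val result: Seq[(String, String)] = for {
    (leftId, rightId) <- matchedPairs
    left <- leftById.get(leftId)
    right <- rightById.get(rightId)
  } yield (left.name, right.county)
}
```

Only the small id pairs come out of the spatial step; the attribute columns are attached afterwards by id, which is the trade-off described above.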

jomach commented 8 years ago

For example, I need to append information from one data set to another, like this:

Data set A: A.column1, A.column2, ...
Data set B: B.column1, ...

Result: A.column1, A.*, B.column1

At the moment I'm performing 3 joins to do this, which is time-consuming. As a start, the broadcast index should at least return an ID pair such as (left_key, right_real_id).

Cheers Jorge

velvia commented 8 years ago

I would propose this API:

The right side should specify an ID column. The join should be a function.

Therefore:

val spatialIndex = broadCast(rightSideData, geoColumn="_geom", idColumn="_id")
val leftSideDataWithJoin = leftSideData.withColumn("_id", spatialJoin(spatialIndex, "latColumn", "lonColumn"))
// ^^ adds a column which contains the geometry ID from the right side,
//    spatially joined quickly using broadcast STRtrees just like today

// Now, you can use regular joins with the right side table to add columns

sqlContext.sql("select l.colA, l.colB, r.street, r.county, r.zipcode from l, r WHERE l._id = r._id")
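As a rough illustration of what the proposed `spatialJoin` column would compute, the sketch below looks up the right-side id for each left-side point. Everything in it is an assumption made for illustration: `Envelope`, `Point`, `IdColumnSketch`, and `spatialJoinId` are hypothetical names, bounding boxes stand in for real geometries, and a linear scan stands in for the broadcast tree index.

```scala
// Hypothetical types for illustration; not SpatialSpark's API.
final case class Envelope(minX: Double, minY: Double, maxX: Double, maxY: Double) {
  def contains(x: Double, y: Double): Boolean =
    x >= minX && x <= maxX && y >= minY && y <= maxY
}
final case class Point(name: String, lon: Double, lat: Double)

object IdColumnSketch {
  // Stand-in for the broadcast index: (right-side id, bounding box).
  val index: Seq[(String, Envelope)] = Seq(
    "r1" -> Envelope(0.0, 0.0, 10.0, 10.0),
    "r2" -> Envelope(20.0, 20.0, 30.0, 30.0)
  )

  // Linear scan in place of a tree query; returns the first matching id, if any.
  def spatialJoinId(lon: Double, lat: Double): Option[String] =
    index.collectFirst { case (id, env) if env.contains(lon, lat) => id }

  val left: Seq[Point] =
    Seq(Point("a", 5.0, 5.0), Point("b", 25.0, 21.0), Point("c", 50.0, 50.0))

  // Equivalent of adding an "_id" column to the left side.
  val withId: Seq[(String, Option[String])] =
    left.map(p => p.name -> spatialJoinId(p.lon, p.lat))
}
```

Left rows with no match carry `None` here; whether unmatched rows should be kept or dropped is a design choice the proposal above leaves open.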
