Open jomach opened 8 years ago
Broadcasting everything instead of just ids degrades overall performance in several of my use cases. From my observation, returning matched pairs instead of whole datasets can significantly improve performance (because I/O overhead dominates). As a result, relational joins are required to produce the final results. However, compared with spatial joins, relational joins are relatively efficient.
@velvia also pointed this out and suggested leveraging some features of the DataFrame APIs. I have already done some implementation work using the DataFrame APIs, and hopefully the next update will return real data instead of matched ids.
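To make the "matched pairs, then relational join" idea concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. The record types `LeftRow` and `RightRow`, their fields, and the `attach` helper are all hypothetical names chosen for illustration; the two hash lookups only stand in for the relational joins and are not how the library does it.

```scala
// Hypothetical records; names and fields are illustrative only.
final case class LeftRow(id: Long, colA: String)
final case class RightRow(id: Long, street: String)

object MatchedPairsJoin {
  // Attach attributes to (leftId, rightId) pairs produced by a spatial join.
  // The two map lookups play the role of the cheap relational joins.
  def attach(
      pairs: Seq[(Long, Long)],
      left: Seq[LeftRow],
      right: Seq[RightRow]
  ): Seq[(LeftRow, RightRow)] = {
    val leftById = left.map(r => r.id -> r).toMap
    val rightById = right.map(r => r.id -> r).toMap
    pairs.flatMap { case (l, r) =>
      // Pairs whose ids are missing on either side are dropped (inner join).
      for {
        lRow <- leftById.get(l).toList
        rRow <- rightById.get(r).toList
      } yield (lRow, rRow)
    }
  }
}
```

Only the small id pairs need to move between the spatial step and this step; the full rows are looked up afterwards.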
As input, for example, I need to append information from one data set to another. Like:
Data set A: A.column1, A.column2, ... Data set B: B.column1, ...
Result: A.column1, A*, B.column1.
So at the moment I'm performing 3 joins to do this, and this is time consuming. At least for a start, the broadcast index should return an ID pair such as (left_key, right_real_id).
Cheers Jorge
I would propose this API:
The right side should specify an ID column. The join should be a function.
Therefore:
val spatialIndex = broadCast(rightSideData, geoColumn="_geom", idColumn="_id")
val leftSideDataWithJoin = leftSideData.withColumn("_id", spatialJoin(spatialIndex, "latColumn", "lonColumn"))
// ^^ adds a column which contains the geometry ID from the right side, spatially joined quickly using broadcast SRTrees just like today
// Now, you can use regular joins with the right side table to add columns
sqlContext.sql("select l.colA, l.colB, r.street, r.county, r.zipcode from l, r WHERE l._id == r._id")
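As a rough illustration of the shape of this proposal, the sketch below models the "add a right-side id column" step in plain Scala. Everything here is an assumption made for the example: `Box`, `SpatialIndex`, and `withRightId` are hypothetical names, right-side geometries are simplified to axis-aligned boxes, and a linear scan replaces the broadcast SRTree. It is not the SpatialSpark API.

```scala
// Axis-aligned bounding box standing in for a right-side geometry (assumption).
final case class Box(id: Long, minLon: Double, minLat: Double, maxLon: Double, maxLat: Double) {
  def contains(lon: Double, lat: Double): Boolean =
    lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat
}

// Hypothetical index: a linear scan is used here instead of an SRTree.
final class SpatialIndex(boxes: Seq[Box]) {
  // Returns the id of the first box containing the point, if any.
  def lookup(lon: Double, lat: Double): Option[Long] =
    boxes.find(_.contains(lon, lat)).map(_.id)
}

object SpatialJoinSketch {
  // Plays the role of the proposed withColumn("_id", spatialJoin(...)) step:
  // each left row keeps its own fields and gains the matching right-side id.
  def withRightId[A](left: Seq[A], index: SpatialIndex)(
      lonLat: A => (Double, Double)
  ): Seq[(A, Option[Long])] =
    left.map { row =>
      val (lon, lat) = lonLat(row)
      (row, index.lookup(lon, lat))
    }
}
```

With the id attached to each left row, the remaining columns can come from an ordinary equi-join on that id, which is the point of the proposal.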
Hi,
It would be nice if the join supported some kind of custom ids. Otherwise we need to broadcast all the data around if we want, for example, to append some data.
Example:
val matchedPairs = BroadcastSpatialJoin(sc, leftGeometryById, rightGeometryById, SpatialOperator.Intersects, 0.0)
Now I need to join all the data again... the join should take this into account.