I wanted to pre-generate the index for a very large set of polygons (loaded from a shapefile) and store it as Parquet so that I can reuse it in frequent production processes, but it seems that the ZOrderCurve-typed column named "index" is ignored when joining the Parquet data with a list of points.
import org.apache.spark.sql.types._
import magellan.{Point, Polygon}
import org.apache.spark.sql.magellan.dsl.expressions._

// Schema for the incoming points (latitude/longitude pairs).
val schema = new StructType(Array(
  StructField("latitude", DoubleType, false),
  StructField("longitude", DoubleType, false)
))
val sample = spark.read.schema(schema).option("header", true).csv("./sample.csv.gz")

// Enable Magellan's spatial join optimization rules.
magellan.Utils.injectRules(spark)

// The "shapes" table was pre-generated once from the shapefile like this:
// spark.read.format("magellan").load("s3://myBucket/my_shapefile_folder")
//   .withColumn("index", $"polygon" index 15)
//   .selectExpr("polygon", "index", "metadata.ID AS id")
//   .write.saveAsTable("shapes")

sample.join(spark.table("shapes"), point($"longitude", $"latitude") within $"polygon").explain()
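One way to check whether the Parquet round trip is what breaks the index pickup is to compare the schema of a freshly indexed DataFrame against the persisted table. This is a minimal sketch using only the calls already shown above plus printSchema(); the S3 path is the placeholder from the snippet, not a real location:

val fresh = spark.read.format("magellan").load("s3://myBucket/my_shapefile_folder")
  .withColumn("index", $"polygon" index 15)
fresh.printSchema()                 // nullability as produced by the indexer
spark.table("shapes").printSchema() // nullability after the Parquet round trip

If the two printouts differ only in the nullable flags of the "index" column, that would be consistent with the schema-mismatch diagnosis below.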
@zebehringer can you give this PR a try? The issue, I think, is that the column nullability is reset (a bug in Spark SQL) when Spark SQL writes to Parquet, and when we read it back this causes a schema mismatch.
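If that diagnosis is right, a possible stopgap until the fix lands is to rebuild the DataFrame read from Parquet with the schema a fresh indexing run produces. This is only a sketch, assuming the data itself is intact and only the nullability metadata differs; withExpectedSchema is a hypothetical helper, not part of Magellan or Spark:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

def withExpectedSchema(df: DataFrame, expected: StructType): DataFrame =
  // createDataFrame only swaps in the new schema metadata (e.g. nullable
  // flags); the underlying rows are unchanged.
  df.sparkSession.createDataFrame(df.rdd, expected)

// Hypothetical usage, with `fresh` from the printSchema check above:
// val shapes = withExpectedSchema(spark.table("shapes"), fresh.schema)

Note this forces a round trip through the underlying RDD, so it costs a re-evaluation; it is a diagnostic workaround, not a production fix.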
Here's the plan: