harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0
533 stars, 149 forks

write results back to geojson, shapefile etc? #128

Open jim2 opened 7 years ago

jim2 commented 7 years ago

Magellan is great! Is there a way to write back to geojson / shapefile? I have figured out how to write csv (without geometries), but I'd love to be able to output my geometries.

Something like:

df.write
  .format("magellan")
  .option("type", "geojson")
  .save("results")

harsha2010 commented 7 years ago

@jim2 thanks! Is there a reason not to write back as Parquet if you want to store the geometry? Basically, GeoJSON is a very inefficient format, and if the results of an entire table have to be written into it, we need to shoehorn all non-geometric columns into a properties map, which might not be the best thing to do. Shapefiles are even worse IMHO: the non-geometric columns would have to be written out in DBF format, which is archaic and difficult to write.

Writing it out as Parquet doesn't require any further effort today. However, it might not be compatible with legacy systems that only read GeoJSON/Shapefiles.

I could do one thing: for the GeoJSON format, I could output the raw JSON as a column, which you can choose to keep and write out. Then if you write it as CSV/Parquet, one of the columns would be the GeoJSON string. Let me know if this works.

jim2 commented 7 years ago

Correct, it's for portability w/ legacy systems. Parquet is my approach for persisting data in HDFS and later using it w/ Magellan, but if I need to move some geometries + attributes back into some other system, I am currently just exporting an attribute CSV and joining it to the original geometry in my GIS. It would be better if I could just output from Magellan. I like your idea of an extra column for GeoJSON; that would work for me. I tried df.write.format("json").option("type", "geojson").save("test"), which worked but saved as raw JSON, not GeoJSON.
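To make the "GeoJSON string as an extra column" idea concrete, here is a minimal sketch in plain Scala of rendering a geometry as a GeoJSON geometry string. The `GeoJsonGeometry` object and its method names are hypothetical and are not part of Magellan; how the coordinates are pulled out of a Magellan `Point` or `Polygon` is deliberately left out, since that depends on Magellan's API.

```scala
// Hypothetical helper, NOT part of Magellan: renders coordinates as a
// GeoJSON geometry string suitable for storing in an ordinary string column.
object GeoJsonGeometry {

  // A single position, x = longitude, y = latitude.
  def point(x: Double, y: Double): String =
    s"""{"type":"Point","coordinates":[$x,$y]}"""

  // A polygon with one exterior ring and no holes.
  def polygon(ring: Seq[(Double, Double)]): String = {
    // GeoJSON linear rings must be closed: the last position repeats the first.
    val closed =
      if (ring.nonEmpty && ring.head != ring.last) ring :+ ring.head else ring
    val coords = closed.map { case (x, y) => s"[$x,$y]" }.mkString("[", ",", "]")
    // "coordinates" is an array of rings, so the ring is wrapped once more.
    s"""{"type":"Polygon","coordinates":[$coords]}"""
  }
}
```

In Spark, a function like this could presumably be wrapped with `org.apache.spark.sql.functions.udf` and applied to produce a string column, which then survives a CSV or Parquet write like any other string. The sketch does not guard against `NaN` or infinite coordinates, which are not valid JSON numbers.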

Charmatzis commented 7 years ago

@harsha2010 I believe that exporting to GeoJSON would be a very nice feature, because after exporting your data to GeoJSON you can import it into any DB with a spatial extension (Postgres/PostGIS, SQL Server, Oracle, MySQL, etc.) using the ogr2ogr utility.

Something like this would be great:

df.write
  .format("magellan")
  .option("type", "geojson")
  .save("results")

or, going even further:

df.write
  .format("magellan")
  .partitionBy("some_property")  // e.g. "id"
  .option("type", "geojson")
  .save("results")

I asked a similar question on Stack Overflow: https://stackoverflow.com/questions/44500233/export-a-datasetsomeclass-to-geojson-format-in-spark/44521522#44521522

harsha2010 commented 7 years ago

Thanks @jim2, @Charmatzis. I can put something quick together here. Would one of you be willing to test it out a bit and give feedback before it is merged?

jim2 commented 7 years ago

I'd be happy to test and offer feedback. Thanks!

harsha2010 commented 7 years ago

Thanks @jim2. I have a design question here: if we write a data frame as GeoJSON, what do you do with the non-spatial columns? For example, if you have [polygon, neighborhood, colA, colB, ...], do you write out colA, colB, etc. as properties attached to the feature collection? Would that be the expected behavior?

jim2 commented 7 years ago

Yes @harsha2010, I'd like it to look like this

https://gist.github.com/jim2/63527f159a50635c120b9fa8f64b5288

with the nonspatial properties from the dataframe included.
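As a rough illustration of that layout (not the contents of the gist, and not Magellan code), the sketch below puts each row's non-spatial columns into the `properties` object of a GeoJSON Feature and wraps the features in a FeatureCollection. The `GeoJsonFeatures` object and its methods are hypothetical names, and the geometry is assumed to arrive as an already-serialized GeoJSON geometry string.

```scala
// Hypothetical helper, NOT part of Magellan: assembles GeoJSON Features
// from a geometry string plus the row's non-spatial column values.
object GeoJsonFeatures {

  // Quote a string as a JSON string literal, escaping quotes, backslashes,
  // and control characters.
  private def quote(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'          => sb.append("\\\"")
      case '\\'         => sb.append("\\\\")
      case '\n'         => sb.append("\\n")
      case '\r'         => sb.append("\\r")
      case '\t'         => sb.append("\\t")
      case c if c < ' ' => sb.append("\\u%04x".format(c.toInt))
      case c            => sb.append(c)
    }
    sb.append('"').toString
  }

  // Render a column value as JSON; unknown types fall back to a quoted toString.
  private def value(v: Any): String = v match {
    case null       => "null"
    case s: String  => quote(s)
    case b: Boolean => b.toString
    case n: Int     => n.toString
    case n: Long    => n.toString
    case n: Double  => n.toString
    case other      => quote(other.toString)
  }

  // One Feature: geometry plus the non-spatial columns as "properties".
  // A Seq of pairs keeps the column order stable in the output.
  def feature(geometryJson: String, properties: Seq[(String, Any)]): String = {
    val props = properties
      .map { case (k, v) => s"${quote(k)}:${value(v)}" }
      .mkString("{", ",", "}")
    s"""{"type":"Feature","geometry":$geometryJson,"properties":$props}"""
  }

  // Wrap already-serialized Features in a FeatureCollection.
  def featureCollection(features: Seq[String]): String =
    features.mkString("""{"type":"FeatureCollection","features":[""", ",", "]}")
}
```

For a row with columns [polygon, neighborhood, colA], this would be called as `feature(geometryJson, Seq("neighborhood" -> ..., "colA" -> ...))`. Note that in GeoJSON the properties belong to each individual Feature rather than to the FeatureCollection itself, so every row carries its own copy of the non-spatial values.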

Charmatzis commented 7 years ago

That's why renaming the metadata class to properties would be a nice improvement: https://github.com/harsha2010/magellan/issues/139

harsha2010 commented 7 years ago

@Charmatzis that is an orthogonal issue; we can discuss it in #139.

@jim2 thanks, I think that makes sense... I'll put a skeleton PR together and ping you to test and give feedback.

pipwoet commented 7 years ago

@harsha2010 I was also looking for this feature, since I thought it was impossible to read back a Parquet file containing Magellan types. It would be nice if you could give us an example of how to read back a Parquet file containing Point and Polygon types.