ropensci / unconf16

rOpenSci's San Francisco hackathon/unconf 2016
http://unconf16.ropensci.org

geoplyr - dplyr style manipulation for geospatial data #22

Open eamcvey opened 8 years ago

eamcvey commented 8 years ago

I'm somewhat new to geospatial analysis, and while the tools in R can do all kinds of things, I feel like they operate the "old" R way, not the "new" way I'm now accustomed to from using dplyr, tidyr, and friends. I think there's room for a package that could make working with geospatial data easier and more elegant -- in particular, handling the sp objects intuitively.

karthik commented 8 years ago

:100: I would love a dplyr for spatial data. There are some tools in the rOpenSci suite from @sckott that we should discuss.

hrbrmstr commented 8 years ago

YES! @eamcvey want to help also with a GSoC 2016 R proposal? https://github.com/rstats-gsoc/gsoc2016/wiki/spatula:-a-sane,-user-centric-(in-the-mental-model-sense)-spatial-operations-package-for-R (geoplyr sounds way cooler than spatula)

hrbrmstr commented 8 years ago

FYI: Just learned about this from Roger: https://github.com/edzer/sfr. Super cool (and well-written) idea!

edzer commented 8 years ago

Rumours go that the ISC proposal for sfr might get funded -- of course, the ISC still has to announce this publicly first. @karthik and @eamcvey: as I use R mostly the "old" way, could you provide some use cases or mock-ups of how you would like the (new) sp classes to behave more intuitively?

eamcvey commented 8 years ago

@hrbrmstr The GSoC proposal is a great description of what I was thinking - did this get submitted? Some other examples of doing things the "new" way would be:

eamcvey commented 8 years ago

@edzer It looks like the sfr proposal has gotten funded. I'm working on mocking up how I wish geospatial data manipulation worked in R.

sckott commented 8 years ago

@eamcvey I think we're still waiting to hear back on whether @hrbrmstr's student's project gets picked as well.

eamcvey commented 8 years ago

Start of a mockup of how spatial analysis could look if it were possible to work with geometry columns in data frames: https://github.com/ropenscilabs/geoplyr/blob/master/ideal_mockups.Rmd

eamcvey commented 8 years ago

@edzer, I don't understand the simple features stuff entirely, but in my optimistic imagination it could provide the kinds of simple geometric objects (polygons, points, etc.) that would go into the geometry columns of data frames, as in the mockup above. If so, I think there would be huge benefit in fitting spatial objects into columns of data frames and then having access to the existing spectacular "new R" tools in dplyr, tidyr, and purrr to manipulate them (with special functions for operating on the geometry columns). Hadley assures me that there's no fundamental reason this isn't possible : )
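To make the idea concrete, here is a minimal sketch: the packages are real, but the data, the column names, and the choice of sp's Polygon class for the list-column are purely illustrative assumptions.

library(dplyr)
library(tibble)
library(purrr)
library(sp)

# two toy polygons stored as sp Polygon objects in a list-column
poly1 <- Polygon(cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0)))
poly2 <- Polygon(cbind(c(1, 2, 2, 1, 1), c(0, 0, 1, 1, 0)))

districts <- tibble(
  name     = c("A", "B"),
  pop      = c(100, 250),
  geometry = list(poly1, poly2)
)

districts %>%
  filter(pop > 150) %>%                        # ordinary dplyr verb on attributes
  mutate(area = map_dbl(geometry, ~ .x@area))  # per-geometry operation via purrr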

mdsumner commented 8 years ago

My efforts in this area are in two packages, gris and spbabel:

https://github.com/mdsumner/gris

https://github.com/mdsumner/spbabel

Gris is well-developed but I'm not happy with the overall design and user-view yet. It provides a db-like "normalized" structure for spatial objects in multiple linked tables. The point is that you can more easily work on the components (vertices, pieces*, objects) individually, generate other forms like edge-based or primitives-based meshes, and ultimately back-end it with a generic database.

Spbabel is simpler, and starts "in the middle" with something like the ggplot2::fortify (or raster::geom) table of vertices without enforcing uniqueness.

I'm trying to build it into a bigger story but these two blog posts are about as far as it goes:

http://mdsumner.github.io/2015/12/28/gis3d.html

http://mdsumner.github.io/2016/03/03/polygons-R.html

Very keen to explore this idea more. sp is fundamentally limiting in several ways (just as GIS is), but I'm not saying we should disown it; I feel we just need to be able to transform between different forms much more easily.

I'm still catching up with this discussion, just wanted to drop this in :)

I also have done some work on using dplyr with ODBC, which allows me to read in from Manifold GIS directly, amongst other things. I see this all fitting together really nicely with dplyr as the new centre.

https://github.com/mdsumner/dplyrodbc

https://github.com/mdsumner/manifoldr

Cheers, Mike.

mdsumner commented 8 years ago

@eamcvey do you have the data from your Ideal doc in concrete form? I'd like to work through your document and use it to explain how I see things. If you have those actual data and can share them, that would be awesome. This is helping me focus somewhat. :)

eamcvey commented 8 years ago

@mdsumner You're calling my bluff -- I don't actually have that data ; ) But if it's helpful, I can get it, or something quite similar, fairly easily. If the document is helping provide focus, then it's doing its job!

edzer commented 8 years ago

Thanks for the mockup, @eamcvey ! I agree with @mdsumner that we'd need some sample data with it in order to get more concrete.

For your information, sp::aggregate does aggregate polygon information, both for nested polygons (say, from districts to provinces) and for non-nested polygons (assuming a constant value throughout the polygons). Your last example would now look like

new_district_df <- aggregate(census_bg_df, list(new_district_df$assigned_district), sum)

which follows the stats::aggregate semantics. Pretty compact, and it dissolves polygons.
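For readers without the mockup data at hand, here is a self-contained toy version of that call; the block-group squares, the pop column, and the assigned_district values are made-up stand-ins for the mockup's data, and the grouping variable is taken from census_bg_df just to keep the example self-contained.

library(sp)

# two adjacent unit squares standing in for census block groups
p1 <- Polygons(list(Polygon(cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0)))), "bg1")
p2 <- Polygons(list(Polygon(cbind(c(1, 2, 2, 1, 1), c(0, 0, 1, 1, 0)))), "bg2")

census_bg_df <- SpatialPolygonsDataFrame(
  SpatialPolygons(list(p1, p2)),
  data.frame(pop = c(100, 250),
             assigned_district = c("D1", "D1"),
             row.names = c("bg1", "bg2"))
)

# sum the numeric attributes per district and dissolve the polygons
new_district_df <- aggregate(census_bg_df["pop"],
                             by  = list(district = census_bg_df$assigned_district),
                             FUN = sum)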

Anyway, it would be great if you could for instance provide a census_bg_df shapefile to start with.

mdsumner commented 8 years ago

@eamcvey you must be motivated by real-world data here so of course it's helpful to have actual examples - I don't consider it bluffing here :)

Thinking about the "geometry column" thing: I think that's pretty easy to do, but what I don't like about it is that it doesn't naturally provide a topological data structure. There's no way to share vertices between objects; they all just get copied out in a recursive structure, either as text or as a binary blob -- you might as well serialize a Polygons object, for example, and store that in a column. It's not hard, it just doesn't really help from my perspective. Topology is what is missing from sp and from most GIS implementations. Also, "Polygons" are really just lines with a fill rule, so you can't pop them out into X-Y-Z; we really need proper surfaces that can be decomposed to triangles, with "polygons" defined by cycles in the mesh as a special case.

There's no way of avoiding the need for at least two tables: one for the vertices together with the identifiers for object, part, hole status, and path ordering, and one for the objects. I just like to take it further, so you can really "normalize" and have vertices (x, y, z, time, etc., with no limit) plus an ID; for that you need at least a vertex table, a branches (or "parts" or "pieces") table, and the objects. To normalize the vertices (store only unique rows) you also need a vertex-link-branches table.
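A minimal sketch of that linked-table layout, with invented column names (gris' actual schema will differ):

library(tibble)

objects  <- tibble(object_ = c(1, 2), name = c("A", "B"))

branches <- tibble(branch_ = 1:3,
                   object_ = c(1, 1, 2),           # which object each piece belongs to
                   hole    = c(FALSE, TRUE, FALSE))

vertices <- tibble(vertex_ = 1:4,
                   x = c(0, 1, 1, 0),
                   y = c(0, 0, 1, 1))

# link table: vertices shared between branches are stored once in `vertices`,
# and the path ordering lives here
branch_vertex <- tibble(branch_ = c(1, 1, 1, 2),
                        vertex_ = c(1, 2, 3, 3),
                        order   = c(1, 2, 3, 1))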

Gris does this, but it's not dplyr-able yet -- I'm working on that. Gris should offer a choice of "topology" model: it has branches (the poly-ring, line-string, point, multi-point stuff) and primitives (triangles, and/or edges for lines), and it should also have edges (line segments for polygons or lines) and the ability to switch between them. The constrained Delaunay triangulation in RTriangle is so fast that I think it's worth doing all of this upfront. Then the user can go further and decompose to smaller triangles, shorter line segments, triangles with nicer angles, etc. -- but the branches, edges and primitives should always be available. Not triangulating might be a special case, but it's easy enough anyway.
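As a quick illustration of that triangulation step (the square and the area parameter are made up, and this is RTriangle called directly rather than gris' API):

library(RTriangle)

# a unit square as a planar straight line graph: 4 vertices, 4 boundary segments
ps <- pslg(P = cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)),
           S = cbind(1:4, c(2, 3, 4, 1)))

tr <- triangulate(ps, a = 0.1)  # 'a' caps triangle area, forcing smaller triangles
tr$T                            # triangle index matrix, one row per triangle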

Spbabel is dplyr-pipeable, and has examples of working with the basic verbs on objects, and on the vertices using the sptable(x) <- trick (suggested by @hadley). I think the "vertex table in the middle" view is a better place to start than gris -- it's essentially the ggplot2 fortify table, plus the linked objects. I can go from that to the more-tables, more-normalized gris view, though for dplyr-ability that's probably not necessary.
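Roughly, that workflow looks something like the sketch below; wrld_simpl from maptools is just a convenient example layer, and the x_/y_ column names are my assumption about spbabel's vertex table, which may differ between versions.

library(spbabel)
library(dplyr)

data(wrld_simpl, package = "maptools")   # any SpatialPolygonsDataFrame will do

verts <- sptable(wrld_simpl)             # one row per vertex, fortify-like, as a tbl_df
verts %>% filter(y_ > 0)                 # ordinary dplyr verbs on the vertex table

# the replacement form pushes an edited vertex table back into the sp object
sptable(wrld_simpl) <- verts %>% mutate(x_ = x_ + 10)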

mdsumner commented 8 years ago

@edzer thanks for the aggregate example -- I actually forget that sp has some of this manipulation built in. I can see we could usefully have options for group_by() %>% summarize() that unioned objects together using this. I wonder if we need extra arguments to differentiate the summarize function(s) from the topological tasks, or if it's best done with new verbs? I need to try this out -- I'll be able to in the next few weeks, and as ever I'm very keen to hear from anyone interested in doing this.