r-spatial / rspatial_spark

This is the repo that sparked https://github.com/r-spatial

Any wisdom in creating an rspatial package analogous to tidyverse? #5

Open jhollist opened 7 years ago

jhollist commented 7 years ago

Given that support for spatial in R is spread across many packages, is there any wisdom or desire to create a single package, similar to what the tidyverse package does, for the suite of spatial packages (sf, sp, raster, rgdal, rgeos, etc.)?

ateucher commented 7 years ago

I think sf is a different beast in that list. rgdal, rgeos, and (parts of) raster are all designed for sp objects (the spverse I suppose 😉 ).

sf, on the other hand, encapsulates much of the functionality of rgdal and rgeos by itself, and I don't think there are any other packages that go along with it (yet). I think one thing that's missing so far is the sf version of many raster functions like intersect, union, etc. (i.e. extensions to GEOS functions that retain attributes).
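A minimal sketch of what such an attribute-retaining overlay looks like with sf. This assumes the sf API as it later shipped: st_intersection() on two sf objects combines the attributes of both inputs on the intersecting pieces, which is the behaviour being asked for here.

```r
# Sketch: attribute-retaining intersection with sf (assumes the sf API;
# st_intersection() in released sf now does this natively).
library(sf)

a <- st_sf(id_a = 1, geometry = st_sfc(st_polygon(list(
  matrix(c(0, 0, 2, 0, 2, 2, 0, 2, 0, 0), ncol = 2, byrow = TRUE)))))
b <- st_sf(id_b = 2, geometry = st_sfc(st_polygon(list(
  matrix(c(1, 1, 3, 1, 3, 3, 1, 3, 1, 1), ncol = 2, byrow = TRUE)))))

# The result carries the attributes of both inputs on the overlap,
# much like raster::intersect() does for sp objects
ab <- st_intersection(a, b)
ab  # one feature with columns id_a and id_b plus the overlap geometry
```
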

edzer commented 7 years ago

@ateucher well put. Could you open an issue at https://github.com/edzer/sfr Re: intersect / union etc? I'm still trying to figure out whether these can be rephrased in dplyr verbs when touching attributes: would union for instance match summarize?

tim-salabim commented 7 years ago

@ateucher I am in the process of adding sf support to mapview. There's no mapview support yet, but with the latest development version of mapview attached, leaflet should work for POINT, MULTIPOINT, POLYGON and MULTIPOLYGON geometries as well as objects of class 'sf' (thanks to @mdsumner).

library(mapview)
library(sf)
library(leaflet)  # leaflet(), addPolygons() etc. come from leaflet

outer = matrix(c(0, 0, 10, 0, 10, 10, 0, 10, 0, 0), ncol = 2, byrow = TRUE)
hole1 = matrix(c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1), ncol = 2, byrow = TRUE)
hole2 = matrix(c(5, 5, 5, 6, 6, 6, 6, 5, 5, 5), ncol = 2, byrow = TRUE)
pts = list(outer, hole1, hole2)
(pl1 = st_polygon(pts))

leaflet() %>% addPolygons(data = pl1)

leaflet() %>% addTiles() %>% addCircleMarkers(data = st_as_sf(breweries91))
jhollist commented 7 years ago

@ateucher Thanks for the clarification. Other than installing sf I haven't had a chance yet to actually dig into it so my question certainly reflected my ignorance. And given that description, sf may end up serving this function (sorta) on its own. At this point I think I have my answer!

ateucher commented 7 years ago

Thanks @edzer - I'll open an issue.

mdsumner commented 7 years ago

Despite this being closed I'm going to add these thoughts while they're fresh. Maybe it's enough for starting a story on its own here.

A concern I have is the fragmentation of methods across a lot of packages for "ingesting" or somehow re-structuring these specialist objects, objects that do have a very stable and strict hierarchical definition. My hope was for mdsumner/spbabel to be a place to collect general translator methods, but it suffers from (1) a heavy dependency list and (2) a lack of focus, because I am still learning about core concepts in primitives-based object structures.

Here are some of the decomposition/translations that occur that don't have centralized support (I know there are more):

Fortify is really an ugly lowest-common-denominator process that is now (being) moved to broom::tidy, and it can be thought of as the "single-table" version of an sp/sf hierarchy, where the object/part ID is spread across multiple rows of vertices. Nested lists can be nested to two levels (like sp) or three levels (like sf, where parent-island/hole-child parts are unambiguously linked), and be matrices/vectors (sf) or list-columns (leaflet). Note that fortify is really two tables: you are supposed to join on only the object/feature metadata that is needed for particular aesthetics, because you don't want to copy every feature's attribute onto every coordinate! I've got illustrations of these issues in mdsumner/spbabel, r-gris/rangl and r-gris/polyggon, and here https://mdsumner.github.io/oddhack/vignettes/objects-parts.html but it's unfinished and doesn't have coherence yet.
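The "single-table" decomposition described above can be sketched with sf itself: st_coordinates() flattens a geometry to one row per vertex, with the ring/part/feature hierarchy spread across L1/L2/... index columns, exactly the fortify-style layout being discussed.

```r
# Sketch: the fortify-like "single-table" view of a polygon with a hole.
library(sf)

outer <- matrix(c(0, 0, 10, 0, 10, 10, 0, 10, 0, 0), ncol = 2, byrow = TRUE)
hole  <- matrix(c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1), ncol = 2, byrow = TRUE)
pl <- st_sfc(st_polygon(list(outer, hole)))

# One row per vertex; the L1/L2 columns encode which ring and which
# feature each vertex belongs to -- the hierarchy spread down the rows
st_coordinates(pl)
```
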

I think a very useful shared package (for package developers) would be utilities to translate between these forms, to lay them out, describe where they are used, and provide methods for applying them in the most efficient ways. Some will be whole-object recursive lapplys in one big slurp, and some will be row-iterating or group-by applied; both are needed for some conversions, used in different contexts.

Currently, the trunk of spbabel has a "sptable" to fortify sf and sp (copied from raster::geom for raw speed, but with my naming scheme and data frame choices), and this is used to derive three- and four-table forms that normalize the two-table scheme above. mapview has methods for sf conversion to lists, and there are many other discussions going on without a central scheme.

Where do these "structural-translator" utilities and guides belong? They don't belong in sf, because they are also useful for non-simple-features applications, and they don't belong in leaflet or mapview, although there's no harm in having local copies in my opinion. To be really widely applied in the Imports sense, the package would need to be very low on dependencies; otherwise people can copy the raw utility function, knowing that it's a well-understood and discussed best-practice, efficient way to do that task.

Also see ongoing discussion around gg for sf, where ggplot2 aspires to an "ultimate solution" but there's still a need to find the right compromise for today before changing ggplot2: https://github.com/edzer/sfr/issues/88 @edzer @hadley

An example problem I see that would be helped by this central resource is the nested-list standard used by mapview/leaflet, which only has two levels of hierarchy, the same as sp. That means it's ambiguous which hole belongs to which island (though the order usually is isomorphic to the sf convention). This is something that would be completely opaque to many package developers, but it should be laid out and described. It is fine for the sf -> leaflet pathway, because the even-odd rule for inside/outside definition works no matter what the hierarchy is, but if edits are made in the browser the parent/child identity will be needed to rebuild the sf structure sensibly. Note that the NA-separated coordinate lists used by polygon/polypath are analogous to a single simple feature, i.e. only one row, one multipolygon, because the whole thing has the same aesthetics (fill/lty etc.), and there's no place to store the information about hole vs. island. (It's the same in grid, though that uses IDs rather than NAs to encode the separate pieces.)
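The two-level vs. three-level ambiguity can be shown in a few lines: an sf MULTIPOLYGON nests holes under their parent island, and flattening it one level (to the sp/leaflet-style list of rings) is exactly where that parentage is lost.

```r
# Sketch: losing island/hole parentage when flattening sf's three-level
# nesting to a two-level list of rings.
library(sf)

island1 <- matrix(c(0, 0, 4, 0, 4, 4, 0, 4, 0, 0), ncol = 2, byrow = TRUE)
hole1   <- matrix(c(1, 1, 1, 2, 2, 2, 2, 1, 1, 1), ncol = 2, byrow = TRUE)
island2 <- matrix(c(5, 5, 7, 5, 7, 7, 5, 7, 5, 5), ncol = 2, byrow = TRUE)

# sf MULTIPOLYGON: a list of polygons, each a list of rings; the first
# ring of each inner list is the island, subsequent rings its holes
mp <- st_multipolygon(list(list(island1, hole1), list(island2)))

# Flatten one level: now just 3 rings in a flat list, and nothing in the
# structure says which hole belonged to which island
rings <- unlist(unclass(mp), recursive = FALSE)
length(rings)  # 3
```
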

Another is the GeoJSON issue: currently the translators (geojsonio, rmapshaper) use writeGDAL/readGDAL, and this is something that could easily occur via GDAL file I/O. Maybe that's in the pipeline, and if so I think it's a good example of "common philosophy" that we should find a way to make widely available outside the scope of its original motivation. @ateucher @sckott

Finally, I feel responsibility to write this guide but I need some general agreement about the need for it and the hope that it will be used somehow.

Re "fortify": it's ugly but it's very powerful - it's an easy middle ground for normalizing a lot of different types, and it might be a good starting point for a "universal centre" - the two tables of fortify can be made much more efficient by normalizing to three and even four tables, this is kind of what ggvis does with nested lists of coordinates, but nested lists cannot apply vertex de-duplication.

tim-salabim commented 7 years ago

Given the extent of the discussion and the generality of the issue, I am reopening this.

I agree with @mdsumner that a central package for structural conversions of types is a good idea. However, I don't quite see a way to keep the dependencies (Imports) low. I actually don't feel it is necessarily a bad thing for a central conversion package to have many Imports (or am I misunderstanding something here?). The case for leaflet/mapview is a little special, as we rely on the expectations of an external library (leafletjs), so we can only address this on our side. For R-based workflows we can make sure that types are supported on all sides, but I still think that it is a good idea to avoid overhead and possible bug introduction by outsourcing conversions to an individual package.

ateucher commented 7 years ago

Lots of good points @mdsumner! FYI I am almost finished writing sf -> geojson methods for geojsonio, which I will in turn use in rmapshaper. I'm parsing sf[cg] objects into lists that mirror geojson structure (probably similar to what you did for mapview, further highlighting your point) and using jsonlite to convert to geojson. It is working well, though there are a few outstanding things to finish.

mdsumner commented 7 years ago

I am thinking now that each package should have the translator, but to a generic, classless set of tables that is generally understood. It's pretty much what sptable::map_table returns, but without the vertex de-duplication. Why not have a family of decomposition functions in each specialist package? There are a few types to consider, with pros and cons in different contexts: fortify, map_table, nested tibbles (doubly or singly nested versions of fortify, in one table), and the ggvis model (nested vectors). I'll stop thinking out loud and actually do this, but I think it makes sense: the specialist packages agree on an informal standard and each takes responsibility for "this is the best/fastest way to decompose me". I don't think there's one best decomposition form, but I'll think more about that.

The implication then is that translator packages can ingest any type in its decomposed form and build translation semantics in whatever way is best. The great thing about using data frames is they can be offloaded to / read from database at any time without specialist tools.

@ateucher is there a generic form of sf that you would prefer to start with, or that makes sense? I'll look

edzer commented 7 years ago

@mdsumner : I believe that classes are there to help convey and understand meaning - why would you want a class-less set of tables? Or do you mean data.frames with a class attribute set?

Although I try to follow your argumentation, to be completely honest, I haven't seen a strong use case for spbabel, or the concept behind it. It may be the case that we speak different languages. Which concrete problem does it solve now, or does it want to solve?

What we do lack in R is a topology representation of features (for rasters it is trivial). And so we cannot simplify while preserving topology: e.g. st_simplify (or gSimplify) will have a preserveTopology=TRUE argument, but that only preserves the topology of the single feature, not of the set of features (see here). Topology is needed to simplify or repair features; right now you'd need GRASS or PostGIS for this.

Another worry is the future of raster analysis: raster is being largely rewritten in C++ but still has to go through package rgdal for most I/O, a bottleneck, and a limitation to local disk-based storage. It has reimplemented most of the (sp/rgdal/rgeos) stuff wherever it considered it useless or dogmatic (union, intersect, but also print). Some thoughts here and here, but I realize it is a massive code base. We need a dplyr that proxies remote raster data cubes, like SciDBR does.

Coming back to the theme of this issue: this comment suggests to me that the tidyverse dogma comes from ggplot2's design and the choice to build on NSE. That doesn't go well with S4, which is indeed unfortunate.

rsbivand commented 7 years ago

Two tiny comments, so that they don't get lost:

a) how well does sf/tidyverse play with analysis (I can see the logic behind the dogma, and could also see how to add a "neighbours" list column to a data.frame-like object)?

b) Does this extend to space-time, or do we have to go time-wide to avoid replicating the geometry list column?
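Point (a) can at least be prototyped today: a "neighbours" list column sits naturally beside sf's geometry list column. This is a sketch assuming the sf API; st_touches() returns a list-like index object of neighbouring features that can be stored as a column.

```r
# Sketch: a "neighbours" list column on an sf data frame (Roger's point a).
library(sf)

# Example dataset shipped with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# st_touches() returns, for each feature, the indices of features that
# touch it -- a list column alongside the geometry list column
nc$neighbours <- st_touches(nc)
nc$neighbours[[1]]  # row indices of the first county's neighbours
```
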

mdsumner commented 7 years ago

Re topology: with my approach in rangl, I can create the planar straight line graph from a set of features; then, when the edges have been fixed, I can recompose the features. I do not have the fortitude to try to apply the combination of numeric precision, intersection, neighbour tests and user choices required to detect and modify those edges. I think some of the issues can be detected and fixed automatically, but you'd want a very clear delineation between auto-fix, auto-fix within tolerance, and user-choice operations. Personally I use Manifold to tell me the details: http://www.georeference.org/doc/topology_factory.htm

(If someone wants to try the fixes in the middle, I'm confident of providing the edge-graph, and the ability to recompose it.)

But that is also not primarily why I'm interested in topology. I think those planar-feature problems are well supported elsewhere, and I don't think I have anything to offer there, but I'm happy to show the round-trip that I would do to find and fix the problems; I just see it as a difficult user interface to program all those details.

I'm interested in topology because it's needed for actual surfaces composed of 2D primitives, it provides a general structure for storing any data on the "features" framework, and it automatically has the right properties for many models.

Regarding Roger's point (b): if you use the rangl entity tables, you will naturally always "go long", with geometric dimensionality down the rows. But this assumes you don't have just one table: you need vertices, parts, and objects, and for topology you must de-duplicate the vertices with an added branch-vertex link table, and decompose the parts table to segments, with an added object-segments link table. There's no way to de-duplicate these things and keep them in nested structures in one table (I assert). When it's tables, you automatically have a slot to put any new attributes: on the vertices (e.g. bathymetry), on the parts, on the segments, or even on the link indexes for the instances of the vertices (e.g. time) rather than on the unique set.
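The vertex de-duplication idea can be sketched in base R. This is illustrative only, not the actual spbabel/rangl API: unique vertices go in one table, and a link table records every instance of a vertex, leaving a slot for instance-level attributes (time) separate from vertex-level ones (bathymetry).

```r
# Hypothetical sketch of vertex de-duplication into linked tables,
# in the spirit of rangl/spbabel (names are illustrative).

# Two squares sharing the edge x = 1: 8 vertex instances, 6 unique vertices
coords <- data.frame(
  x    = c(0, 1, 1, 0, 1, 2, 2, 1),
  y    = c(0, 0, 1, 1, 0, 0, 1, 1),
  part = rep(1:2, each = 4)
)

key <- paste(coords$x, coords$y)  # identity of a vertex by coordinates

# Unique-vertex table: one row per distinct location
vertices <- coords[!duplicated(key), c("x", "y")]
vertices$vertex_id <- seq_len(nrow(vertices))

# Link table: each *instance* points at the unique set; instance-level
# attributes (e.g. time) would live here, vertex-level ones above
links <- data.frame(part = coords$part,
                    vertex_id = match(key, unique(key)))

nrow(vertices)  # 6 unique vertices for 8 instances
```
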

I've been figuring these things out pretty slowly, and I appreciate that it's probably not familiar language - I'm still working on telling the story and organizing the tools better.

This is all very helpful, it's good to air these ideas and see what gels with others and where it doesn't stick. I'll try to take some of this for examples, so feel free to throw actual data and problems at me to test my claims.

edzer commented 7 years ago

Good that this is helpful. To help us understand which problems you would like us to throw to you, could you summarize the claims rangl, spbabel, gris etc make in a bullet list with less than five bullets, each taking at most a single line?

mdsumner commented 7 years ago

Hmm, I'm getting a not so subtle hint :)

edzer commented 7 years ago

Thanks, Mike!

hadley commented 7 years ago

@edzer you have not correctly interpreted that comment - it has nothing to do with S4 and the constraints are unrelated to NSE. It's even barely related to the tidyverse, except that ggplot2 expects tidy data (which implies list columns for more complicated stuff). The restriction is imposed by ggplot2's grammar.

ateucher commented 7 years ago

@mdsumner to your question:

is there a generic form of sf that you would prefer to start with, or that makes sense?

I actually find the sf structure itself really easy to deal with, even if parsing it involves a lot of unclass()-ing. I'm not sure if decomposing it to a flat table(s) and subsequently rolling it up into a geojson-like list would be more efficient (at least in this case).
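The unclass()-based parsing mentioned above can be shown in miniature. The helper below is hypothetical (not geojsonio's actual internals): stripping the sf classes exposes plain lists and matrices whose nesting maps almost directly onto GeoJSON's coordinate arrays.

```r
# Sketch: stripping sf classes to build a GeoJSON-like coordinate list.
# as_geojson_coords() is a hypothetical helper for illustration.
library(sf)

p <- st_polygon(list(matrix(c(0, 0, 1, 0, 1, 1, 0, 1, 0, 0),
                            ncol = 2, byrow = TRUE)))

as_geojson_coords <- function(geom) {
  # unclass() yields a plain list of ring matrices; GeoJSON wants each
  # ring as a list of [x, y] pairs
  lapply(unclass(geom), function(ring) {
    lapply(seq_len(nrow(ring)), function(i) ring[i, ])
  })
}

str(as_geojson_coords(p), max.level = 2)
```
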

I can understand your desire to have a central form that can be translated to any spatial class (if I'm understanding you correctly), but I don't see why sf can't be that form?

Nowosad commented 7 years ago

@ateucher There is a lot of information about sf (and the difference between sf and sp): just look at the github page (https://github.com/edzer/sfr), especially Simple features for R and the sf proposal.

mdsumner commented 7 years ago

I think that self-intersecting polygons is a common enough problem in shapefiles and a reasonable example to start with that I will try it.

I didn't want to go into topology here as it's too early still, before that I want a more general framework for translating spatial forms. It's hard to explain why without writing a lot, and it seems that is not helpful.

No, sf is not enough: you cannot store track data in its full detail, and you cannot store topology information in the sense of shared vertices, either shared between features or between parts. It's not an sf limitation; it's a simple features limitation.

I'll take feedback given here and use it to clean up my message. Please feel free to comment on any aspect, but in particular limitations that you face are welcome as grist for the mill.

mdsumner commented 7 years ago

If it makes any difference, the generic, unclassed form of these types I am talking about are not intended for general use. They are for developers, for having a standard under the hood that we share. Still, it may be that it's just for me, and that's ok. There's a pattern of hierarchy that can be expressed in either recursive structures or in relational ones, and I think there's a grammar there we should aim for.

Robinlovelace commented 7 years ago

I think there is merit in having a one-stop shop for spatial tools that builds on a class system (I'd vote for sf) to prevent the duplication of functions like maptools::spRbind, raster::bind and tmap::sbind: very confusing for the new user and tricky to teach. If people (e.g. @mtennekes, who has created some useful geotools that live in a viz package, as described here) were up for 'donating' their code to the community, this idea could have legs. I'd be up for donating code, provided the system named 'donors' as authors (although I'm unsure that functions such as stplanr::gprojected are the best way of doing things; it would need a peer review system, which could actually be really beneficial).

I'm imagining a community-owned 'metapackage', e.g. called geotools, that could help with teaching and learning up-to-date methods for handling spatial data with open source software.

Would need a plan and a structure to build-on though, so maybe heavily refactoring an existing package, led by someone, would be the way to go.

Robinlovelace commented 7 years ago

Doh, just seen geotools is taken. Something like sptools then, or sftools if that's too confusingly linked to sp, or simply stools, or something more imaginative... Interested to hear people's views on this.

maelle commented 7 years ago

@Robinlovelace stools does not sound that good :grin:

Robinlovelace commented 7 years ago

Agree, especially after sleeping on it!

I think it would be fine as a metapackage that simply pulls in other ones, like tidyverse does.

barryrowlingson commented 7 years ago

Isn't this what ctv::install.views("Spatial") should do?


edzer commented 7 years ago

tidyverse loads packages, install.views installs packages. The number of packages it installs is pretty massive -- luckily it doesn't load them all!

I'm running it now and saw that CRAN has a spatial.tools and a SpatialTools.

mtennekes commented 7 years ago

If we decide on a new (meta)package, it could be called spverse, spd (an abbreviation of spatial data), or spda (spatial data analysis). However, such a package doesn't solve the problem. tidyverse consists of packages that complement each other perfectly, without overlap. The spatial packages also complement each other, but there is a huge overlap. As @Robinlovelace pointed out, there are already (at least) 3 packages that do spatial binding, and there are probably also a dozen packages that can read shapefiles.

To build on @Robinlovelace's idea to have a new one-stop shop: we could collaborate on a series of (new) packages with the same approach as tidyverse, with sf as the base. Then, we could create/select a package for reading and writing spatial data, a package for processing spatial data, a package for visualizing spatial data, a package for geostatistics, etc. Of course, we don't need to completely rewrite everything, since there are already many well written spatial functions available. We only need to shop around in the task-view packages to find the 'best-of' functions per task, make the interfaces/code consistent with each other, and bundle them.

rsbivand commented 7 years ago

There are ecological advantages in messiness and diversity, as developer perspectives do differ. tmap, mapview and rasterVis all do different things, maybe are specialised, but to some extent modularise their vintage (when most development took place) and intentions. It may be more robust going forward to communicate the benefits of incremental change across multiple packages. I wouldn't overdo the seamlessness of tidyverse: at some stage Hadley may wish to do something else, or to change his mind (it happens, and should be allowed to happen), without everything depending on a single coordinated collection. There is a sweet point in dynamically developing open source communities that is sufficient coordination but not more. With sp we reached this for a time, but user expectations change, and sf is a timely re-implementation of sp/rgeos/rgdal (vector only). We aren't there on raster, and more progress on that side ought to be factored in. Active comparison of use cases of task view packages would be great. Edzer or I can ask Achim if, instead of just all or the core subset, we could have install.views() get a "curated" subset. If it would help, I'd be happy to share the maintenance of the Spatial task view, or to pass it on so that it could be used (much) more actively. Could it, for example, be edited on a small github and uploaded to R-forge svn?

tim-salabim commented 7 years ago

I agree that diversity is generally a good thing (I have been doing a lot of ecological studies lately), but for new-ish users of R spatial it can be a huge and scary mess. Therefore, a curated core of packages to get people from input to output and presentation is desirable in my opinion. Think of it maybe as QGIS, which is sufficient for most tasks on its own, yet all the other plugins (to R, GRASS or SAGA etc.) are available if needed.

rsbivand commented 7 years ago

Looking at the 11-year diff of listed packages on the Spatial task view from line 557 suggests that revising core (or alternative flavours, if possible) and editing multiple channels pointing to install.views(), not just the task view, would be a way of trying things out?

hadley commented 7 years ago

I think there are also reasonable advantages to mild consistency around code style — consistent argument name/order, consistent use of camelCase vs. snake_case etc — but it's hard to get a diverse group of developers to agree on that. (Even tidyverse packages are not entirely consistent on this front as my style choices have evolved over time)

As @rsbivand has already pointed out, the most important thing is sticking to a small set of shared data structures, and recognising that as needs change over time, those data structures will also have to change.

tim-salabim commented 7 years ago

So, this repo (i.e. the document https://github.com/tim-salabim/rspatial/blob/master/rspatial_roadmap.Rmd) is meant to provide exactly such a collection of main/necessary packages relevant for meaningful (which needs to be defined somehow) spatial analysis in R. Therefore, I would urge everyone to provide their thoughts/input in the relevant sections of the document, or by adding new sections as deemed necessary. I think this would make a good base from which to start thinking about an overarching core set and the steps necessary to minimise duplication/overlap, and maybe a common (data) structure will evolve from this.