Dplyr compatibility? - Githubissues

tiernanmartin commented 8 years ago

I see that working with sf objects with dplyr is listed on the ISC Proposal, but it looks like you haven't gotten around to documenting that yet.

Without going into a vignette-length explanation, could you shed some light on why dplyr functions appear to strip sf objects of their sf class and suggest a way to effectively combine the power of these two tools?

Example:

library(dplyr)
library(sf)

nc <- st_read(system.file("shapes/", package="maptools"), "sids")
## Reading layer sids from data source 
## /Library/Frameworks/R.framework/Versions/3.3/Resources/library/maptools/shapes/ using driver "ESRI Shapefile"
## features:       100
## fields:         14
## converted into: MULTIPOLYGON

class(nc)
## [1] "sf"         "data.frame"

class(nc[nc$AREA > .1,])
## [1] "sf"         "data.frame"

class(dplyr::filter(nc,AREA > .1))
## [1] "data.frame"

Many thanks.

edzer commented 8 years ago

Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by

nc %>% filter(AREA > .1) %>% st_as_sf() %>% plot()

but this is maybe, ehm, not very tidy? @hadley what do you think?

hadley commented 8 years ago

I think it's just a matter of adding the right dplyr methods to restore the sf class, e,g.:

filter_.sf <- function(.data, ..., .dots) {
  st_as_sf(NextMethod())
}

edzer commented 8 years ago

Got it, thanks! I see the following functions ending in a _ in dplyr:

 [1] "arrange_"        "count_"          "data_frame_"     "distinct_"      
 [5] "do_"             "filter_"         "funs_"           "group_by_"      
 [9] "group_indices_"  "lst_"            "mutate_"         "mutate_each_"   
[13] "rename_"         "rename_vars_"    "select_"         "select_vars_"   
[17] "slice_"          "summarise_"      "summarise_each_" "summarize_"     
[21] "summarize_each_" "translate_sql_"  "transmute_"

@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries?

mdsumner commented 8 years ago

I short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing.

In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing I haven't explored much what should/could happen there.

Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any useable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to).

mdsumner commented 8 years ago

I'm really going to try to spend some time on this, here is just what I tried recently, without going into detail.

library(sf)
example(st_read)

## simple (non aggregation) stuff already works
library(sf)
example(st_read)

library(dplyr)
nc %>% slice(10)

nc %>% filter(PERIMETER > 2.5)

## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))

## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA)) 

## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry))

I know this requires a lot of thought, but I think there's value in

not dropping the sf defs for a mutate into group_by (will require effort on how the grouped_df is handled)
a default aggregation behaviour, where any group_by keeps the sf defs and geometry column is unioned, even if it isn't operated on in summarize()

I haven't though deeply about the other dplyr behaviours.

edzer commented 8 years ago

UPDATE Please update to https://github.com/edzer/sfr/commit/c4dde04f4b7051d3ca173d0c56bc16c426ec6fa7 which has tested dplyr verb methods for sf objects; an overview:

  +"arrange_"       -"count_"         -"data_frame_"    +"distinct_"
  -"do_"            +"filter_"        -"funs_"          +"group_by_"
  -"group_indices_" -"lst_"           +"mutate_"        -"mutate_each_"
  +"rename_"        -"rename_vars_"   +"select_"        -"select_vars_"
  +"slice_"         +"summarise_"     -"summarise_each_"+"summarize_"
  *"summarize_each_"-"translate_sql_" +"transmute_"

from tidyr:  +gather, +spread

   +: added sf methods
   -: not relevant (?)
   *: will be deprecated?

where summarisze_ automatically merges geometries over groups (when called with a grouped_df).

tiernanmartin commented 8 years ago

May I also suggesttidyr::gather_() and tidyr::spread_() as worthy candidates for inclusion in the list of tidyverse functions that play well with sf.

edzer commented 8 years ago

Thanks; I did, but still untested.

edzer commented 8 years ago

I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!

library(sf)
library(dplyr)
nc = st_read(system.file("shape/nc.shp", package="sf"), crs = 4267)
nc %>% filter(AREA > .1) %>% plot()

# plot 10 smallest counties in grey:
nc %>% plot()
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# select: check both when geometry is part of the selection, and when not:
nc %>% select(SID74, SID79) %>% names()
nc %>% select(SID74, SID79, geometry) %>% names()
nc %>% select(SID74, SID79) %>% class()
nc %>% select(SID74, SID79, geometry) %>% class()

# arrange: ten smallest counties
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')

# group_by:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc %>% group_by(area_cl) %>% class()

# mutate:
nc2 <- nc %>% mutate(area10 = AREA/10)

# transmute:
nc %>% transmute(AREA = AREA/10, geometry = geometry) %>% class()
nc %>% transmute(AREA = AREA/10) %>% class()

# rename:
nc2 <- nc %>% rename(area = AREA)

# distinct:
nc[c(1:100,1:10),] %>% distinct() %>% nrow()

# summarize:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.g <- nc %>% group_by(area_cl)
nc.g %>% summarise(mean(AREA))
nc.g %>% summarize(mean(AREA)) %>% plot(col = 3:6/7)

library(tidyr)

# time-wide to long table, using tidyr::gather
# stack the two SID columns for the July 1, 1974 - June 30, 1978 and July 1, 1979 - June 30, 1984 periods
# (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf)
nc %>% select(SID74, SID79, geometry) %>% gather(VAR, SID, -geometry) %>% summary()

# spread:
nc$row = 1:100
nc.g <- nc %>% select(SID74, SID79, row) %>% gather(VAR, SID, -row)
nc.g %>% tail()
nc.g %>% spread(VAR, SID) %>% head()
nc %>% select(SID74, SID79, geometry, row) %>% gather(VAR, SID, -geometry, -row) %>% spread(VAR, SID) %>% head()

edzer commented 8 years ago

Here's an example computing population (well, birth) densities for aggregated areas:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features:       100
## fields:         14
## proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

did anyone get lost?

out %>% summarise(sum(A * new_dens))
##   sum(A * new_dens)
## 1            329962
nc.ea %>% summarise(sum(area * dens))
##   sum(area * dens)
## 1           329962

No. You might have discovered by now that I'm brand new to dplyr - happy to take comments how to tidy this up!

tiernanmartin commented 8 years ago

Thank you for implementing this in such short order @edzer!

When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to sp conventions. But now, thanks to your responsiveness and the feedback from others on this thread, I'm thrilled to start testing the sf x dplyr interface.

Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead.

Thanks again and I look forward to following the development of sf.

tiernanmartin commented 8 years ago

A few observations about summarise()'s interaction with sf objects:

the CRS gets dropped
the aggregated groups aren't what I expected
some geometries may be dropped

You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html

edzer commented 8 years ago

Great observations, beautiful tests! Now, they look even much better:

CRS now get passed through by summarise (for the other commands it did work)
aggregate groups and dropped geometries have now been fixed.

The issue you discoverd was that the indices attribute in objects returned by group_by is 0-based, where I assumed it was 1-based. @hadley if you are ever going to change this, remember telling sf about it!

kendonB commented 8 years ago

I'm not sure if this deserves it's own issue, but a full set of *_join operations would be a good addition here.

I haven't dug into sfr yet, so I don't know if you're allowing sf objects to contain multiple geometries (i.e. two MULTIPOLYGONs or a MULTIPOLYGON and a MULTIPOINT). If you can, (and if you can't, I think you should), then *join operations would result in sf objects with multiple geometries.

A proposed combo:

mp1 # sf with MULTIPOLYGON
mp2 # sf with MULTIPOLYGON
mp3 <- left_join(mp1, mp2, border = FALSE)

would result in an sf with two columns geometry1 and geometry2 and would behave just like left_join in dplyr except a match is defined as two geometries with positive spatial overlap; border = TRUE would include matches with common borders. left_join(x, y) includes all observations in x, regardless of whether they match or not. Multiple matches return as multiple rows.

There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two sfg objects and returns a logical would make the most sense. Does st_intersects, for example, allow sfg objects as inputs? If so, then this could be fully general for any pair of sf types. Usage would be something like:

mp3 <- left_join(mp1, mp2, match_fun = st_touches)

It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (https://github.com/hadley/dplyr/issues/557), but there might be a solution that isn't too difficult to implement.

The user could also generate new geometry columns using any function that takes two sfgs and returns an sfg. i.e.:

mp3 <- mp3 %>%
  mutate(geometry3 = purrr::map2(geometry1, geometry2, st_intersection))

tiernanmartin commented 8 years ago

@edzer Is it better to keep adding dplyr-related requests to this issue or to open fresh issues for each request?

edzer commented 8 years ago

Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ?

mdsumner commented 7 years ago

@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:

nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, 
                          dens = BIR74/area, 
                          area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))

It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:

library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom), 
              geom = st_transform(geom, 7314), 
              area = st_area(geom) / 1e6, 
              dens = BIR74 / area,
              AREA_cl = cut(AREA, c(0, .1, .12, .15, .25))) 

out <- nc.ea %>%   group_by(AREA_cl) %>% 
  summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)

out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1            329962

But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere.

mdsumner commented 7 years ago

And a check for the grouping and area calculations sp/rgeos style:

library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
#    (0,0.1]  (0.1,0.12] (0.12,0.15] (0.15,0.25] 
#  290620.6    182825.3    322369.5    585439.0 
out$A
##[1] 290620.6 182825.3 322369.5 585439.0

etiennebr commented 7 years ago

To follow up in relation to #121, I like the way tibble handles simple features, because they don't drop a class on select() (data.frames would do the same, though tibbles are very well behaved). Because they have no expectations of a spatial column when selecting they can just drop the geometry, which makes it coherent and explicit. It's also easy to add another spatial column.

I ran an experiment (it fails after summarise, I don't know exactly why, maybe related to #49 ?).

mstrimas commented 7 years ago

I'm noticing some potentially odd behaviour with spread(). When the geometries of features with the same key are different spread() results in a single row with only the geometry of the first instance of that key. Here's an example:

nc <- st_read(system.file("shape/nc.shp", package="sf"))
# make a long format data set
nc_gathered <- nc %>% 
  select(county = NAME, BIR74, BIR79, -geometry) %>% 
  slice(1:3) %>% 
  gather(year, births, BIR74, BIR79)
# works as expected
nc_gathered %>% 
  spread(year, births)
# change second Alleghany geometry
nc_gathered$geometry[[4]] <- nc_gathered$geometry[[2]]
nc_gathered %>% 
  spread(year, births)
# now there's only one Alleghany entry
# in contrast, if this were a normal data frame there would be 
# two Alleghany entries, one for each of the different geometries

edzer commented 7 years ago

I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently?

mstrimas commented 7 years ago

I don't know what the best approach is, or if this even matters, largely because I'm unclear of how spread() might be used for spatial data in the wild. Just seemed to me that the behaviour of spread_.sf() is inconsistent with spread() for data frames. Without having done much testing, wouldn't this method work:

spread_.sf <- function (data, key_col, value_col, fill = NA, convert = FALSE, 
    drop = TRUE, sep = NULL) 
{
  st_as_sf(NextMethod())
}

edzer commented 7 years ago

Gives me

> nc_gathered %>% 
+   spread(year, births)
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index

on your example. Did you do any testing?

mstrimas commented 7 years ago

Oops, sorry, forgot a crucial line! This is actually tested and works for me.

spread_.sf <- function(data, key_col, value_col, fill = NA, 
                       convert = FALSE, drop = TRUE, sep = NULL) {
  data <- as.data.frame(data)
  st_as_sf(NextMethod())
}

Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks!

edzer commented 7 years ago

Thanks; I replace the original spread_.sf with yours.

mstrimas commented 7 years ago

Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:

library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index

Appears to be because mutate() drops the st_column attribute, but keeps the sf class def, when run on grouped data frames (though not on normal data frames for some reason). Then st_as_sf.sf() is called instead of st_as_sf.data.frame() which doesn't add the st_column back. I have no idea if this is the best approach, but the only way I was able to fix this is by removing the sf class with:

mutate_.sf <- function(.data, ..., .dots) {
  class(.data) <- setdiff(class(.data), "sf")
  st_as_sf(NextMethod())
}

Grouped filter() does not suffer from this issue.

edzer commented 7 years ago

Clumsy as it looks, I agree that might be the best thing to do. Thanks!

eivindhammers commented 7 years ago

Any reason why class(.data) <- setdiff(class(.data), "sf") was only added to group_by() in f51b9a3?

I'm still getting Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index.

edzer commented 7 years ago

We might have to do it everywhere. Does it solve your problem? With which function is that?

eivindhammers commented 7 years ago

It does solve it. mutate()

yutannihilation commented 7 years ago

I know sf is on the mid-way toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, sf object loses its class (sf) with dplyr 0.6.0rc. In case you've not noticed yet, I report the issue here:

0.5.0(current):

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf"         "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0'

RC for 0.6.0:

library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"

packageVersion("dplyr")
#> [1] '0.5.0.9002'

One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. select_, filter_, ...), do you plan to introduce tidyeval?

yutannihilation commented 7 years ago

Ah, my comment above may be duplicated with #304.

edzer commented 7 years ago

There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the verb_ methods will be deprecated but remain around for backward compatibility. This is, as you found out, not yet the case, so I consider your worries a little premature. sf is not mid-way something, dplyr is.

yutannihilation commented 7 years ago

Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo.

hadley commented 7 years ago

We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem.

@edzer revdep emails will go out (hopefully) later today.

edzer commented 7 years ago

Scripts that use dplyr verbs such as mutate (see 5 comments above for a reprex) no longer work with packages that dispatch on verb_ (mutate_), which is the only way dplyr 0.5 provides. To be compatible, users now need to change mutate into mutate_ in their scripts, did you have in mind to recommend them doing that?

I don't mind modifying sf, but consider it a backward incompatible change.

hadley commented 7 years ago

That is not the intent - can you please file a bug on dplyr?

edzer commented 7 years ago

OK: https://github.com/tidyverse/dplyr/issues/2664

r-spatial / sf

Dplyr compatibility? #42