Closed tiernanmartin closed 7 years ago
Good catch! I guess (but am not sure) that this is out of my hands; right now you could coerce back by
nc %>% filter(AREA > .1) %>% st_as_sf() %>% plot()
but this is maybe, ehm, not very tidy? @hadley what do you think?
I think it's just a matter of adding the right dplyr methods to restore the sf class, e,g.:
filter_.sf <- function(.data, ..., .dots) {
st_as_sf(NextMethod())
}
Got it, thanks! I see the following functions ending in a _
in dplyr:
[1] "arrange_" "count_" "data_frame_" "distinct_"
[5] "do_" "filter_" "funs_" "group_by_"
[9] "group_indices_" "lst_" "mutate_" "mutate_each_"
[13] "rename_" "rename_vars_" "select_" "select_vars_"
[17] "slice_" "summarise_" "summarise_each_" "summarize_"
[21] "summarize_each_" "translate_sql_" "transmute_"
@mdsumner which ones can we meaningfully support? Are there commands that need additional geometrical operations, like merging/unioning geometries?
I short, I don't know the answer, I haven't explored it. At a guess, group_by() %>% summarize() is the main one for geometry, I imagine that it should work like the rgeos functions in byid=FALSE mode, but applied to each grouping. If there's no fun(geom) then perhaps error, or just drop the sf classing.
In https://github.com/mdsumner/spdplyr I only do the simple ones, arrange, distinct, filter, mutate, rename, select, slice, - the group_by/summarize just bangs the geometries together without removing internal boundaries or intersections - but this was done in isolation, and without me knowing how to extend dplyr, really. Joins are another thing I haven't explored much what should/could happen there.
Sadly, despite the common "we need dplyr for Spatial" cries in the community I haven't seen any useable details about what is desired. (Personally, this is not of high interest without exposing the underlying X, Y, [Z], [M] attributes, and the underlying grouping structures in the geom in a consistent way, but to do that you need to go beyond path-based lines and polys, and I've put all of that into work elsewhere. I had hoped for there to be a common framework between ggplot/ggvis/rgl and sf, but I think that all belongs outside of sf. There's no such common framework outside of R, for example, so there's no analogous target to adhere to).
I'm really going to try to spend some time on this, here is just what I tried recently, without going into detail.
library(sf)
example(st_read)
## simple (non aggregation) stuff already works
library(sf)
example(st_read)
library(dplyr)
nc %>% slice(10)
nc %>% filter(PERIMETER > 2.5)
## geometry dropped as expected
nc %>% group_by(SID79) %>% summarize(sum(PERIMETER))
## grouped mutate (without aggregation) keeps geometry but drops
## sf defs
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
## trick fun to apply to the geometry column
## works for grouped union, though drops the sf defs
fun <- function(x) st_union_cascaded(st_sfc(do.call(c, st_geometry(x))))
nc %>% group_by(SID79) %>% summarize(geometry = fun(geometry))
I know this requires a lot of thought, but I think there's value in
I haven't though deeply about the other dplyr behaviours.
UPDATE
Please update to https://github.com/edzer/sfr/commit/c4dde04f4b7051d3ca173d0c56bc16c426ec6fa7 which has tested dplyr
verb methods for sf
objects; an overview:
+"arrange_" -"count_" -"data_frame_" +"distinct_"
-"do_" +"filter_" -"funs_" +"group_by_"
-"group_indices_" -"lst_" +"mutate_" -"mutate_each_"
+"rename_" -"rename_vars_" +"select_" -"select_vars_"
+"slice_" +"summarise_" -"summarise_each_"+"summarize_"
*"summarize_each_"-"translate_sql_" +"transmute_"
from tidyr: +gather, +spread
+: added sf methods
-: not relevant (?)
*: will be deprecated?
where summarisze_
automatically merges geometries over groups (when called with a grouped_df
).
May I also suggesttidyr::gather_()
and tidyr::spread_()
as worthy candidates for inclusion in the list of tidyverse
functions that play well with sf
.
Thanks; I did, but still untested.
I updated the table 3 comments up; all relevant dplyr verbs + gather & spread are now implemented and lightly tested. Below is my test script. Happy testing - positive feedback also welcome!
library(sf)
library(dplyr)
nc = st_read(system.file("shape/nc.shp", package="sf"), crs = 4267)
nc %>% filter(AREA > .1) %>% plot()
# plot 10 smallest counties in grey:
nc %>% plot()
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')
# select: check both when geometry is part of the selection, and when not:
nc %>% select(SID74, SID79) %>% names()
nc %>% select(SID74, SID79, geometry) %>% names()
nc %>% select(SID74, SID79) %>% class()
nc %>% select(SID74, SID79, geometry) %>% class()
# arrange: ten smallest counties
nc %>% arrange(AREA) %>% slice(1:10) %>% plot(add = TRUE, col = 'grey')
# group_by:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc %>% group_by(area_cl) %>% class()
# mutate:
nc2 <- nc %>% mutate(area10 = AREA/10)
# transmute:
nc %>% transmute(AREA = AREA/10, geometry = geometry) %>% class()
nc %>% transmute(AREA = AREA/10) %>% class()
# rename:
nc2 <- nc %>% rename(area = AREA)
# distinct:
nc[c(1:100,1:10),] %>% distinct() %>% nrow()
# summarize:
nc$area_cl = cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.g <- nc %>% group_by(area_cl)
nc.g %>% summarise(mean(AREA))
nc.g %>% summarize(mean(AREA)) %>% plot(col = 3:6/7)
library(tidyr)
# time-wide to long table, using tidyr::gather
# stack the two SID columns for the July 1, 1974 - June 30, 1978 and July 1, 1979 - June 30, 1984 periods
# (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf)
nc %>% select(SID74, SID79, geometry) %>% gather(VAR, SID, -geometry) %>% summary()
# spread:
nc$row = 1:100
nc.g <- nc %>% select(SID74, SID79, row) %>% gather(VAR, SID, -row)
nc.g %>% tail()
nc.g %>% spread(VAR, SID) %>% head()
nc %>% select(SID74, SID79, geometry, row) %>% gather(VAR, SID, -geometry, -row) %>% spread(VAR, SID) %>% head()
Here's an example computing population (well, birth) densities for aggregated areas:
library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
## Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.3/sf/gpkg/nc.gpkg' using driver `GPKG'
## features: 100
## fields: 14
## proj4string: +proj=longlat +datum=NAD27 +no_defs
nc.ea <- st_transform(nc, 7314) # Lambert equal area
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6, dens = BIR74/area) # births/km^2
summary(nc.ea$dens)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01834 0.08762 0.14850 0.23860 0.27700 1.37700
nc.ea$area_cl <- cut(nc$AREA, c(0, .1, .12, .15, .25))
nc.grp <- nc.ea %>% group_by(area_cl)
out <- nc.grp %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
did anyone get lost?
out %>% summarise(sum(A * new_dens))
## sum(A * new_dens)
## 1 329962
nc.ea %>% summarise(sum(area * dens))
## sum(area * dens)
## 1 329962
No. You might have discovered by now that I'm brand new to dplyr
- happy to take comments how to tidy this up!
Thank you for implementing this in such short order @edzer!
When I opened this issue I had no expectation of seeing progress in the near term and had already started reverting my current project back to sp
conventions. But now, thanks to your responsiveness and the feedback from others on this thread, I'm thrilled to start testing the sf
x dplyr
interface.
Going forward I'll be happy to share my feedback, tests, and any unusual situations I run into. As @mdsumner mentioned, there's plenty of thinking and hard work ahead.
Thanks again and I look forward to following the development of sf
.
A few observations about summarise()
's interaction with sf
objects:
You can find tests demonstrating these observations here: https://tiernanmartin.github.io/YCC-Baseline-Conditions/3-communication/other/sp-dplyr-test.nb.html
Great observations, beautiful tests! Now, they look even much better:
summarise
(for the other commands it did work)The issue you discoverd was that the indices
attribute in objects returned by group_by
is 0-based, where I assumed it was 1-based. @hadley if you are ever going to change this, remember telling sf about it!
I'm not sure if this deserves it's own issue, but a full set of *_join
operations would be a good addition here.
I haven't dug into sfr
yet, so I don't know if you're allowing sf
objects to contain multiple geometries (i.e. two MULTIPOLYGON
s or a MULTIPOLYGON
and a MULTIPOINT
). If you can, (and if you can't, I think you should), then *join operations would result in sf
objects with multiple geometries.
A proposed combo:
mp1 # sf with MULTIPOLYGON
mp2 # sf with MULTIPOLYGON
mp3 <- left_join(mp1, mp2, border = FALSE)
would result in an sf
with two columns geometry1 and geometry2 and would behave just like left_join in dplyr except a match is defined as two geometries with positive spatial overlap; border = TRUE
would include matches with common borders. left_join(x, y)
includes all observations in x, regardless of whether they match or not. Multiple matches return as multiple rows.
There are maybe more match conditions that you would want to include. Perhaps allowing any function that takes two sfg
objects and returns a logical would make the most sense. Does st_intersects
, for example, allow sfg
objects as inputs? If so, then this could be fully general for any pair of sf
types. Usage would be something like:
mp3 <- left_join(mp1, mp2, match_fun = st_touches)
It probably also makes sense to do this so you don't have to reinvent the join wheel. I don't believe dplyr allows for arbitrary conditions for join matches yet (https://github.com/hadley/dplyr/issues/557), but there might be a solution that isn't too difficult to implement.
The user could also generate new geometry columns using any function that takes two sfg
s and returns an sfg
. i.e.:
mp3 <- mp3 %>%
mutate(geometry3 = purrr::map2(geometry1, geometry2, st_intersection))
@edzer Is it better to keep adding dplyr
-related requests to this issue or to open fresh issues for each request?
Yes, @kendonB could you pls make this into a separate issue, for instance called "Dplyr-style join operations with spatial predicates" ?
@edzer this is really nice, there's some code reduction if you put the cut and the group by actions in the pipeline together:
nc.ea <- nc.ea %>% mutate(area = st_area(nc.ea) / 1e6,
dens = BIR74/area,
area_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(area_cl) %>% summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
## compare
out %>% summarise(sum(A * new_dens))
nc.ea %>% summarise(sum(area * dens))
It reminds me that applying the st_ functions inside verbs is a motivator, and seems fine:
library(dplyr)
library(sf)
## Linking to GEOS 3.5.0, GDAL 2.1.0
demo(nc, ask = FALSE, echo = FALSE)
nc.ea <- nc %>% mutate(area_nonsense = st_area(geom),
geom = st_transform(geom, 7314),
area = st_area(geom) / 1e6,
dens = BIR74 / area,
AREA_cl = cut(AREA, c(0, .1, .12, .15, .25)))
out <- nc.ea %>% group_by(AREA_cl) %>%
summarise(A = sum(area), pop = sum(dens * area), new_dens = pop/A)
out %>% summarise(sum(A * new_dens))
##sum(A * new_dens)
## 1 329962
But also I see that 7314 is Transverse Mercator, not Lambert EA - it's not so important re the example, but worth correcting in case this gets used elsewhere.
And a check for the grouping and area calculations sp/rgeos style:
library(rgdal)
spnc <- spTransform(as(nc, "Spatial"), "+proj=tmerc +lat_0=40.35 +lon_0=-86.15000000000001 +k=1.000031 +x_0=240000 +y_0=36000 +ellps=GRS80 +units=us-ft +no_defs")
sp.out <- rgeos::gUnionCascaded(spnc, as.character(cut(spnc$AREA, c(0, .1, .12, .15, .25))))
rgeos::gArea(sp.out, byid = TRUE)/1e6
# (0,0.1] (0.1,0.12] (0.12,0.15] (0.15,0.25]
# 290620.6 182825.3 322369.5 585439.0
out$A
##[1] 290620.6 182825.3 322369.5 585439.0
To follow up in relation to #121, I like the way tibble
handles simple features, because they don't drop a class on select()
(data.frames would do the same, though tibbles are very well behaved). Because they have no expectations of a spatial column when selecting they can just drop the geometry, which makes it coherent and explicit. It's also easy to add another spatial column.
I ran an experiment (it fails after summarise
, I don't know exactly why, maybe related to #49 ?).
I'm noticing some potentially odd behaviour with spread()
. When the geometries of features with the same key are different spread()
results in a single row with only the geometry of the first instance of that key. Here's an example:
nc <- st_read(system.file("shape/nc.shp", package="sf"))
# make a long format data set
nc_gathered <- nc %>%
select(county = NAME, BIR74, BIR79, -geometry) %>%
slice(1:3) %>%
gather(year, births, BIR74, BIR79)
# works as expected
nc_gathered %>%
spread(year, births)
# change second Alleghany geometry
nc_gathered$geometry[[4]] <- nc_gathered$geometry[[2]]
nc_gathered %>%
spread(year, births)
# now there's only one Alleghany entry
# in contrast, if this were a normal data frame there would be
# two Alleghany entries, one for each of the different geometries
I agree that while duplicating, and then de-duplicating geometries, there is the implicit assumption that no one messes up things in between. Why would you do this? How would you want this to work differently?
I don't know what the best approach is, or if this even matters, largely because I'm unclear of how spread()
might be used for spatial data in the wild. Just seemed to me that the behaviour of spread_.sf()
is inconsistent with spread()
for data frames. Without having done much testing, wouldn't this method work:
spread_.sf <- function (data, key_col, value_col, fill = NA, convert = FALSE,
drop = TRUE, sep = NULL)
{
st_as_sf(NextMethod())
}
Gives me
> nc_gathered %>%
+ spread(year, births)
Error in .subset2(x, i, exact = exact) :
attempt to select less than one element in get1index
on your example. Did you do any testing?
Oops, sorry, forgot a crucial line! This is actually tested and works for me.
spread_.sf <- function(data, key_col, value_col, fill = NA,
convert = FALSE, drop = TRUE, sep = NULL) {
data <- as.data.frame(data)
st_as_sf(NextMethod())
}
Anyhow, not sure if this approach is necessarily better, just thought I'd bring it up... Thanks!
Thanks; I replace the original spread_.sf
with yours.
Above @mdsumner gives an example of a grouped mutate. I get an error when running this example:
library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc %>% group_by(SID79) %>% mutate(AREA = sum(AREA))
#> Warning in is.na(st_agr(x)): is.na() applied to non-(list or vector) of
#> type 'NULL'
#> Error in .subset2(x, i, exact = exact): attempt to select less than one element in get1index
Appears to be because mutate()
drops the st_column
attribute, but keeps the sf
class def, when run on grouped data frames (though not on normal data frames for some reason). Then st_as_sf.sf()
is called instead of st_as_sf.data.frame()
which doesn't add the st_column
back. I have no idea if this is the best approach, but the only way I was able to fix this is by removing the sf
class with:
mutate_.sf <- function(.data, ..., .dots) {
class(.data) <- setdiff(class(.data), "sf")
st_as_sf(NextMethod())
}
Grouped filter()
does not suffer from this issue.
Clumsy as it looks, I agree that might be the best thing to do. Thanks!
Any reason why class(.data) <- setdiff(class(.data), "sf")
was only added to group_by()
in f51b9a3?
I'm still getting Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index
.
We might have to do it everywhere. Does it solve your problem? With which function is that?
It does solve it. mutate()
I know sf is on the mid-way toward achieving compatibility with dplyr, but I'm a bit afraid the compatibility will be degraded with the next release of dplyr. For example, sf
object loses its class (sf
) with dplyr 0.6.0rc. In case you've not noticed yet, I report the issue here:
0.5.0(current):
library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "sf" "data.frame"
packageVersion("dplyr")
#> [1] '0.5.0'
RC for 0.6.0:
library(sf)
#> Linking to GEOS 3.5.0, GDAL 2.1.1, proj.4 4.9.3
library(dplyr, warn.conflicts = FALSE)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% mutate(NEW_AREA = AREA/ max(AREA)) %>% class
#> [1] "data.frame"
packageVersion("dplyr")
#> [1] '0.5.0.9002'
One more thing I want to ask is, while dplyr has a plan to deprecate SE functions (e.g. select_
, filter_
, ...), do you plan to introduce tidyeval?
Ah, my comment above may be duplicated with #304.
There is no next release of dplyr yet, and there are no signs that @hadley has done any reverse dependency checks so far. The rstudio blog mentions that the verb_
methods will be deprecated but remain around for backward compatibility. This is, as you found out, not yet the case, so I consider your worries a little premature. sf
is not mid-way something, dplyr
is.
Oh, I see. thanks for your quick response:) I will consider filing this issue to dplyr's repo.
We have been careful not to make any backward incompatible changes - we should have a system of default methods that ensures existing backends still work. If that isn't true, I'd really appreciate a minimal reprex filed in dplyr that illustrates the problem.
@edzer revdep emails will go out (hopefully) later today.
Scripts that use dplyr verbs such as mutate
(see 5 comments above for a reprex) no longer work with packages that dispatch on verb_
(mutate_
), which is the only way dplyr 0.5 provides. To be compatible, users now need to change mutate
into mutate_
in their scripts, did you have in mind to recommend them doing that?
I don't mind modifying sf
, but consider it a backward incompatible change.
That is not the intent - can you please file a bug on dplyr?
I see that working with
sf
objects withdplyr
is listed on the ISC Proposal, but it looks like you haven't gotten around to documenting that yet.Without going into a vignette-length explanation, could you shed some light on why
dplyr
functions appear to stripsf
objects of theirsf
class and suggest a way to effectively combine the power of these two tools?Example:
Many thanks.