r-spatial / sf

Simple Features for R
https://r-spatial.github.io/sf/
Other
1.34k stars 298 forks source link

Error on st_union() in summarise() #131

Closed etiennebr closed 7 years ago

etiennebr commented 7 years ago

st_union() seem to work fine, but in summarise() it creates a weird geometry that it can't read. I tried to dump the geometry to wkt (quite large, so I won't put it here), and it can display on wicket, but vertex order is slightly wrong. I think it might be how it handled the multi parts polygons.

library(dplyr)
library(tibble)

nc <- st_read(system.file("shape/nc.shp", package="sf"))

nc %>%
  st_area() %>%
  sum
# 127031757146 m^2

as(nc$geometry, "Spatial") %>% 
  rgeos::gUnionCascaded(nc$SID79) %>%
  st_as_sf() %>% 
  st_area %>% 
  sum
# 127031757076 m^2

x <- lapply(unique(nc$SID79), function(i) {
  nc %>%
    dplyr::filter(SID79 == i) %>% 
    summarise(geog = st_union(geometry), area = st_area(geog))
})
# It would be nice to have something like row_bind(x) %>% st_area() %>% sum()
sum(sapply(x, function(x) x[["area"]]))
# [1] 127031757146

nc %>%
  ungroup() %>%
  summarise(geog = st_union(geometry), area = st_area(geog))
#                             geog             area
# 1 MULTIPOLYGON(((-76.54427337... 127031757146 m^2

But it fails with summarise

nc %>%
  group_by(SID79) %>% 
  summarise(geog = st_union(geometry), area = st_area(geog))
# Error in summarise_impl(.data, dots) : 
# no applicable method for 'st_geometry' applied to an object of class "list" 

To dump the geometry:

nc %>% as_tibble %>%
  group_by(SID79) %>% 
  summarise(geog = st_union(geometry)) %$%
  st_as_text(geog) %>%
  cat(file="geog.wkt")
etiennebr commented 7 years ago

I see, this is taken care of in summarise_.sf.

edzer commented 7 years ago

If you

nc %>% as_tibble %>% st_as_sf %>% class()
# [1] "sf"         "tbl_df"     "tbl"        "data.frame"

you get an object of class sf, tbl_df etc. Is that a good idea? Does that help for anything?

etiennebr commented 7 years ago

Seems to be related to the geometries.

Works :

library(sf)
library(tibble)
library(dplyr)
library(magrittr)
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
nc %>% 
  as_tibble %>% 
  filter(SID79 < 10) %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geometry)) %>% 
  mutate(area = st_area(geog))
# # A tibble: 10 × 3
# SID79              geog            area
# <dbl>              <sf>     <S3: units>
# 1      0 <MULTIPOLYGON...>  8473635999 m^2
# 2      1 <MULTIPOLYGON...>  7871653802 m^2
# 3      2 <MULTIPOLYGON...>  9608907418 m^2
# 4      3 <MULTIPOLYGON...>  8136026436 m^2
# 5      4 <MULTIPOLYGON...> 11879964776 m^2
# 6      5 <MULTIPOLYGON...> 13623569968 m^2
# 7      6 <MULTIPOLYGON...>  6992413837 m^2
# 8      7 <MULTIPOLYGON...>  8542520112 m^2
# 9      8 <MULTIPOLYGON...>  8000734201 m^2
# 10     9 <MULTIPOLYGON...>  4615845685 m^2

But,

nc %>% 
  as_tibble %>% 
  filter(SID79 < 11) %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography)) %>% 
  mutate(area = st_area(geog))
# Error in mutate_impl(.data, dots) : 
#   unable to find an inherited method for function ‘coordinates’ for signature ‘"numeric"’
edzer commented 7 years ago

I don't understand the traceback() of that.

etiennebr commented 7 years ago

Ah! I think it has to do with POLYGON vs MULTIPOLYGON. e.g. : SID79:1 is MULTIPOLYGON (9 features), but SID79:10 is POLYGON (1 feature)

nc %>% 
  as_tibble %>% 
  filter(SID79 %in% c(1, 10)) %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography)) %>% 
  mutate(area = st_area(geog))
# Error in CPL_get_bbox(obj, 3) : not a matrix 

SID79:12 also has a single feature

nc %>% 
  as_tibble %>% 
  filter(SID79 %in% c(10, 12)) %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography)) %>% 
  mutate(area = st_area(geog))
# # A tibble: 2 × 3
#   SID79              geog           area
#   <dbl>              <sf>    <S3: units>
# 1    10 <POLYGON((-78...> 1550981080 m^2
# 2    12 <POLYGON((-79...> 2019778925 m^2
x <- nc %>% 
  as_tibble %>% 
  filter(SID79 %in% c(10, 11)) %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography))

str(unclass(x$geog))
# List of 2
#  $ :List of 1
#   ..$ : num [1:22, 1:2] -78.6 -78.7 -78.8 -78.9 -78.9 ...
#   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#  $ :List of 2
#   ..$ :List of 1
#   .. ..$ : num [1:45, 1:2] -77.5 -77.5 -77.5 -77.5 -77.5 ...
#   ..$ :List of 1
#   .. ..$ : num [1:10, 1:2] -79.2 -79.2 -79.5 -79.5 -79.5 ...
#   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#  - attr(*, "n_empty")= int 0
#  - attr(*, "precision")= num 0
#  - attr(*, "bbox")= Named num [1:4] -79.2 35.2 -78.5 35.6
#   ..- attr(*, "names")= chr [1:4] "xmin" "ymin" "xmax" "ymax"
#  - attr(*, "crs")=List of 2
#   ..$ epsg       : int 4267
#   ..$ proj4string: chr "+proj=longlat +datum=NAD27 +no_defs"
#   ..- attr(*, "class")= chr "crs"

Even more surprising

mutate(x, area=st_area(geog))
# Error in CPL_get_bbox(obj, 2) : not compatible with requested type 

But

st_area(x$geog)
# Units: m^2
# [1] 1550981080  575180854
etiennebr commented 7 years ago

I'm getting deeper in sf, leaving breadcrumbs in case @edzer you see I'm missing something obvious.

The error happens when printing the sfc. Traceback:

17. structure(list(message = "not compatible with requested type", 
    call = CPL_get_bbox(obj, 2), cppstack = NULL), .Names = c("message", 
"call", "cppstack"), class = c("Rcpp::not_compatible", "C++Error", 
"error", "condition"))) 
16. get_bbox(obj, 2) 
15. CPL_get_bbox(obj, 2), names = c("xmin", "ymin", "xmax", 
    "ymax")) at bbox.R#20
14. bbox.sfc_POLYGON(x) at bbox.R#39
13. bbox(x) at sfc.R#107
12. sfc`(X[[i]], ...) 
11. FUN(X[[i]], ...) 
10. lapply(x, `[`, i) 

So this works fine (dropping the geog column)

transmute(x, area=st_area(geog))
# # A tibble: 2 × 1
#             area
#      <S3: units>
# 1 1550981080 m^2
# 2  575180854 m^2

Also using your suggestion @edzer : x %>% st_as_sf() prints fine

etiennebr commented 7 years ago

There's something happening in summarise, after it evaluates all the arguments. The correct class for POLYGON and MULTIPOLYGON is GEOMETRY, but when the st_sfc conversion happens within summarise it thinks it's MULTIPOLYGON.

# ok
nc %>% 
  as_tibble %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography)) %>% 
  mutate(geog = st_sfc(geog)) %$% 
  class(geog)
# [1] "sfc_GEOMETRY" "sfc"   

# not ok
nc %>% 
  as_tibble %>% 
  group_by(SID79) %>% 
  summarise(geog = st_sfc(st_union(geography))) %$% 
  class(geog)
# [1] "sfc_MULTIPOLYGON" "sfc" 

Same thing if st_sfc is in a separate argument, or if it's not a tibble.

nc %>% 
  as_tibble %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geography),
            geog = st_sfc(geog)) %$% 
  class(geog)
# [1] "sfc_MULTIPOLYGON" "sfc" 

nc %>% 
  group_by(SID79) %>% 
  summarise(geog = st_union(geometry)) %$% 
  class(geog)
# [1] "sfc_MULTIPOLYGON" "sfc" 
edzer commented 7 years ago

Great progress. st_sfc tries to set the class of the collection of geometries right, but it seems that summarize calls st_sfc for each group, and then binds the groups together without calling st_sfc on the total; mutate does this, as it operates on the whole list-column. Can we find out how summarize binds groups, and interfere there?

edzer commented 7 years ago

An alternative, more hacky approach would be to make summarise_.sf always the "right" class, instead of the simplest class that st_union apparently returns; here st_cast might come in handy, @mdsumner ? Of course this would only work if summarise_.sf is being called; after piping through as_tibble this is no longer the case? Another reason to maintain the sf class attribute?

etiennebr commented 7 years ago

I think there are two issues. First is, I didn't expect summarise to attempt to take care of the geometry merge with st_union, which created two geometries. Once this is clear, it would be possible to allow overriding st_union, but I believe it's a very good default.

nc %>%
    group_by(SID79) %>% 
    summarise(geog = st_union(geometry))
# Simple feature collection with 28 features and 1 field
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -83.98855 ymin: 34.98907 xmax: -75.7637 ymax: 36.58965
# epsg (SRID):    4267
# proj4string:    +proj=longlat +datum=NAD27 +no_defs
#    SID79                           geog
# 1      0 MULTIPOLYGON(((-83.93799591...
# 2      1 MULTIPOLYGON(((-84.29103851...
# 3      2 MULTIPOLYGON(((-77.04900360...
# 4      3 MULTIPOLYGON(((-78.02592468...
# ...
# 11    10 POLYGON((-78.6127395629883 ...
# 12    11 MULTIPOLYGON(((-77.47388458...
# 13    12 POLYGON((-79.7649917602539 ...
# 14    13 POLYGON((-78.5387420654297 ...
# 15    14 POLYGON((-77.8365783691406 ...
# 16    15 POLYGON((-81.8162841796875 ...
# 17    16 POLYGON((-79.4559707641602 ...
# Warning message:
# In st_sf(x, ..., relation_to_geometry = relation_to_geometry) :
#   more than one geometry column: ignoring all but first

The other issue seems to be that the geometry type ends up being MULTIPOLYGON, but should be GEOMETRY (in this specific case). I think that's the real problem that has multiple symptoms.

etiennebr commented 7 years ago

To answer your question on sf. This is a larger topic, but it's a good question to ask, might even require to open another issue. I've been thinking about sf class attributes, and before leaving for the holidays I wanted to write a few thoughts. I think it is very convenient and makes it more similar to sp. Basically, it makes it straightforward to manipulate a data.frame without having to care for the spatial attributes all the time.

I don't see it as a bad thing to have the user explicitly care for the spatial attributes. It might even make it easier in the long run to have the user understand what is going on and explicitly ask for it.

On the development side, I think it offers additional control (like it would in this case), but also put the burden of having good defaults that always end up frustrating half of the users :)

It could be easier to develop and interact with other packages to focus on sfc than sf functionnality. For instance I can envision geom_sf() (#88) operating on a sfc rather than a sf.

nc %>% 
  ggplot(aes(sf=geom, fill=SID79)) +
  geom_sf()  # or some other name :)

However while I've dug deep in some parts of sf, I don't feel like I have yet a broad understanding of the package (lots of stuff there !), so maybe I miss a piece of the picture.

mdsumner commented 7 years ago

I'm not sure about a new name for the geometry column after a groupby/summarize, I think that means summarise.sf must check at the end of evaluation and update sf_column to an existing sfc? I feel I don't have enough experience with S3, or with extensions to dataframes generally.

But, I think it's clear that class(as_tibble(sf)) should be c("tbl_df", "tbl", "sf", "data.frame"), i.e. respecting what was passed in?

edzer commented 7 years ago

In that case, sf needs to provide as_tibble.sf, which it doesn't, right now.

Randall-HYLA commented 3 years ago

Hi, I am having a problem that I think is related to this trend. I have a huge set of polygons (birds dataset) that I need to do an union for some of them in the dataset (birds polygons with same id_no). I have been able to do the union of some these polygons but other polygons seems to be crashing the process (I started to do data subsets with data that works). I think the crashes are happening because some polygons have distinct geometry type in the dataset. I have been getting this error and warning message when I have some of these polygons in the sf dataset:

Error: Problem with summarise() input geometry. x Evaluation error: ParseException: Unknown WKB type 12. i Input geometry is sf::st_union(geometry). i The error occurred in group 438: id_no = "22709319". Run rlang::last_error() to see where the error occurred. In addition: Warning message: In st_is_longlat(x) : bounding box has potentially an invalid value range for longlat data

This is the code I have been using to do the union:

bird_union_sf <- bird_sf %>% group_by(id_no) %>% summarise(geometry = sf::st_union(geometry)) %>% ungroup()

I read on other blogs from you that st_cast might be a solution, but I dont know how to apply it in my code. Any hints would be totally appreciated. Pd. This data set comes from a gdbtable file

etiennebr commented 3 years ago

Thanks for reporting this issue @Randall-HYLA. I suggest opening a new issue and link to this (closed) issue. Also, please post a reproducible minimal example.

The error message points toward at an invalid geometry (WKB type 12) in group 438 id_no = "22709319". Maybe this is the data you could share with your reprex. Since the error states that the bounding box has potentially an invalid value range for longlat data, I suspect you might have data that wraps around the ±180° line, or maybe not long+lat data.

Randall-HYLA commented 3 years ago

Thank you for your suggestions. @etiennebr, I was able to solve the problem at the end. The problem had to do with the dataset. I had Multisurface geometries and Multipolygon geometries on the same sf object. I ended up converting all the multisurface geometries to Multipolygon with st_cast() and them validating those. The validation part is TOTALLY necessary (topological correction), and depends on the version of sf and the linked GDAL and GEOS software. Here is the code I used to fix it before running the union:

bird_sf <- st_cast(bird_sf, "MULTIPOLYGON") bird_sf <- st_make_valid(bird_sf)

Pd: the validation part takes some time.