Closed huizezhang-sherry closed 2 years ago
I'm wondering if the sf team would like to take tidyverse's recommendation to implement through dplyr_row_slice() and dplyr_col_modify()?
If that would solve your problem and not cause any new ones, I'd be happy to accept a PR!
update: pull request submitted: #1963
With this PR I still get
library(sf)
# Linking to GEOS 3.10.2, GDAL 3.4.3, PROJ 8.2.0; sf_use_s2() is TRUE
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
demo(nc, ask = FALSE, echo = FALSE)
class(nc) <- c("myclass", class(nc))
class(nc)
# [1] "myclass" "sf" "data.frame"
nc |> filter(NAME == "Ashe") |> class()
# [1] "sf" "data.frame"
Ah, my bad:
library(sf)
# Linking to GEOS 3.10.2, GDAL 3.4.3, PROJ 8.2.0; sf_use_s2() is TRUE
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
demo(nc, ask = FALSE, echo = FALSE)
class(nc) <- c("myclass", class(nc))
class(nc)
# [1] "myclass" "sf" "data.frame"
dplyr_row_slice.myclass <- function(data, i, ...){
out <- vctrs::vec_slice(data, i)
dplyr_reconstruct(out, data)
}
dplyr_reconstruct.myclass <- function(data, template){
class(data) <- class(template)
data
}
nc |> filter(NAME == "Ashe") |> class()
# [1] "myclass" "sf" "data.frame"
@henningte could I ask you to review this PR?
@edzer I'll have a look at it tomorrow.
Before having a closer look into the code, I have the following questions and issues:
dplyr::dplyr_row_slice()
and dplyr::dplyr_col_modify()
are meant to replace all current implementations of dplyr
methods in the sf
package, right? dplyr
methods currently implemented separately in sf
are replaced by the pull request. I understand from the PR comment that implementing this might get complicated for select.sf()
and transmute.sf()
and all dplyr
methods depending on this. However, it should be possible for *join()
methods, right?
Correct me if I'm wrong! My concern is only that it may get confusing which dplyr
method is implemented the old way and which the new way.dplyr
package. Is there any sf
policy whether to implement experimental features?As I understand it it will not replace the column selectors (select, mutate, ...) because sf objects break the tidy contract (or assumption) that obj[1] always has length 1.
Yes, you're right! Except that the PR does replace mutate.sf()
since it does not interfere with the sticky behavior of the active geometry column.
No, the replacement can happen if two conditions are fulfilled:
a) the dplyr function itself is implemented with dplyr_row_slice()
or dplyr_col_modify()
b) the behavior of the dplyr method on the new class is basically the same
The behavior of filter()
, slice()
, arrange()
, and mutate()
on an sf
object are no different than their behaviours on a data frame, hence can be replaced with the two proposed verbs. select
and transmute
have different behaviors (they always retains the geometry column), hence need to have its customised select.sf()
and transmute.sf()
.
inner_join(), left_join(), right_join(), and full_join() coerces x to a tibble, modify the rows, then uses dplyr_reconstruct() to convert back to the same type as x.
"My concern is only that it may get confusing which dplyr method is implemented the old way and which the new way."
Re: The way I see it is that this makes it clearer whether the dplyr verb has a special behavior on the class.
dplyr_row_slice()
or dplyr_col_modify
. method.class()
would signal the method works differently on the class. Definitely, some documentation on this would be helpful here.
Thanks @huizezhang-sherry , this helped me to better understand the purpose of the new functions and where they do not apply. In this case, the PR is as complete as it could possibly be.
The only open question is then whether sf
should built on an experimental feature of another package. @edzer , what do you think about this?
Thanks a lot to both - now merged!
Your PR creates a reverse dependency problem in package himach
, would you mind taking a look at this @huizezhang-sherry?
Package: himach
Check: tests
New result: ERROR
Running ‘spelling.R’ [0s/0s]
Running ‘testthat.R’ [32s/32s]
Running the tests in ‘tests/testthat.R’ failed.
Complete output:
> library(testthat)
> library(himach)
>
> test_check("himach")
[ FAIL 2 | WARN 0 | SKIP 1 | PASS 66 ]
══ Skipped tests ═══════════════════════════════════════════════════════════════
• On CRAN (1)
══ Failed tests ════════════════════════════════════════════════════════════════
── Error (test_routes.R:94:3): find_route works with subsonic option ───────────
Error in `st_sf(x, ..., agr = agr, sf_column_name = sf_column_name)`: sf_column_name %in% all_sfc_names is not TRUE
Backtrace:
▆
1. ├─utils::capture.output(...) at test_routes.R:94:2
2. │ └─base::withVisible(...elt(i))
3. ├─... %>% summarise_routes_for_test()
4. ├─himach (local) summarise_routes_for_test(.)
5. │ └─... %>% ... at test_routes.R:17:2
6. ├─dplyr::mutate(...)
7. ├─dplyr::summarise(...)
8. ├─dplyr::group_by(., fullRouteID)
9. ├─sf::st_drop_geometry(.)
10. ├─dplyr::mutate(., envelope = st_area(envelope))
11. ├─dplyr::mutate(., across(c(gc, crow), st_length), gc_length = gc)
12. └─dplyr:::mutate.data.frame(...)
13. ├─dplyr::dplyr_col_modify(.data, cols)
14. ├─sf:::dplyr_col_modify.sf(.data, cols)
15. ├─base::NextMethod()
16. └─dplyr:::dplyr_col_modify.data.frame(.data, cols)
17. └─dplyr::dplyr_reconstruct(out, data)
18. ├─dplyr:::dplyr_reconstruct_dispatch(data, template)
19. └─sf:::dplyr_reconstruct.sf(data, template)
20. ├─sf::st_as_sf(data, sf_column_name = sfc_name, crs = crs, precision = prec)
21. └─sf:::st_as_sf.data.frame(...)
22. └─sf::st_sf(x, ..., agr = agr, sf_column_name = sf_column_name)
23. └─base::stopifnot(sf_column_name %in% all_sfc_names)
── Error (test_routes.R:134:3): Find multiple routes for multiple aircraft ─────
Error in `st_sf(x, ..., agr = agr, sf_column_name = sf_column_name)`: sf_column_name %in% all_sfc_names is not TRUE
Backtrace:
▆
1. ├─utils::capture.output(...) at test_routes.R:134:2
2. │ └─base::withVisible(...elt(i))
3. ├─... %>% summarise_routes_for_test()
4. ├─himach (local) summarise_routes_for_test(.)
5. │ └─... %>% ... at test_routes.R:17:2
6. ├─dplyr::mutate(...)
7. ├─dplyr::summarise(...)
8. ├─dplyr::group_by(., fullRouteID)
9. ├─sf::st_drop_geometry(.)
10. ├─dplyr::mutate(., envelope = st_area(envelope))
11. ├─dplyr::mutate(., across(c(gc, crow), st_length), gc_length = gc)
12. └─dplyr:::mutate.data.frame(...)
13. ├─dplyr::dplyr_col_modify(.data, cols)
14. ├─sf:::dplyr_col_modify.sf(.data, cols)
15. ├─base::NextMethod()
16. └─dplyr:::dplyr_col_modify.data.frame(.data, cols)
17. └─dplyr::dplyr_reconstruct(out, data)
18. ├─dplyr:::dplyr_reconstruct_dispatch(data, template)
19. └─sf:::dplyr_reconstruct.sf(data, template)
20. ├─sf::st_as_sf(data, sf_column_name = sfc_name, crs = crs, precision = prec)
21. └─sf:::st_as_sf.data.frame(...)
22. └─sf::st_sf(x, ..., agr = agr, sf_column_name = sf_column_name)
23. └─base::stopifnot(sf_column_name %in% all_sfc_names)
[ FAIL 2 | WARN 0 | SKIP 1 | PASS 66 ]
Error: Test failures
Execution halted
Oh I think this bit of codes inside summarise_routes_for_test()
in their test_routes.R is causing some problems here:
r %>%
# wkt is machine-dependent so just extract length/area
mutate(across(c(gc, crow), st_length),
gc_length = gc) %>%
mutate(envelope = st_area(envelope)) %>%
sf::st_drop_geometry() %>%
...
In both errors, r
is an sf
object with gc
as the sf_column
. The mutate step transforms the gc
column to its length and makes it no longer an sfc
column. The old code stores the output still as an sf
object and produces an error when printed:
Error in st_geometry.sf(x) :
attr(obj, "sf_column") does not point to a geometry column.
Did you rename it, without setting st_geometry(obj) <- "newname"?
The old code went smooth because the code does not directly print out the object and the later line sf::st_drop_geometry()
happens to fix this sf_column
problem.
My solution here would be:
1) A new line in dplyr_reconstruct.sf()
is needed: if (!inherits(sfc_name, "sfc")) return(data)
, so that if the sf_column is no longer an sfc
, the data can be returned as a tibble (even if it can't be constructed as an sf) .
2) They need to modify the test codes, potentially other places as well, by either giving the mutated length a different name or dropping the line sf::st_drop_geometry()
since as per the current code, the object passed by mutate will not be an sf
object.
Thanks - I followed 1, to restore old behaviour.
@huizezhang-sherry could you pls check that https://github.com/r-spatial/sf/commit/2df4b5cd859d0e5f791284474fd3ec834e46682f still does what you had in mind? It seems to fix the redist revdep problem.
yea, I'm happy with that.
Just went through the rest in the redist revdep problem, I'm thinking the test on rowwise_df
can be better with using the filter(AREA > .1)
example. The current slice(1)
will give the first row of each group, which is basically everything in a rowwise_df
. But, that's trivial!
Dear All, I really appreciate these changes and they seem to solve quite a few issue I do however note that mutate still seems to change the order of classes when inheriting sf (adopted from the example above):
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
demo(nc, ask = FALSE, echo = FALSE)
class(nc) <- c("myclass", class(nc))
class(nc)
#> [1] "myclass" "sf" "data.frame"
dplyr_row_slice.myclass <- function(data, i, ...){
out <- vctrs::vec_slice(data, i)
dplyr_reconstruct(out, data)
}
dplyr_reconstruct.myclass <- function(data, template){
class(data) <- class(template)
data
}
nc |> mutate(NAME2 = NAME) |> class()
#> [1] "sf" "myclass" "data.frame"
nc |> filter(NAME == "Ashe") |> class()
#> [1] "myclass" "sf" "data.frame"
This would mean packages inheriting sf would still need to implement custom mutate
methods. Would there be a way to prevent this?
Created on 2022-07-10 by the reprex package (v2.0.1)
Thanks; it seems dplyr_reconstruct.myclass
is called before dplyr_reconstruct.sf
. Maybe the latter should strip all class labels before label sf
, and add them back later before sf
before returning?
I somehow had expected the calling order would be reversed, when thinking about reconstructing a class hierarchy.
Thanks for having a look. It also seems that if you implement dplyr_row_slice.myclass
with NextMethod
the dplyr_reconstruct.myclass
is called twice:
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1; sf_use_s2() is TRUE
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
demo(nc, ask = FALSE, echo = FALSE)
class(nc) <- c("myclass", class(nc))
class(nc)
#> [1] "myclass" "sf" "data.frame"
dplyr_row_slice.myclass <- function(data, i, ...){
message("called 1")
out <- NextMethod()
dplyr_reconstruct(out, data)
}
dplyr_reconstruct.myclass <- function(data, template){
message("called 2")
class(data) <- class(template)
data
}
nc |> mutate(NAME2 = NAME) |> class()
#> called 2
#> [1] "sf" "myclass" "data.frame"
nc |> filter(NAME == "Ashe") |> class()
#> called 1
#> called 2
#> called 2
#> [1] "myclass" "sf" "data.frame"
Could this relate to this line which could result in myclass
being moved down the class order?
Hi, thanks for being interested in this discussion. I think there are two issues here:
1) mutate
is a column wise operation, so you will need to implement dplyr_col_modify.myclass()
to have myclass
take precedence over sf
.
2) I didn't show the mutate
example because implementing dplyr_col_modify.myclass()
will need to take into account the sticky behavior of select/ subset in sf
- see [.sf
. Depending on how it is implemented with myclass
, you may also need these methods ([.myclass
, [[<-.myclass
, print.myclass
, etc) for a mutate
example to work properly. This is beyond the complexity of a proof of concept, but you may have already had them when creating a new class.
We now get the following, new, revdep errors on CRAN checks:
Changes to worse in reverse depends:
Package: eks
Check: examples
New result: ERROR
Running examples in ‘eks-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: tidyst_kdr
> ### Title: Tidy and geospatial kernel density ridge estimates
> ### Aliases: tidy_kdr st_kdr
> ### Keywords: smooth
>
> ### ** Examples
>
> ## tidy density ridge estimate
>
> ## geospatial density ridge estimate
> data(wa)
> data(grevilleasf)
> hakeoides <- dplyr::filter(grevilleasf, species=="hakeoides")
> ## gridsize=c(21,21) is for illustrative purposes only
> ## remove for more complete KDR
> s1 <- st_kdr(hakeoides, gridsize=c(21,21))
>
> ## base R plot
> xlim <- c(1.2e5, 1.1e6); ylim <- c(6.1e6, 7.2e6)
> plot(wa, xlim=xlim, ylim=ylim)
> plot(sf::st_geometry(hakeoides), add=TRUE, col=3, pch=16)
> plot(s1, add=TRUE, col=6, lwd=3, alpha=0.8)
Adding missing grouping variables: `segment`
Warning in xy.coords(x, y, xlabel, ylabel, log) :
NAs introduced by coercion
Warning in min(x) : no non-missing arguments to min; returning Inf
Warning in max(x) : no non-missing arguments to max; returning -Inf
Warning in plot.window(...) : "add" is not a graphical parameter
Error in plot.window(...) : need finite 'ylim' values
Calls: plot ... NextMethod -> plot.default -> localWindow -> plot.window
Execution halted
Package: mapping
Check: R code for possible problems
New result: NOTE
Warning in fun(libname, pkgname) :
rgeos: versions of GEOS runtime 3.11.0-CAPI-1.17.0
and GEOS at installation 3.10.2-CAPI-1.16.0differ
Package: sftime
Check: examples
New result: ERROR
Running examples in ‘sftime-Ex.R’ failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: bind
> ### Title: Bind rows (features) of 'sftime' objects
> ### Aliases: bind rbind.sftime cbind.sftime
>
> ### ** Examples
>
> g1 <- st_sfc(st_point(1:2))
> x1 <- st_sftime(a = 3, geometry = g1, time = Sys.time())
>
> g2 <- st_sfc(st_point(c(4, 6)))
> x2 <- st_sftime(a = 4, geometry = g2, time = Sys.time())
>
> rbind(x1, x2) # works because both tc1 and tc2 have the same class
Spatiotemporal feature collection with 2 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 1 ymin: 2 xmax: 4 ymax: 6
CRS: NA
Time column with class: 'POSIXt'.
Ranging from 2022-07-12 16:54:09 to 2022-07-12 16:54:09.
a geometry time
1 3 POINT (1 2) 2022-07-12 16:54:09
2 4 POINT (4 6) 2022-07-12 16:54:09
>
> ## Not run:
> ##D st_time(x2) <- 1
> ##D rbind(x1, x2) # error because both tc1 and tc2 do not have the same class
> ## End(Not run)
>
> cbind(x1, x2)
Spatiotemporal feature collection with 1 feature and 3 fields
Active geometry column: geometry
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 1 ymin: 2 xmax: 1 ymax: 2
CRS: NA
Time column with class: 'POSIXt'.
Representing 2022-07-12 16:54:09.
a a.1 time.1 geometry geometry.1 time
1 3 4 2022-07-12 16:54:09 POINT (1 2) POINT (4 6) 2022-07-12 16:54:09
>
> if (require(dplyr))
+ dplyr::bind_cols(x1, x2)
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
New names:
• `a` -> `a...1`
• `geometry` -> `geometry...2`
• `time` -> `time...3`
• `a` -> `a...4`
• `geometry` -> `geometry...5`
• `time` -> `time...6`
Error in st_agr.default(x) : all(is.na(x)) is not TRUE
Calls: <Anonymous> ... print.sftime -> st_agr -> st_agr.default -> stopifnot
Execution halted
Package: sspm
Check: re-building of vignette outputs
New result: ERROR
Error(s) in re-building vignettes:
...
--- re-building ‘An_example_with_simulated_data.Rmd’ using rmarkdown
Quitting from lines 250-252 (An_example_with_simulated_data.Rmd)
Error: processing vignette ‘An_example_with_simulated_data.Rmd’ failed with diagnostics:
`.data` must be a valid <grouped_df> object.
Caused by error in `validate_grouped_df()`:
! The `groups` attribute must be a data frame.
--- failed re-building ‘An_example_with_simulated_data.Rmd’
--- re-building ‘Package_and_workflow_design.Rmd’ using rmarkdown
--- finished re-building ‘Package_and_workflow_design.Rmd’
SUMMARY: processing the following file failed:
‘An_example_with_simulated_data.Rmd’
Error: Vignette re-building failed.
Execution halted
So I think I am going to revert this change to how it was in 1.0-7, and leave this issue for someone with more time on their hands and willing to do all the revdep checks before writing a PR.
Hi @edzer,
I apologize for not being aware of the revdep checks before. I'm running these checks now and will investigate accordingly if any issue raises. Would that be okay with you?
Yes, but it will probably not make if for sf 1.0-8, which needs to get out now.
Thanks, I'm okay with that.
Hi @edzer,
I looked into the issue with the eks
package. The failure is because in the list s1
, the element sf
is not an sf
object. The class label goes missing at this line when relocate()
is called from mutate()
in st_kdr()
.
The object fhat.sf
passed into mutate()
in st_kdr()
is an sf
object with groped_df
. The grouped_df
class has a names<-.grouped_df
method but sf
doesn't, so the sf
class is missing after resetting the name.
The code below describes the same issue:
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.1, PROJ 9.0.1; sf_use_s2() is TRUE
nc = read_sf(system.file("shape/nc.shp", package="sf"))
nc_g <- nc %>% dplyr::group_by(AREA) %>% head(5)
nc_g
#> Simple feature collection with 5 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
#> Geodetic CRS: NAD27
#> # A tibble: 5 × 15
#> # Groups: AREA [5]
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alleg… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curri… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 North… 37131 37131 66 1421 9 1066
#> # … with 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> # geometry <MULTIPOLYGON [°]>
names(nc_g)[1] <- "area"
nc_g
#> # A tibble: 5 × 15
#> # Groups: area [5]
#> area PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alleg… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curri… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 North… 37131 37131 66 1421 9 1066
#> # … with 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> # geometry <MULTIPOLYGON [°]>
Created on 2022-07-14 by the reprex package (v2.0.1)
This can be fixed with adding:
#' @export
"names<-.sf" <- function(x, value) {
out <- NextMethod()
dplyr_reconstruct.sf(out, x)
}
in the sf
package and I get eks
passes the revdep check afterwards.
Great, does that also fix sftime
checking? I now reverted your original PR (and released 1.0-8 to CRAN without it), but did this manually as automatically no longer worked, so I guess you'd have to start the PR from scratch when working from branch main.
I think the problem with sftime
is actually a problem with the modified dplyr_reconstruct.sf()
in huizezhang-sherry/sf@main
.
In the following line:
the class of data
is updated with the class of template
(with only the sf
class removed from the class
attribute of template
). This means that if there is a subclass to an sf
object, the subclass remains in the returned object.
In the conflicting sftime
example, exactly this happens, i.e. the merged sf
objects are invalid because the sf
columns get renamed and therefore the modified dplyr_reconstruct.sf()
returns an object with class c("sftime", "data.frame")
which causes the error (but only upon printing).
This is not the case in the current r-spatial/sf@main
code since the class
attribute of data
is only changed to tibble
in a safe way (via dplyr::as_tibble()
) or to sf
in a safe way (via st_as_sf()
).
So the modified dplyr_reconstruct.sf()
should be rewritten to either use only safe class conversions (as is currently the case in r-spatial/sf@main
) or remove the sf
class and all subclasses when the data
is no valid sf
object any more.
Sure, I will modify the quoted line in the upcoming pull request.
From the source code, the
sf
class implements tidyverse methods by directly providing methods for each verb. This will cause an issue forsf
subclasses since thesf
class will always get prioritised over its subclass.I'm wondering if the sf team would like to take tidyverse's recommendation to implement through
dplyr_row_slice()
anddplyr_col_modify()
?Some more details:
Here, the data
nc
augmented with a classmyclass
. The functiondplyr_row_slice()
anddplyr_reconstruct()
are implemented formyclass
.Now if we run
out <- nc %>% filter(NAME == "Ashe")
, one may expectfilter()
->filter.data.frame()
->dplyr_row_slice.myclass()
. This would resultout
to have class in the order of"myclass" "sf" "data.frame"
.But this is not the case:
Created on 2022-06-13 by the reprex package (v2.0.1)