ropensci / software-review-meta

For organizing projects related to rOpenSci Software Peer Review

Recommend sf over sp? #47

Closed maelle closed 2 years ago

maelle commented 6 years ago

@jhollist @ateucher @Robinlovelace @mpadge cc @sckott

Dear geospatial+rOpenSci R people I can think of right now, do you think it would make sense for rOpenSci to recommend sf over sp in the onboarding guidelines?

Here are our current package recommendations, e.g. xml2 over XML.

Robinlovelace commented 6 years ago

My 10 cents: yes!

jhollist commented 6 years ago

I agree with @Robinlovelace. That said, some flexibility should be allowed given the state of flux in the spatial ecosystem, especially since raster doesn't yet fully support sf objects out of the box and requires coercing to sp.
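A minimal sketch of the kind of coercion involved (all object names here are illustrative): at the time of writing, raster functions such as raster::extract() generally expect sp classes, so an sf object is converted first via sf's registered coercion method.

```r
library(sf)
library(raster)

# illustrative objects: a small raster layer and two sf points
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r) <- runif(ncell(r))
pts_sf <- st_as_sf(data.frame(x = c(2, 7), y = c(3, 8)),
                   coords = c("x", "y"))

# raster predates sf, so coerce to sp (SpatialPoints) before extracting
pts_sp <- as(pts_sf, "Spatial")
vals <- extract(r, pts_sp)
```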

mpadge commented 6 years ago

My :+1: to @jhollist

recommend_sf <- !("raster" %in% pkg.deps)

if (!recommend_sf) {
  # be careful
}

That condition could be modified by extending @noamross's fasterize?
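As a rough sketch of how such a condition might be checked programmatically (pkg.deps above is pseudocode; one real way to get a package's declared dependencies is tools::package_dependencies(), which needs a CRAN package database — "fasterize" below is just an example name from this thread):

```r
# fetch the CRAN package database, then look up one package's dependencies
db <- available.packages(repos = "https://cloud.r-project.org")
pkg.deps <- tools::package_dependencies("fasterize", db = db,
                                        which = c("Depends", "Imports"))[[1]]

# the condition from above: recommend sf only if raster is not a dependency
recommend_sf <- !("raster" %in% pkg.deps)
```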

maelle commented 6 years ago

thanks for the feedback everyone! I forgot to tag @mdsumner 🤦‍♀️ @mdsumner, what's your opinion?

ateucher commented 6 years ago

I also agree it's best at this point to recommend sf, with the same caveats as @jhollist. There is a lot of the R world that is still firmly in the sp/rgdal/rgeos ecosystem though. Lately I've been making sf the "first-class citizen" and including sp in Suggests, supporting it by way of sf's pretty great methods for coercion to/from.
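The to/from coercion described here is a one-liner in each direction; a minimal sketch (the helper names are made up for illustration), assuming sp sits in Suggests so its availability is checked at run time:

```r
library(sf)

# sf -> sp, via sf's coercion methods; sp is only Suggested, so check first
to_sp <- function(x) {
  if (!requireNamespace("sp", quietly = TRUE))
    stop("the sp package is required for this coercion")
  as(x, "Spatial")
}

# sp -> sf
from_sp <- function(x) {
  st_as_sf(x)
}
```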

mpadge commented 6 years ago

To complicate matters just a bit, there is the beastly nature of sf to consider. To borrow some code from @hrbrmstr:

library(fs)
library(tidyverse)

installed.packages() %>%
  as_data_frame() %>%
  mutate(pkg_dir = sprintf("%s/%s", LibPath, Package)) %>%
  select(Package, pkg_dir) %>%
  filter(Package %in% c("sf", "sp")) %>%
  mutate(pkg_dir_size = map_dbl(pkg_dir, ~ {
    fs::dir_info(.x, all = TRUE, recursive = TRUE) %>%
      summarise(tot_dir_size = sum(size)) %>%
      pull(tot_dir_size)
  }) %>%
    format(big.mark = ",")) %>%
  select(Package, pkg_dir_size)
#  Package  pkg_dir_size
#  sf       18,902,203
#  sp       2,093,160

Recommending a package that is almost 10 times the size of its predecessor, that represents one of the largest installs in anybody's site-library, and that by my reckoning totals close to 20GB (by @hrbrmstr's reckoning, more like 113GB), would suggest something like:

if (recommend_sf) {
  # be careful
}

hrbrmstr commented 6 years ago

ohai!

(MB not GB :-)

I was shocked when I saw the results of the lib dir sizes.

There's also this to consider:

[image: sp dependency tree]

vs

[image: sf dependency tree]

sf brings a ton of heavy hitters along for the ride.

maelle commented 6 years ago

Thanks all! A lot of food for thought! 🤔 💥

Robinlovelace commented 6 years ago

Cannot compete with Bob's viz (awesome stuff - shows how interdependent we're all becoming which I recall you saying is a tendency to resist!) but just thought I'd flag that many of those pkgs will already be installed on many useRs computers, especially magrittr + Rcpp given their popularity. Not sure about e1071 but I'd argue that uptake of units is no bad thing (notwithstanding its potentially painful dependency on the C library udunits2):

[image: daily CRAN downloads of Rcpp, MASS, e1071, units and magrittr]

Code: https://github.com/Robinlovelace/geocompr/blob/master/code/cranlogs.R but starting out with this:

pkgs = c("Rcpp", "MASS", "e1071", "units", "magrittr")
dd = cranlogs::cran_downloads(packages = pkgs,
                              from = "2013-01-01", to = Sys.Date())

mdsumner commented 6 years ago

I don't think it's possible to make a blanket recommendation for onboarding; sf is really not for package devs, it's just not that kind of tool.

cboettig commented 6 years ago

Just wondering if it is worth re-framing the question somewhat: is there some approach that developers of packages working with spatial data (e.g., accessing a data repository or API that distributes spatial data) can take to make their packages more sf-ready?

Or more generally, would it make sense to have some recommendations about spatial data standards in rOpenSci not necessarily tied to any particular package ecosystem? E.g., for starters, "packages working with spatial data should declare the projection used". Would it make sense to suggest something like geojson as a preferred format for vector spatial data (unless the data needs one of those unusual simple features not part of the geojson standard)? Or is that unnecessarily prescriptive (i.e., are users better off always firing up the sf/libgdal swiss-army knife and enjoying agnosticism to file type)?

noamross commented 6 years ago

A few points:

1) I was fascinated when @mdsumner submitted a PR to remove sf dependency from a package of mine that does nothing but transform sf objects. It made me realize that the sf package is more like a reference implementation of this object type, along with a number of tools for dealing with it, and other geospatial package developers are building and modifying sf objects with custom code.

2) @jhollist framed the issue well in his NLMR review, where he recommended authors have a plan to "future-proof" the package to deal with the changing geospatial ecosystem. I think we should include some guidance to authors and reviewers that they consider how their code will stay compatible with volatile parts of the R ecosystem. This would be a highly general top-level concept (along with performance, documentation, etc.), but then we can provide more specific guidance in areas such as geospatial, text, etc.

3) @cboettig I do think those recommendations are a little too prescriptive, but I do think that recommendations for future-proofing/compatibility should consider file format choice and sufficient storage of metadata to deal with changes in the future.
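Relatedly to point 1 above, the object type in question is simple enough to show in a short sketch: an sf object is an ordinary data.frame carrying a geometry list-column of class sfc, with the column's name recorded in an attribute. A hedged illustration, using sf itself to build and inspect one:

```r
library(sf)

# a one-point sf object built the ordinary way
pt <- st_sf(id = 1, geometry = st_sfc(st_point(c(1, 2))))

class(pt)              # an sf object is also a plain data.frame
attr(pt, "sf_column")  # the name of the geometry list-column
```

This is why other geospatial packages can build or modify such objects with custom code: the classes and attributes can be set without calling into sf at all.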

mpadge commented 2 years ago

For the sake of completeness, here is a link to an issue thread where @edzer states:

the maintainer of sp has made it clear that new software should build upon sf, not sp. There is no longer a reason to continue building new software using sp classes.

edzer commented 2 years ago

Now that @rsbivand has retired, rgdal, maptools and rgeos will retire as well, by the end of 2023. For that to happen, sp will have to depend on sf: right now sp uses rgdal and rgeos for important tasks (validation of CRS, hole-in-polygon detection), although rgdal and rgeos are "just" in Suggests:. sp is not the way forward.

rsbivand commented 2 years ago

Any role for a maintenance-mode-only sp going forward is to keep packages that use spatial vector representations and are unwilling to migrate to sf directly from breaking. For similarly placed packages using S4 classes inheriting from sp gridded classes, the picture is more complex: stars and terra are arguably both useful alternatives and should both be supported, but with no recommended preference (use cases should decide for themselves). So for vector data, yes, only recommend sf; for raster data, probably no call should be made beyond pointing out that raster's use of rgdal to access GDAL reaches end-of-life by the end of 2023 at the latest.

mpadge commented 2 years ago

Thanks @edzer and @rsbivand for useful clarification! We'll update our recommended scaffolding section of our Development Guide to restate what you've clarified here. Much appreciated, and all the best to you @rsbivand for your days and years ahead! The entire R community will long remain deeply indebted to your contributions!

rsbivand commented 2 years ago

Thanks @mpadge ! The proposed scaffolding text looks well-balanced.

mpadge commented 2 years ago

Pinging our Associate Editor for Spatial Software, @Paula-Moraga, so you're aware of these updates :smile: Our automated checking system will notify us whenever these (potentially) obsolete packages are used, so you'll see any such cases straight away.

maelle commented 2 years ago

Resolved by @mpadge's PR to the dev guide https://github.com/ropensci/dev_guide/pull/362

Thanks to all those who chimed in!