bb and GADM

mdsumner commented 6 years ago

I have these sources for the GADM data behind raster::getData:

b <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm0.rds"
gadm0 <- sprintf(b, raster::getData("ISO3")$ISO3)

gadm.rds <- bb_source(
  name="GADM maps and data in RDS format",
  id="gadm-maps-rdb",
  description="GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url="http://www.gadm.org",
  citation="http://gadm.org/about.html",
  source_url= gadm0,
  license="http://gadm.org/license.html",
  method=list("bb_handler_wget",level=1, robots_off=TRUE),
  collection_size= 0.1,
  access_function = "base::readRDS",
  data_group="Administrative")

gadm <- bb_source(
  name="GADM maps and data in ESRI Geodatabase",
  id="gadm-maps-gdb",
  description="GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url="http://www.gadm.org",
  citation="http://gadm.org/about.html",
  source_url=c("http://biogeo.ucdavis.edu/data/gadm2.8/gadm28.gdb.zip"),
  license="http://gadm.org/license.html",
  method=list("bb_handler_wget",recursive=TRUE,level=1, robots_off=TRUE),
  postprocess=list("bb_unzip"),
  collection_size= 1,
  access_function = "sf::read_sf",
  data_group="Administrative")

I can't get it to access any data from an actual directory URL, so the gdb.zip URL is hardcoded, and I construct the full (!!) list of level-0 URLs, one per ISO3 country, from the raster package's list.

Obviously this is not robust to version updates, and it does not adapt to the varying number of levels in the RDS files (apparently some go higher than 3). Are there wget tricks to make this work more generally?

raymondben commented 6 years ago

That's a bit of an edge case because the data files live on a different server to the web pages. The wget --span-hosts option helps a bit, by allowing the recursion to cross onto a different domain, but it won't solve everything in this case. I think there is another solution though, stand by ...

raymondben commented 6 years ago

For the zip file, what about:

library(rvest)
links <- read_html("http://gadm.org/download_world.html") %>% html_nodes("a")
## find links pointing to gadmNNN.gdb.zip files, take the highest number
src_url <- head(sort(Filter(function(z) grepl("gadm[[:digit:]]+\\.gdb\\.zip", z), sapply(links, html_attr, "href")), decreasing=TRUE), 1)

and use source_url=src_url in your existing bb_source def. I don't think there's a "pure" bowerbird solution to that one.
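
One caveat with the snippet above: sort() compares the URLs as strings, so a file name like gadm100.gdb.zip could sort below gadm28.gdb.zip. A minimal sketch of a numeric comparison instead, run on made-up hrefs rather than the live page (the example.org URLs are hypothetical, and it assumes that a larger number in the file name means a newer release):

```r
## hypothetical hrefs, standing in for the values scraped from the download page
hrefs <- c("http://example.org/data/gadm2.8/gadm28.gdb.zip",
           "http://example.org/data/gadm10.0/gadm100.gdb.zip",
           "http://example.org/about.html",
           NA)

## same pattern as above, with the digits captured
pat <- "gadm([[:digit:]]+)\\.gdb\\.zip$"

## keep only the links that point to a gadmNNN.gdb.zip file
cand <- hrefs[!is.na(hrefs) & grepl(pat, hrefs)]

## pull out NNN as an integer and take the link with the largest value
ver <- as.integer(sub(paste0(".*", pat), "\\1", cand))
latest <- cand[which.max(ver)]
```

Here latest could be used as source_url in the same way as src_url.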

What's the issue with RDS levels? You want *_adm0.rds, *_adm1.rds, etc? Maybe:

x <- read_html("http://gadm.org/download_country.html")
## find all non-empty options that are part of the countrySelect element
links <- Filter(nzchar, sapply(x %>% html_node("#countrySelect") %>% html_nodes("option"), html_attr, "value"))
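## str_match_all comes from stringr, which is not attached above, so call it with its namespace
do.call(rbind, lapply(stringr::str_match_all(links, "^([[:alpha:]]{3})_.*([[:digit:]])$"), function(z) z[2:3]))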

will give you all the countries and how many levels they have, then you construct the appropriate URLs from that?
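
As a rough sketch of that last step, assuming the digit is the number of levels and that levels are numbered from zero: the two countries and their level counts below are made up for illustration, and the URL template is just the gadm2.8 one from the first post with the level made variable.

```r
## hypothetical input: ISO3 code and number of levels for each country
lv <- data.frame(iso3 = c("AUS", "NZL"), n_levels = c(3L, 2L),
                 stringsAsFactors = FALSE)

## template based on the gadm2.8 RDS URLs used earlier in this thread
template <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm%d.rds"

## one URL per country and level; levels run from 0 to n_levels - 1
urls <- unlist(lapply(seq_len(nrow(lv)), function(i) {
  sprintf(template, lv$iso3[i], seq_len(lv$n_levels[i]) - 1L)
}))
```

The resulting character vector can be passed directly as source_url.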

mdsumner commented 6 years ago

Ah, ok - that's fine - thanks!

mdsumner commented 6 years ago

Just for the record, here's the final setup to get all the country files for all levels, as well as the master GDB (a bit over 2 GB in total):

library(rvest)
links <- read_html("http://gadm.org/download_world.html") %>% html_nodes("a")
## find links pointing to gadmNNN.gdb.zip files, take the highest number
gadm_src_url <- head(sort(Filter(function(z) grepl("gadm[[:digit:]]+\\.gdb\\.zip", z), sapply(links, html_attr, "href")), decreasing=TRUE), 1)
x <- read_html("http://gadm.org/download_country.html")
## find all non-empty options that are part of the countrySelect element
links <- Filter(nzchar, sapply(x %>% html_node("#countrySelect") %>% html_nodes("option"), html_attr, "value"))
gadm_rds0 <- do.call(rbind, lapply(stringr::str_match_all(links, "^([[:alpha:]]{3})_.*([[:digit:]])$"), function(z) z[2:3]))

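## the matched level counts are character strings, so convert them before using them as repeat counts
gadm_rds <- tibble::tibble(name = gadm_rds0[,1], levels = as.integer(gadm_rds0[,2])) %>% 
  ## repeat each country row once per level (row_number is qualified because dplyr is not attached)
  dplyr::slice(rep(dplyr::row_number(), levels)) %>% dplyr::group_by(name) %>% 
  ## zero-based
  dplyr::mutate(level = dplyr::row_number() - 1) %>% dplyr::ungroup() %>% dplyr::select(name, level) %>% as.matrix()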

template <- file.path(dirname(gadm_src_url), "rds/%s_adm%s.rds")
gadm_rds_src_url <- apply(gadm_rds, 1, function(ab) sprintf(template, ab[1], ab[2]))

library(bowerbird)
gadm.rds <- bb_source(
  name="GADM maps and data in RDS format",
  id="gadm-maps-rdb",
  description="GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url="http://www.gadm.org",
  citation="http://gadm.org/about.html",
  source_url= gadm_rds_src_url,
  license="http://gadm.org/license.html",
  method=list("bb_handler_wget",level=1, robots_off=TRUE),
  collection_size= 0.1,
  access_function = "base::readRDS",
  data_group="Administrative")

gadm <- bb_source(
  name="GADM maps and data in ESRI Geodatabase",
  id="gadm-maps-gdb",
  description="GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url="http://www.gadm.org",
  citation="http://gadm.org/about.html",
  source_url=gadm_src_url,
  license="http://gadm.org/license.html",
  method=list("bb_handler_wget",recursive=TRUE,level=1, robots_off=TRUE),
  postprocess=list("bb_unzip"),
  collection_size= 1,
  access_function = "sf::read_sf",
  data_group="Administrative")

my_directory <- "~/bowerbird"
cf <- bb_config(local_file_root=my_directory)

cf <- bb_add(cf, gadm) %>% bb_add(gadm.rds)
status <- bb_sync(cf,verbose=TRUE)

raymondben commented 6 years ago

The GADM site and data format have changed, so for anyone visiting this issue now, an updated version of this might look like:

library(bowerbird)
gadm <- bb_source(
  name = "GADM maps and data in ESRI Geodatabase",
  id = "gadm-maps-gdb",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = "https://gadm.org/download_world.html",
  license = "http://gadm.org/license.html",
  method = list("bb_handler_rget", level = 1, accept_download = "gpkg\\.zip$"),
  comment = "This will download the data as a single database as well as a version with six separate layers (one for each level of subdivision/aggregation). Adjust the 'accept_download' parameter if you only want one of these",
  postprocess = list("bb_unzip"),
  collection_size = 7.5,
  access_function = "sf::read_sf",
  data_group = "Administrative")

cf <- bb_config("~/temp/data/bbtest") %>% bb_add(gadm)

## don't use dry_run = TRUE if you are doing this for real!
bb_sync(cf, dry_run = TRUE, verbose = TRUE)

Which gives:

Thu Jul 26 17:10:50 2018
Synchronizing dataset: GADM maps and data in ESRI Geodatabase
Source URL https://gadm.org/download_world.html
--------------------------------------------------------------------------------------------

 this dataset path is: ~/temp/data/bbtest/gadm.org
 building file list ... done.
 visiting https://gadm.org/download_world.html ... 
  |====================================================================================================================================| 100%
No encoding supplied: defaulting to UTF-8.

 done.
 dry_run is TRUE, bb_rget is not downloading the following files:
 https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_gpkg.zip
 https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip

Thu Jul 26 17:10:52 2018 dataset synchronization complete: GADM maps and data in ESRI Geodatabase
# A tibble: 1 x 5
  name                                   id            source_url                           status files           
  <chr>                                  <chr>         <chr>                                <lgl>  <list>          
1 GADM maps and data in ESRI Geodatabase gadm-maps-gdb https://gadm.org/download_world.html TRUE   <tibble [2 × 3]>

And the files will be in ~/temp/data/bbtest/biogeo.ucdavis.edu/data/.
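
If only one of the two files is wanted, a narrower accept_download pattern can be checked against the URLs from the dry run before syncing. This is a sketch only: it assumes accept_download is matched as a regular expression against the candidate URLs (as the "gpkg\\.zip$" value above suggests), and the variable names are made up for illustration.

```r
## the two URLs reported by the dry run above
urls <- c("https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_gpkg.zip",
          "https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip")

## candidate patterns (hypothetical names): single database vs one layer per level
single_db <- "gadm36_gpkg\\.zip$"
by_level <- "gadm36_levels_gpkg\\.zip$"

## check which URLs each pattern would keep
keep_single <- grepl(single_db, urls)  # TRUE FALSE
keep_levels <- grepl(by_level, urls)   # FALSE TRUE
```

Either pattern could then replace "gpkg\\.zip$" in the method list, with collection_size adjusted to suit.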

ropensci / bowerbird

bb and GADM #19