mdsumner closed 6 years ago
That's a bit of an edge case because the data files live on a different server from the web pages. The wget --span-hosts option helps a bit by allowing the recursion to cross onto a different domain, but it won't solve everything in this case. I think there is another solution though, stand by ...
For the zip file, what about:
library(rvest)
links <- read_html("http://gadm.org/download_world.html") %>% html_nodes("a")
## find links pointing to gadmNNN.gdb.zip files, take the highest number
src_url <- head(sort(Filter(function(z) grepl("gadm[[:digit:]]+\\.gdb\\.zip", z), sapply(links, html_attr, "href")), decreasing=TRUE), 1)
and use source_url=src_url in your existing bb_source def. I don't think there's a "pure" bowerbird solution to that one.
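One caveat with "take the highest number": sort() compares the links as strings, so "gadm9" would sort after "gadm10". A minimal sketch of a numeric comparison instead, using only base R. The hrefs here are made-up placeholders on example.org, not the real links on the GADM page:

```r
## hypothetical hrefs standing in for the scraped links (assumption, not real GADM URLs)
hrefs <- c("about.html",
           "http://example.org/data/gadm9.gdb.zip",
           "http://example.org/data/gadm10.gdb.zip",
           "license.html")
## keep only the links pointing to gadmNNN.gdb.zip files
zips <- grep("gadm[[:digit:]]+\\.gdb\\.zip", hrefs, value = TRUE)
## pull out NNN and compare it as a number rather than as a string
ver <- as.integer(sub(".*gadm([[:digit:]]+)\\.gdb\\.zip.*", "\\1", zips))
src_url <- zips[which.max(ver)]
```

With the string sort, the gadm9 link would have been picked here. If the version numbers on the site always have the same number of digits, the original one-liner gives the same answer.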
What's the issue with RDS levels? You want *_adm0.rds, *_adm1.rds, etc?
Maybe:
x <- read_html("http://gadm.org/download_country.html")
## find all non-empty options that are part of the countrySelect element
links <- Filter(nzchar, sapply(x %>% html_node("#countrySelect") %>% html_nodes("option"), html_attr, "value"))
do.call(rbind, lapply(stringr::str_match_all(links, "^([[:alpha:]]{3})_.*([[:digit:]])$"), function(z) z[2:3]))
will give you all the countries and how many levels they have, then you construct the appropriate URLs from that?
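A rough sketch of that URL-construction step, in base R only. The option values and the base URL are placeholders I've made up for illustration (the real ones come from the page and from wherever the RDS files are hosted), and it assumes a count of N means levels 0 to N-1:

```r
## hypothetical option values, in the shape the regular expression above expects
links <- c("AFG_Afghanistan_3", "ALB_Albania_4", "ATA_Antarctica_1")
## base-R equivalent of the str_match_all step: columns are ISO3 code and level count
parts <- regmatches(links, regexec("^([[:alpha:]]{3})_.*([[:digit:]])$", links))
m <- do.call(rbind, lapply(parts, function(z) z[2:3]))
iso3 <- m[, 1]
n_levels <- as.integer(m[, 2])
## one entry per country/level pair, with zero-based levels
country <- rep(iso3, times = n_levels)
level <- sequence(n_levels) - 1L
## placeholder base URL (assumption): substitute the real data directory
base_url <- "http://example.org/gadm/rds"
rds_urls <- sprintf("%s/%s_adm%d.rds", base_url, country, level)
```

The resulting character vector can be passed as source_url to bb_source.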
Ah, ok - that's fine - thanks!
Just for the record, here's the final setup to get all the country files for all levels, as well as the master GDB (a bit over 2 GB in total):
library(rvest)
links <- read_html("http://gadm.org/download_world.html") %>% html_nodes("a")
## find links pointing to gadmNNN.gdb.zip files, take the highest number
gadm_src_url <- head(sort(Filter(function(z) grepl("gadm[[:digit:]]+\\.gdb\\.zip", z), sapply(links, html_attr, "href")), decreasing=TRUE), 1)
x <- read_html("http://gadm.org/download_country.html")
## find all non-empty options that are part of the countrySelect element
links <- Filter(nzchar, sapply(x %>% html_node("#countrySelect") %>% html_nodes("option"), html_attr, "value"))
gadm_rds0 <- do.call(rbind, lapply(stringr::str_match_all(links, "^([[:alpha:]]{3})_.*([[:digit:]])$"), function(z) z[2:3]))
## levels is parsed as character, so convert it before using it as a repeat count
gadm_rds <- tibble::tibble(name = gadm_rds0[,1], levels = as.integer(gadm_rds0[,2])) %>%
  dplyr::slice(rep(dplyr::row_number(), levels)) %>% dplyr::group_by(name) %>%
  ## zero-based
  dplyr::mutate(level = dplyr::row_number() - 1) %>% dplyr::ungroup() %>% dplyr::select(name, level) %>% as.matrix()
template <- file.path(dirname(gadm_src_url), "rds/%s_adm%s.rds")
gadm_rds_src_url <- apply(gadm_rds, 1, function(ab) sprintf(template, ab[1], ab[2]))
library(bowerbird)
gadm.rds <- bb_source(
name="GADM maps and data in RDS format",
id="gadm-maps-rdb",
description="GADM provides maps and spatial data for all countries and their sub-divisions.",
doc_url="http://www.gadm.org",
citation="http://gadm.org/about.html",
source_url= gadm_rds_src_url,
license="http://gadm.org/license.html",
method=list("bb_handler_wget",level=1, robots_off=TRUE),
collection_size= 0.1,
access_function = "base::readRDS",
data_group="Administrative")
gadm <- bb_source(
name="GADM maps and data in ESRI Geodatabase",
id="gadm-maps-gdb",
description="GADM provides maps and spatial data for all countries and their sub-divisions.",
doc_url="http://www.gadm.org",
citation="http://gadm.org/about.html",
source_url=gadm_src_url,
license="http://gadm.org/license.html",
method=list("bb_handler_wget",recursive=TRUE,level=1, robots_off=TRUE),
postprocess=list("bb_unzip"),
collection_size= 1,
access_function = "sf::read_sf",
data_group="Administrative")
my_directory <- "~/bowerbird"
cf <- bb_config(local_file_root=my_directory)
cf <- bb_add(cf, gadm) %>% bb_add(gadm.rds)
status <- bb_sync(cf,verbose=TRUE)
The GADM site and data format have changed, so for anyone visiting this issue now, an updated version of this might look like:
library(bowerbird)
gadm <- bb_source(
name = "GADM maps and data in ESRI Geodatabase",
id = "gadm-maps-gdb",
description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
doc_url = "http://www.gadm.org",
citation = "http://gadm.org/about.html",
source_url = "https://gadm.org/download_world.html",
license = "http://gadm.org/license.html",
method = list("bb_handler_rget", level = 1, accept_download = "gpkg\\.zip$"),
comment = "This will download the data as a single database as well as a version with six separate layers (one for each level of subdivision/aggregation). Adjust the 'accept_download' parameter if you only want one of these",
postprocess = list("bb_unzip"),
collection_size = 7.5,
access_function = "sf::read_sf",
data_group = "Administrative")
cf <- bb_config("~/temp/data/bbtest") %>% bb_add(gadm)
## don't use dry_run = TRUE if you are doing this for real!
bb_sync(cf, dry_run = TRUE, verbose = TRUE)
Which gives:
Thu Jul 26 17:10:50 2018
Synchronizing dataset: GADM maps and data in ESRI Geodatabase
Source URL https://gadm.org/download_world.html
--------------------------------------------------------------------------------------------
this dataset path is: ~/temp/data/bbtest/gadm.org
building file list ... done.
visiting https://gadm.org/download_world.html ...
|====================================================================================================================================| 100%
No encoding supplied: defaulting to UTF-8.
done.
dry_run is TRUE, bb_rget is not downloading the following files:
https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_gpkg.zip
https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip
Thu Jul 26 17:10:52 2018 dataset synchronization complete: GADM maps and data in ESRI Geodatabase
# A tibble: 1 x 5
name id source_url status files
<chr> <chr> <chr> <lgl> <list>
1 GADM maps and data in ESRI Geodatabase gadm-maps-gdb https://gadm.org/download_world.html TRUE <tibble [2 × 3]>
And the files will be in ~/temp/data/bbtest/biogeo.ucdavis.edu/data/.
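To see what adjusting 'accept_download' would select, you can try candidate patterns against the two URLs from the dry run. This is only a sketch: it assumes the pattern is matched against the full URL as a regular expression, in the way grepl does, and the two narrower patterns are my own suggestions rather than anything from the bowerbird docs:

```r
## the two file URLs reported by the dry run above
files <- c("https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_gpkg.zip",
           "https://biogeo.ucdavis.edu/data/gadm3.6/gadm36_levels_gpkg.zip")
## the pattern used in the source definition: matches both files
both <- grepl("gpkg\\.zip$", files)
## only the version with separate layers per level
levels_only <- grepl("levels_gpkg\\.zip$", files)
## only the single-database version
single_only <- grepl("gadm[[:digit:]]+_gpkg\\.zip$", files)
```

Running it again with dry_run = TRUE after changing the pattern is a cheap way to confirm which files would actually be fetched.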
I have these sources for the GADM data behind raster::getData.
I can't get it to access any data from an actual directory URL, and so the gdb.zip is hardcoded and I construct the full (!!) list of URLs available for the level 0 for each ISO3 country from the raster package list.
Obviously this is not robust to version updates, and is not adaptable to varying levels in the RDS (apparently some are higher than 3). Are there wget tricks to make this work more generally?