Closed mdsumner closed 6 years ago
Hrumph, you are right, that fragment doesn't work. This is syntactically OK:
mysrc <- mysrc %>%
mutate(method=list(c(method[[1]],list(accept_regex="/2017/"))))
but is appallingly ugly code, and the actual intent (accepting only 2017 files) doesn't work. It seems to ignore the regex for reasons I don't understand.
Your example works but requires re-specifying the existing method components. I was hoping the example would just show a modification/addition of a method flag, without needing to re-specify the entire method entry. This non-dplyr way would work:
mysrc$method[[1]]$accept_regex <- "whatever"
There must be a less-ugly way to do it via mutate tho.
I think the power of mutate is not needed though, since this would always be on a single source? At the moment I favour the non-dplyr way :)
What if we ensured all list-cols were systematically structured, so normalize many properties to name,value in long form?
Needs a generic handler for each list-col:
bb_handler_method <- function(name, ...) {
properties <- list(...)
propnames <- names(properties)
lens <- lengths(properties)
propnames_rep <- rep(propnames, lens)
stopifnot(nchar(all(names(propnames))) > 0)
structure(tibble::tibble(method_property = c("name", propnames_rep),
method_value = c(name, unlist(properties))),
class = c("bb_handler_df", "tbl_df", "tbl", "data.frame"))
}
And then that is wrapped in a list
library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
mutate(method= list(bb_handler_method("bb_handler_wget", level = 3, accept_regex = "/2017/")))
And then sources can be expanded out using tidyr unnest
my_source %>% dplyr::select(name, method) %>% tidyr::unnest()
# A tibble: 3 x 3
# name method_property method_value
# <chr> <chr> <chr>
#1 CMEMS global gridded SSH near-real-time name bb_handler_wget
#2 CMEMS global gridded SSH near-real-time level 3
#3 CMEMS global gridded SSH near-real-time accept_regex /2017/
It means that all of the list-cols need this mechanism, and I don't think unnest can work with two disparate columns at once - so the select() is crucial.
It means that properties have to be character - no raw functions or numeric - but maybe that's ok?
Hmm, it doesn't solve the "append" aspect, and unnest definitely cannot mix disparate list-cols
Here I added another handler for multiple sources, and use a different number of those compared to the method values. Dunno :)
bb_handler_source <- function(srcs) {
tibble::tibble(source_url = srcs)
}
my_source <- my_source %>%
mutate(source_url = bb_handler_source(c("https://abc.com", "ftp:/xzy.org", "ecwp://123.a")))
mutate(method= list(bb_handler_method("bb_handler_wget", level = 3, accept_regex = c("/2017/", "/2018"))))
my_source %>% dplyr::select(name, method) %>% tidyr::unnest()
my_source %>% dplyr::select(name, source_url) %>% tidyr::unnest()
Returning to an earlier point, I think the advantage of dplyr/mutate here is that it gives a nicer way to specify my personal selection of sources, some with modifications for my own requirements:
cf <- cf %>% bb_add(sources("choice1") %>%
bb_add(sources("choice2") %>% mutate(whatever=something)) %>%
etc
which can be done without mutate but just a bit messier. This style of syntax just seems more appealing to me.
I'm not keen to make further changes to the actual data structures at this point, but given the fairly awful nature of dealing with list-cols here a bb_source_modify
helper function might be very handy? Perhaps along the lines of your bb_handler_method
:
my_source %>% bb_source_modify(user="me",method=list(level=5,accept="blah"))
which would overwrite the existing method level
and accept
entries if they exist, or create them if not, and add/modify the user entry.
That sounds good, I'm definitely not pitching for major change!
See https://github.com/AustralianAntarcticDivision/bowerbird/commit/909566a1152248cbde3e753f866a101d071a149d. Your example:
library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
mutate(method= list(c("bb_handler_wget", level = 3,
accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))
Can now be simplified to:
library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time") %>%
bb_modify_source(method= list(accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc"))
Note that dplyr
is now not needed. And if you really wanted to, I think that this would be equivalent:
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time") %>%
bb_modify_source(method= list(accept = "*l4_2018*"))
(Not tested. But you are only matching on the file name here, not the full URL, so you can get away with passing a file glob via accept
in place of a regexp via accept_regex
.)
I couldn't reconcile the documentation here while trying to set up a custom accept_regex.
https://github.com/AustralianAntarcticDivision/bowerbird/blame/master/vignettes/bowerbird.Rmd#L432-L434
Specifically, I don't think this code fragment could work:
This is an example of what worked for me (with a private user/password not included here).