ropensci / bowerbird

Keep a collection of sparkly data resources
https://docs.ropensci.org/bowerbird
Other
48 stars 6 forks source link

structure of list-col for method #18

Closed mdsumner closed 6 years ago

mdsumner commented 6 years ago

I couldn't reconcile the documentation here while trying to set up a custom accept_regex.

https://github.com/AustralianAntarcticDivision/bowerbird/blame/master/vignettes/bowerbird.Rmd#L432-L434

Specifically, I don't think this code fragment could work:

mysrc <- mysrc %>%
  mutate(method=c(method,list(accept_regex="/2017/")))

This is an example of what worked for me (with a private user/password not included here).

library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
  mutate(method= list(c("bb_handler_wget", level = 3, 
                      accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))
raymondben commented 6 years ago

Hrumph, you are right, that fragment doesn't work. This is syntactically OK:

mysrc <- mysrc %>%
  mutate(method=list(c(method[[1]],list(accept_regex="/2017/"))))

but is appallingly ugly code, and the actual intent (accepting only 2017 files) doesn't work. It seems to ignore the regex for reasons I don't understand.

Your example works but requires re-specifying the existing method components. I was hoping the example would just show a modification/addition of a method flag, without needing to re-specify the entire method entry. This non-dplyr way would work:

mysrc$method[[1]]$accept_regex <- "whatever"

There must be a less-ugly way to do it via mutate tho.

mdsumner commented 6 years ago

I think the power of mutate is not needed though, since this would always be on a single source? At the moment I favour the non-dplyr way :)

What if we ensured all list-cols were systematically structured, so normalize many properties to name,value in long form?

Needs a generic handler for each list-col:

bb_handler_method <- function(name, ...) {
  properties <- list(...)
  propnames <- names(properties)
  lens <- lengths(properties)
  propnames_rep <- rep(propnames, lens)
  stopifnot(nchar(all(names(propnames))) > 0)
  structure(tibble::tibble(method_property = c("name", propnames_rep), 
                 method_value = c(name, unlist(properties))), 
            class = c("bb_handler_df", "tbl_df", "tbl", "data.frame"))
}

And then that is wrapped in a list

library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
  mutate(method= list(bb_handler_method("bb_handler_wget", level = 3, accept_regex = "/2017/")))

And then sources can be expanded out using tidyr unnest

my_source %>% dplyr::select(name, method) %>% tidyr::unnest()
# A tibble: 3 x 3
#  name                                    method_property method_value   
#  <chr>                                   <chr>           <chr>          
#1 CMEMS global gridded SSH near-real-time name            bb_handler_wget
#2 CMEMS global gridded SSH near-real-time level           3              
#3 CMEMS global gridded SSH near-real-time accept_regex    /2017/ 

It means that all of the list-cols need this mechanism, and I don't think unnest can work with two disparate columns at once - so the select() is crucial.

It means that properties have to be character - no raw functions or numeric - but maybe that's ok?

mdsumner commented 6 years ago

Hmm, it doesn't solve the "append" aspect, and unnest definitely cannot mix disparate list-cols

Here I added another handler for multiple sources, and use a different number of those compared to the method values. Dunno :)

bb_handler_source <- function(srcs) {
  tibble::tibble(source_url = srcs)
}

my_source <- my_source %>%
  mutate(source_url = bb_handler_source(c("https://abc.com", "ftp:/xzy.org", "ecwp://123.a")))
  mutate(method= list(bb_handler_method("bb_handler_wget", level = 3, accept_regex = c("/2017/", "/2018"))))

my_source %>% dplyr::select(name, method) %>% tidyr::unnest()
my_source %>% dplyr::select(name, source_url) %>% tidyr::unnest()
raymondben commented 6 years ago

Returning to an earlier point, I think the advantage of dplyr/mutate here is that it gives a nicer way to specify my personal selection of sources, some with modifications for my own requirements:

cf <- cf %>% bb_add(sources("choice1") %>%
    bb_add(sources("choice2") %>% mutate(whatever=something)) %>%
    etc

which can be done without mutate but just a bit messier. This style of syntax just seems more appealing to me.

I'm not keen to make further changes to the actual data structures at this point, but given the fairly awful nature of dealing with list-cols here a bb_source_modify helper function might be very handy? Perhaps along the lines of your bb_handler_method:

my_source %>% bb_source_modify(user="me",method=list(level=5,accept="blah"))

which would overwrite the existing method level and accept entries if they exist, or create them if not, and add/modify the user entry.

mdsumner commented 6 years ago

That sounds good, I'm definitely not pitching for major change!

raymondben commented 6 years ago

See https://github.com/AustralianAntarcticDivision/bowerbird/commit/909566a1152248cbde3e753f866a101d071a149d. Your example:

library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
  mutate(method= list(c("bb_handler_wget", level = 3, 
                      accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))

Can now be simplified to:

library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time") %>%
  bb_modify_source(method= list(accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc"))

Note that dplyr is now not needed. And if you really wanted to, I think that this would be equivalent:

my_source <- blueant::sources("CMEMS global gridded SSH near-real-time") %>%
  bb_modify_source(method= list(accept = "*l4_2018*"))

(Not tested. But you are only matching on the file name here, not the full URL, so you can get away with passing a file glob via accept in place of a regexp via accept_regex.)