ropensci / software-review

rOpenSci Software Peer Review.

Bowerbird #139

Closed raymondben closed 6 years ago

raymondben commented 7 years ago

Summary

A package for maintaining a local collection of data sets from a range of data providers. Bowerbird can mirror an entire remote collection of files, using wget's recursive download functionality. Bowerbird also provides some functions around data provenance and versioning (it doesn't fundamentally solve these issues, but goes some way towards solutions).

Package: bowerbird
Type: Package
Title: Keep a Collection of Sparkly Data Resources
Version: 0.3.4
Authors@R: c(person("Ben","Raymond",email="ben.raymond@aad.gov.au",
       role=c("aut","cre")),
       person("Michael","Sumner",role="aut"))
Description: Tools to get and maintain a data repository from third-party data
    providers.
URL: https://github.com/AustralianAntarcticDivision/bowerbird
BugReports: https://github.com/AustralianAntarcticDivision/bowerbird/issues
License: MIT + file LICENSE
Imports:
    assertthat,
    dplyr,
    openssl,
    R.utils,
    rmarkdown,
    rvest,
    stringr,
    xml2
LazyLoad: yes
RoxygenNote: 6.0.1
Suggests:
    archive,
    knitr,
    testthat,
    covr
Remotes: jimhester/archive
VignetteBuilder: knitr
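
In brief, the intended workflow looks like this (lightly adapted from the package README at the time of submission; the data source name is just an example):

library(bowerbird)
library(magrittr) # for %>%
cf <- bb_config(local_file_root = "~/data")                        # where the mirror will live
cf <- cf %>% add(bb_sources("CERSAT SSM/I sea ice concentration")) # choose a data source
bb_sync(cf)                                                        # mirror it (recursive wget under the hood)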

To our knowledge, nothing really does the same thing. There is some similarity to https://github.com/ropensci/rdataretriever, though rdataretriever seems to be angled towards biodiversity data sets in particular and creating sensible local database structures for them. Bowerbird is focused on mirroring remote data to a local file system, and providing some functions around data provenance. There is passing overlap with http packages (httr, crul), but these are generally intended for single-transaction usage. Jeroen's curl package (not an ropensci one?) is also similar to bowerbird in that it wraps an underlying http client: bowerbird typically uses wget under the hood to accomplish its web traffic, whereas curl binds to libcurl. AFAIK curl doesn't support mirroring of external sites (which wget does, and which bowerbird relies heavily on).

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

Additional notes re: our presubmission enquiry

General note: the package is not in a final polished state, but we think far enough advanced (and stable enough) to be a good point for onboarding consideration.

It would be helpful to actually separate out the core mechanism and additional sources. This could go as far as having separate packages (which we could handle together).

Fair suggestion, and one that we've considered - and maybe that split is still reasonable to consider down the track. But for now, at least, we think it's better to keep core-functionality and the data-source-definitions bundled together.

Have you considered using rappdirs for default data directories?

It's up to the user where they want to put their data. We do make a suggestion in the README and vignette for users to consider rappdirs.

maelle commented 7 years ago

Editor checks:


Editor comments

Thanks for your submission @raymondben! I'm currently seeking one reviewer (already found one)!

Here is the output of goodpractice::gp() for you and the reviewers to consider (note: we don't expect you to have 100% unit test code coverage!)

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 79% of
    code lines are covered by test cases.

    R/aadc_eds_handler.R:14:NA
    R/aadc_eds_handler.R:15:NA
    R/aadc_eds_handler.R:16:NA
    R/aadc_eds_handler.R:17:NA
    R/aadc_eds_handler.R:27:NA
    ... and 385 more lines

  ✖ write short and simple functions. These functions have high cyclomatic
    complexity: bb_source (56).
  ✖ avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R\aadc_eds_handler.R:28:1
    R\aadc_eds_handler.R:32:1
    R\aadc_eds_handler.R:36:1
    R\aadc_eds_handler.R:38:1
    R\aadc_eds_handler.R:40:1
    ... and 480 more lines

  ✖ avoid calling setwd(), it changes the global environment. If you need it,
    consider using on.exit() to restore the working directory.

    R\aadc_eds_handler.R:44:5
    R\amps_handler.R:44:9
    R\provenance.R:50:5
    R\sync.R:63:5
    R\utils.R:77:5
    ... and 2 more lines

Reviewers: @MilesMcBain @lwasser Due date: 2017-11-07

raymondben commented 7 years ago

Minor comment on the goodpractice output, which I meant to include in the initial submission: setwd is necessary, but all calls are protected by on.exit(), which also takes care of other global changes where necessary (e.g. to http_proxy env settings).
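
For anyone unfamiliar with the idiom, the pattern looks roughly like this (a generic sketch, not bowerbird's actual code):

## change directory for the duration of some work, restoring the previous
## working directory when the function exits - even if an error occurs
with_dir <- function(dir, code) {
  old_wd <- getwd()
  on.exit(setwd(old_wd), add = TRUE)
  setwd(dir)
  force(code)
}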

maelle commented 7 years ago

Ok makes sense @raymondben, thanks for this comment!

maelle commented 7 years ago

Thanks for agreeing to review this package @MilesMcBain @lwasser! :smile_cat: Your reviews are due on 2017-09-04.

As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.

MilesMcBain commented 7 years ago

With the Editor's permission, I am posting some feedback ahead of a full review in the interests of discussing potential structural changes as early as possible.

The issue of separating the core functionality from data sources was raised by the editors and the more I poke around under the hood, the more I am convinced this will also be my recommendation.

Part of my motivation is selfish. The number of sources bowerbird touches means a comprehensive review of this package, as is, will be an onerous task. Some other reasons I can think of why this might be a good idea:

maelle commented 7 years ago

Thanks @MilesMcBain!

raymondben commented 7 years ago

Thanks @MilesMcBain. I'll confess I'm not 100% sold on the splitting idea, but I'm also not vehemently against it and your points are good. @mdsumner and I will have a chat and see what we come up with. In the meantime, suggest that you and @lwasser don't spend too much more time on it, since it's likely to look fairly different if we do split it.

raymondben commented 7 years ago

OK, so: on reflection, we agree that splitting is probably a good idea. @maelle, what's the preference here? Do you want to close this issue and we resubmit, or do we just carry on here? (The split version --- not entirely finalized yet --- is currently sitting in the "split" branch.) Note that we'll only submit the core part of the package at this stage, including a few example data sources. The extended collection of data sources will be hived off into a separate package for further work when time permits, and possible submission.

For completeness, some thoughts on Miles' comments above:

Package Maintenance: It seems like ensuring calls to these different sources and APIs are working as advertised could become a maintenance issue that leads to frequent package updates. Have remote server changes generated much work to date?

Moving sources into one or more other packages doesn't reduce the maintenance burden, it just spreads it across packages. We see occasional but not overwhelming changes to remote servers that require tweaks to the handlers, but over the 10+ years that we've been doing this (in various guises) we haven't found it to be a showstopper. However, splitting the package (into "core" plus other data-themed package(s) that depend on it) should indeed mean that the core components will remain more stable. That alone seems like a pretty good argument in favour of the split.

Nicer Data Interfaces: Assuming data sources get moved to new packages that depend on bowerbird, there may then be scope to tweak the APIs of those packages so they more closely suit their user's needs. By this I mean implement API specific function arguments that mean users do not need to understand the workings of wget arguments. An example would be setting a date range as function arguments, instead of directly as --reject-regex arguments in the case of the README download size example.

Yes, agreed, but it's worth exploring this a bit. The aim of bowerbird is to be flexible enough to be able to handle a wide range of data sources, and to allow new data sources to be added fairly easily. Hence bowerbird itself doesn't generally attempt to wrap syntactic sugar around the wget arguments, because it's not possible to do it in a general manner. It's not just dates you might want to subset, it could also be geographic space, or vertical levels within a climate model, or different file formats, or any number of other facets depending on the data set --- and different data providers will structure their data directories and files differently. There's simply not enough consistency. But in most cases this sort of subsetting isn't necessary, and the wget calls are fairly straightforward --- so that aim of bowerbird (ability to add new data sources fairly easily) generally seems to hold up. More complicated examples will always arise, but that's just the nature of the task.

Now, for sure, nicer function arguments could be part of more targeted packages (that might themselves depend on bowerbird) --- but doing so immediately makes the maintenance burden high because of the complexity and diversity of those options, and hence such packages tend to focus on a small number of data providers. So where is it best to invest that effort? Is it at the data-fetching level? Maybe. But a data-fetching package might well be used to maintain a data library on behalf of a number of users (this is what we do, anyway). In this scenario you're generally likely to want all files that are part of a given collection, because it's probably impossible to know a priori which ones your users will want. And if it's a centrally-maintained library within an organisation, then only a small number of admin users need to know about the data-fetching internals. Perhaps investing effort in polishing these interfaces is not going to benefit a huge number of people, and it might be better to invest the nicer-function-argument effort at the next level, at the stage of reading data from those files into R. General-purpose packages like raster provide the nuts and bolts for this, but to use them requires specialist knowledge of how the data files and directories are structured. Data-reader functions that provide e.g. a date range argument (and translate that into appropriate raster calls to read the right files and bits of files) are enormously helpful to users who just want to use the data, because this relieves them of the need to know about the file structure details (see e.g. @mdsumner's https://github.com/AustralianAntarcticDivision/raadtools package as one example of this, alongside many others).

Anyway (sorry, that was a bit of a monologue!) none of that argues against splitting the package. But it does encourage some thought as to the best way to build upwards.

maelle commented 7 years ago

@raymondben thanks! I first read this too quickly and thought we'd have two submissions instead of one, which isn't the case.

I'll put the submission on hold while you do the splitting, and after that @MilesMcBain & @lwasser can review the core package.

MilesMcBain commented 7 years ago

Thanks for your response @raymondben.

Firstly, one small point about the maintenance: with the split design you get much more freedom. You may not even put all your data sources on CRAN; you can choose to leave some on GitHub, where the expectations about maintenance are much lower. In this way the maintenance burden could be reduced.

Secondly, I did get some of the way through my review before we put this on hold, so I thought it might be helpful to have the higher level feedback now while things are being actively worked on:

Following bowerbird.Rmd

library(bowerbird)
library(magrittr) # for %>%
cf <- bb_config(local_file_root="~/data")
cf <- cf %>% add(bb_sources("CERSAT SSM/I sea ice concentration"))
bb_sync(cf)

This kicked off a 2.5 GB download that blocked my R session, which I eventually had to kill. Not a great first user experience ;). I know down the page you point out that the download size for this source can be reduced, but the total size of the source is not mentioned until the source list at the bottom.

I would like to make the following recommendations:

A question: is there a way to check sync status in interactive mode without fetching these large repos?

Programming model

bb_config returns a sync configuration object. This would be in line with my expectations of API package usage. A very similar pattern is in the usage section of the ropensci/osmdata README.

Unlike osmdata::opq, bb_config seems unusual in that it returns its configuration as attributes of an empty tibble, which is not mentioned in the documentation. It means that to inspect the configuration one needs to use attributes(cf) instead of, say, cf$ with IDE autocomplete features to inspect the object structure. I would argue this is unusual enough that it should feature prominently in the documentation. Users need to be warned about applying operations to the tibble that obliterate the attributes - very easy to do in the tidyverse.
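
To illustrate the fragility (a toy example; the attribute name here is made up, not bowerbird's):

library(dplyr)
cf <- tibble::tibble(dataset = character())
attr(cf, "local_file_root") <- "~/data"  # repo-wide setting stored as an attribute
attributes(cf)$local_file_root           # "~/data" - inspection requires attributes()
cf2 <- bind_rows(cf, tibble::tibble(dataset = "new source"))
attr(cf2, "local_file_root")             # NULL - bind_rows built a new tibble, metadata gone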

I would also make the comment that I do not think this is the idiomatic tidyverse way of approaching things. I only mention this since it seems you use many tidyverse functions. I think the tidyverse way would be to use a list-column to capture this information. Take a look at how tidyverse/googledrive stores file metadata in the dribble() class for an example.

Finally, the add() function would need to have a 'bb_' prefix added to adhere to function naming guidelines.

raymondben commented 7 years ago

Thanks @MilesMcBain - a quick question on the config arrangement (and, yes, it is unfortunately easy to unknowingly lose the attributes). The reason for doing it that way is because those attributes apply to the whole configuration (and thus should only be stored once, not in every row) whereas the rows hold data-source-specific information that is not necessarily common across rows. An alternative structure that we considered would be to create a list object that has something like a data_sources tibble inside it (i.e. the existing bb_config rows) and a config tibble (holding what are now stored as attributes). This is pretty close to what opq does - would that seem better to you? I can't remember now why we didn't go that way. Probably some arbitrary decision of mine!

mdsumner commented 7 years ago

We might commit more deeply to a nested-tibble model for this one-to-many attribute problem, but I'm not sure that covers everything. It's definitely something that we haven't been sure about, so the feedback is very welcome.

Also, @MilesMcBain I'm very sympathetic to this user-problem for a 2 GB download, but we are in danger of diluting the entire point of the abstraction of bowerbird - in that it's not ever going to be immediately obvious from a user perspective why you'd want this package, since the value comes after you've decided to commit to getting all or most of a particular collection - especially those parts that aren't yet published. Note that while the first-time sync run is inconvenient, if it successfully completes then when it runs next time, tomorrow, next week, it will be hugely fast as it will only check to see that it already has all but the handful of brand-new files from the source, and it will dutifully obtain those new files without any intervention by a human.

We could choose a weekly SST data set that would be easily downloadable in a single file - the initial commitment for download is minutes or less - and the real value of bowerbird would become apparent next week when that file is updated by the provider, and the automated bowerbird process (that you configure to run nightly, or so) kicks in and detects a change.

But not many or most users will care about SST as a variable of interest, and those that do will want much higher resolution in time and space before they care. So, by definition and rather unfortunately, the examples that are most compelling and immediately reproducible are not going to have an actual audience, though they should have an abstract appeal in terms of what else can be done with this tool that is "relevant to me". I do want to ensure we discuss this audience definition, since we are ultimately targeting people like us: those that commit to a large and continuously maintained collection for shared use. (We are also trying to highlight that committing to this is very valuable, an actual game changer for the way we think about and approach these issues).

I definitely agree about the "yesno" prompt for whether you really want to download a potentially huge collection, but again that interactivity, normal for "usual use", kind of gets in the way of the task of maintaining a huge collection of automatically updated data. We do see huge value in specific package applications, to provide smarts about available dimensional and variable filters, but we unavoidably end up in very specific areas. I am a bit wary of illustrating the value of bowerbird by exploring a particular data set to the level of actually reading it, because as soon as we go that far we've moved away from the bowerbird abstraction of being an automatable, configure-anything data-getter.

Thanks for the discussion and feedback! (I do think the weekly SST example is a good one, and @raymondben and I should look carefully at a use case for illustrating a full workflow - and probably we need to tell the story of the pending updates that an easily-configured bowerbird robot will sort out for us tomorrow, next week, in the months ahead).

maelle commented 6 years ago

@raymondben @mdsumner just checking in. 😸 Have you made enough progress on the splitting for @MilesMcBain & @lwasser to soon review the package? Thanks!

raymondben commented 6 years ago

Hi guys, sorry, we've been held up - in part by a bug in sys but Jeroen has kindly resolved that now. I don't think we're far off, hopefully get back to you this week.

maelle commented 6 years ago

Ok thanks for the update!

raymondben commented 6 years ago

@MilesMcBain @lwasser @maelle

Sorry for the hiatus, think we're OK to resume the review process now. The major change is that the data source definitions have been removed, except for a few representative examples. (The full set of definitions has moved to https://github.com/AustralianAntarcticDivision/blueant. This is a separate package and not part of this ropensci submission, but it's there should you want to get some idea of how a themed-data-package might make use of bowerbird.)

Please note that until sys version 1.5 hits CRAN, you'll need to install it from GitHub using devtools::install_github("jeroen/sys") prior to installing bowerbird. Bowerbird requires v1.5 in order to avoid a Windows bug in 1.4. Also note that check is currently failing on both Travis and AppVeyor - on Travis because of some Ubuntu-package issue (that I haven't yet figured out), and on AppVeyor because of the sys issue. But bowerbird is passing check locally on both Windows and Ubuntu, so I think it's OK. Update: sys v1.5 is now on CRAN.

Some replies to Miles' partial review:

Not a great first user experience

Yah, fair enough. The example is now a small one, and in addition we've switched from using system2 to sys, which means that the user can interrupt the external process without needing to kill their session.
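
Roughly, the difference is as follows (an illustrative sketch; assumes wget is on the PATH, and the URL is a placeholder):

url <- "https://example.com/data/"
# system2("wget", c("--recursive", url))                 # blocks the R session until wget finishes
status <- sys::exec_wait("wget", c("--recursive", url))  # interruptible: Ctrl-C also terminates wget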

If this package is intended to be used exclusively from the command line ...

No, it's not exclusively this. See updated comments in README. Also, we can't give progress indicators (also discussed in README).

is there a way to check sync status in interactive mode without fetching these large repos?

I'm not sure what you mean by this, sorry.

bb_config seems unusual in that returns its configuration as attributes of an empty tibble ...

That structure has now changed and doesn't rely on attributes to carry the repo-wide settings. bb_config now returns a list with two elements: data_sources (a tibble) and settings (a list of repo-wide settings).
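
So the returned object now has roughly this shape (a sketch; only the two element names are from the package, the contents shown are illustrative):

cf <- list(
  data_sources = tibble::tibble(),             # one row per data source
  settings = list(local_file_root = "~/data")  # repo-wide settings
)
cf$settings$local_file_root                    # inspectable with $ and IDE autocomplete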

All exported function names are now prepended with bb_.

Update: fix found for Travis issue. Update 2: sys v1.5 is now on CRAN so no special preinstall needed for that, and AppVeyor is passing.

maelle commented 6 years ago

Thanks @raymondben! @MilesMcBain & @lwasser, you can now proceed with the review. 😺

maelle commented 6 years ago

Note: I've updated the reviews due date to the 2017-10-30.

MilesMcBain commented 6 years ago

Thanks for the update @raymondben. I look forward to reviewing the new version.

Regarding this comment:

is there a way to check sync status in interactive mode without fetching these large repos?

Since the package is intended to be used interactively, I thought it would be helpful if I could check that I am on the latest version without initiating a large download. If it turns out I need to sync, I'd probably like to tee it up in a separate session so that it doesn't block my main work environment. It might be that this is already possible and I did not see the function.

MilesMcBain commented 6 years ago

@maelle @noamross, I would like to flag I am not going to be able to make that review deadline. Would it be okay to have my review in by the 7th of November?

lwasser commented 6 years ago

What is the new deadline for the review? I'm hoping to start soon but this week and next are very busy for me as well.

sckott commented 6 years ago

Hi @MilesMcBain and @lwasser - stepping in for Maëlle here for a bit.

Okay, new deadline is 2017-11-07

raymondben commented 6 years ago

@MilesMcBain re:

is there a way to check sync status in interactive mode without fetching these large repos?

Oh, right, gotcha. The answer is sort of "yes, but no". There is bb_config(...,skip_downloads=TRUE) which will cause the subsequent sync process to do a dry run without downloading any data files. But in reality, this is less useful than I'd hoped. It suppresses any actual calls to wget, and so any recursive wget action (descending from the top-level data set URL) doesn't get replicated in the dry run. So quite possibly you won't get any indication of whether or not you have particular data files within the repo. For that reason, I'm debating whether the skip_downloads parameter is even worth keeping. And as far as I know, there's no way to instruct wget to report on files that it would download, but not actually download them. On first glance the wget --spider flag sounds like it might be appropriate, but I've discovered that in this mode wget will delete existing files, which is less than helpful.

lwasser commented 6 years ago

Hi guys -- where is the best place to get feedback on issues I'm running into? For example, I tried to run the vignette code from the README and got this error:

sh: wget: command not found
Error in bb_sync(cf) : 
  could not find the wget executable. Try the bb_install_wget() function, or install it yourself and ensure that it is on the path

I'd like to be able to work through the vignettes! NOTE: I'm on a Mac, not Windows.

raymondben commented 6 years ago

Hi @lwasser, feel free to raise issues here. I don't have/use a Mac, but if you use Homebrew I think that you can just do brew install wget to install wget. I'll find someone here with a Mac and see if that works ...

lwasser commented 6 years ago

Thank you @raymondben! I did try brew. It wants me to update ruby, however, and I've got a few ruby scripts that I'm afraid of breaking... so I'm running those now to get my data, and then I can risk it breaking tonight :). If there is an alternative install option please say the word. I'm looking around now.

UPDATE: got it: brew install wget --with-libressl. I am not exactly sure why that worked, but I don't have to update ruby, so phew :)

Still have this question:

raymondben commented 6 years ago

Nicely done. Clearly, better Mac support is something to put on our to-do list!

Re - questions:

  1. we don't have a particular view on intended users (the README/vignette touches on this a little). It could be users who need to keep local copies of data files and just want a convenient mechanism to do so. It could be a super-user who does the same on behalf of a number of local users (this is our use case; see below). It could be useful to package authors who need to retrieve data but don't want to reinvent code for mirroring, incremental updates, etc. It could be useful as a means of distributing example code (e.g. I want to send someone some demo code of some sort, but it needs 100 GB of data files to run, so rather than send the data files directly or have to explain to the other person where to get them, I can just send them a bowerbird snippet).
  2. Use cases - one is definitely the maintenance of a local library of data files. We use it to maintain a ~5 TB or so library of satellite and other environmental data for use by a number of people in our local research community. See the blueant package mentioned above for our current list of ~50 data sources that we mirror.
  3. Yes please (but don't feel obliged to spend excessive time on it). I will not be surprised, however, if you were to say that it's all a bit opaque and difficult to get working! We've tried to pave the way, but I am sure it can be improved and that's something I'm hoping the ropensci onboarding can help with.

lwasser commented 6 years ago

Wonderful, thanks @raymondben. OK, well, I did focus a lot on #3 above. And if my feedback is helpful -- I'm happy to continue to test things as the package develops :). My apologies that this took longer than I know you all wanted it to.


Hi @mdsumner @raymondben @sckott @maelle

I want to first and foremost thank the developers for putting so much time and effort into this package. And I want to thank rOpenSci for asking me to do a review. My reviewer perspective is that of a regular user.
Specifically, because I spend so much of my time teaching scientists, post docs, grads, undergrads, etc., I think a lot about what a typical scientist knows about data and APIs. Please take my review as a set of suggestions, as I greatly appreciate the effort here! I did not run all of the spell-check and similar tests yet. Instead I spent a good bit of time downloading data and thinking about pain points for users of various skill and knowledge levels.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Is this being submitted to JOSS? How do I know that?

Functionality

Final approval (post-review)

Estimated hours spent reviewing: so far I think I've spent about 5 hours reviewing, with a few more playing with it when there was talk of it being split


Review Comments

Overall Concept

In general I think this effort is a great one. I like the idea of a data download utility that can help a user manage and maintain metadata around their downloaded data. It also supports increased usability and repeatability of a project on the "left side" of the spectrum - i.e. getting the data initially. I support and applaud this effort!

However I have a few general concerns that I think could be addressed. I'd like to better understand the use cases for this effort and the users' technical knowledge. In general, most people don't fully understand how large some of these data are. And often the starting place is to "download everything" without recognizing that, for instance, their hard drive is quickly filling up.

Also, these days things are moving away from local processing and storage and towards cloud-based activities. Thus, I think this package could provide some guidance for users BEFORE they download and attempt to archive everything.

For instance,

I think some of these learning points are important to a successful package because so many users could quickly make huge API calls and fill up their hard drives unknowingly.

Use Cases

I'd like to see several clear use cases for this package that include the user personas who are accessing the data - i.e. how much does the audience know for each use case, particularly as it relates to remote sensing data? I see many potential challenges for users who are less well versed in working with these data - particularly in the data-size realm and dealing with API calls and associated arguments.

I'd also like to see a bit more documentation and warnings associated with potentially large requests and the associated resource availability on a user's machine.

Package Documentation

In general, the README and introductory package text is written for someone with more technical experience who has probably used APIs before, even though the audience as described is "researchers". If the audience is anyone who is looking to access data and build metadata around it, I think clearer documentation is required up front about what the package does and how to use it. I provide some specifics in this regard below!

I also think that all of the function documentation needs to be developed further.

Setup Feedback

First I installed the package from github. No issues with the installation.

install.packages("devtools")
library(devtools)
# install instructions missing space
install_github("AustralianAntarcticDivision/bowerbird", build_vignettes = TRUE)

# add this to the vignette just for clarity's sake!
library(bowerbird)

also minor:

this URL doesn't work for me: http://oceancolor.gsfc.nasa.gov/

WGET Issues

When I ran through the vignettes, I immediately ran into issues with wget().

sh: wget: command not found
Error in bb_sync(cf) :
  could not find the wget executable. Try the bb_install_wget() function, or install it yourself and ensure that it is on the path

This error seems to be focused on Windows users, whereas I am on a Mac. After shedding a few tears, I tried to use brew.

brew install wget

This approach tried to force me to update ruby, which I wasn't prepared to do given other running tasks (twitter API extravaganza is underway over here!!). The code below, however, worked. I actually don't know why. It must have been the work of a helpful magical make-things-install-right gnome.

brew install wget --with-libressl

Notes:

Installing wget requires admin access to your laptop. If you are on a govt- or university-issued computer that is domain controlled, you may or may not have access to install wget. I have super-user powers on my university-issued Mac ONLY BECAUSE I BEGGED AND PLEADED and downright insisted. And because our IT person is awesome and gets it. So I was able to use it to install wget. I think wget could be a pain point for many users who are not advanced. If you want this to be widely used, I'd spend just a bit of time beefing up the documentation about how to install this on Windows and Mac. And let people know they may need admin rights...

Errors also should be specific to Windows or Mac, or else you'll get Mac users unknowingly trying to install Windows things. I've seen this happen countless times.

Suggested Actions

Style

  1. Please follow the tidy style guide: http://style.tidyverse.org/. All vignettes, documentation, etc. are missing spaces around function-argument assignment operators and after commas.
    • Some code and comments are well beyond the suggested 80-character width for readability.
    • Could a lint test be included in the TravisCI setup to ensure style consistency?

SETUP

  1. Add library(bowerbird) to the readme
  2. Add more explicit instructions and discussion about wget(). Add instructions for both Mac and Windows and be explicit about which install path is for which OS. If you expect to only support advanced users, then they may be able to figure it out.
    • Adjust error messages surrounding wget() to be Windows vs Mac specific so as not to mislead users - like me :) .

Specific Recommendations

Vignettes

bb_config() sets up the data "warehouse" directory where data will be kept. The bb_source() function fails if a directory doesn't exist.

Recommendation: Test whether the input directory exists in bb_config()

# setup - a check on the directory here? !dir.exists(~/Documents/data/bowerbird/) or assert...
cf <- bb_config(local_file_root="~/Documents/data/bowerbird/")
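
For example, something like this might do it (a hypothetical sketch; assertthat is already among the package's imports):

local_file_root <- "~/Documents/data/bowerbird/"
assertthat::assert_that(dir.exists(local_file_root),
                        msg = "local_file_root does not exist")
cf <- bb_config(local_file_root = local_file_root)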

Next, I tried to define a data source.

bb_source()

This function is dense, with a lot of powerful arguments. However, it is so dense that a user without some API experience could get lost and overwhelmed. It would be very helpful to break down this function into key parts such as:

  1. Metadata arguments that support the "database" created by bowerbird
  2. API arguments that are required to download the data

Please note that API arguments are not straightforward for most users. If your audience is really researchers and scientists rather than power users of APIs and technology experts in general, you will need more documentation.

Notes:

Here are some suggestions to consider:

  1. Provide the most bare-bones of bare-bones vignettes that only use required arguments. Currently the vignettes use both required and non-required arguments.
  2. Consider breaking up this mega function into two parts. Maybe one is a list or vector of metadata items and two is a list of API items? Or perhaps there is another clever way to help a user generate this function call?
  3. At a minimum, break up the function documentation to have two sections of arguments to make the API vs the metadata explicit.
  4. I think there is a function passed to an argument here: method=quote(bb_handler_wget). This will confuse a user because when you look up the documentation for this, you see additional arguments.
  5. Rework documentation: for instance, the earthdata function documentation page only provides a link to how to register for the website. I am left uncertain about how to actually use the API to get earth data - data. And I've used that website a good bit - granted, last year's version of it anyway. :)
  6. Vignettes that explain how to identify the source_url string would be helpful. It is not clear to me how I would set up access to an earth data download. For instance - what if I wanted to get snow cover from NSIDC? https://n5eil01u.ecs.nsidc.org/VIIRS/VNP10.001/ Or what if I wanted landsat vegetation indices? Vignettes that explain how to access specific datasets (perhaps ones identified in your use cases) would help significantly. On the earth data website it actually took me a while to find any URLs that lead to data. This isn't Bowerbird's problem - it is NASA's. BUT could this package guide a user through how to find an appropriate source_url for a dataset? Just a couple of examples...
  7. I didn't notice this in the vignettes, and perhaps I missed it in the code: is there a summary function that gives you an overview table of all of the data in your "warehouse"? If there is, that would be useful - please add that to the vignette.
  8. Why is this called bowerbird? The name doesn't fully seem to relate to the library function, BUT maybe I need to read more about bowerbirds?!

Provenance / Flexibility / Adaptability

Next, I tried to get some data from NASA Earthdata. I have used this site a good bit to get RS data and noticed that the entire site has changed since last year. Of course it all happened right before my fall class that uses it!

Here are some questions for the developers:

  1. How does the package handle data updates? For instance, last year USGS reprocessed all of the Landsat 8 data with the creation of collections. Thus, all of the data that I downloaded last year are different now. Different naming conventions, some changes in files such as the cloud cover layers, etc. Will Bowerbird help a user identify this type of change and guide them through how they want to handle the change (i.e. re-download data, start a new library with the newly processed data)? How will it do that?
  2. The NASA Earthdata site has also changed considerably since last year. Again, how will bowerbird handle / warn users of changes? We hope that DOIs will persist, but if the newly available data have a different naming / metadata structure ... what happens?
  3. Many of these datasets are extremely large. Does bowerbird help a user track resource availability on their computers - especially as it's downloading zips and unzipping? Bowerbird could be a beautiful helper package to manage data. It also could allow less experienced users to make big mistakes in terms of trying to download data that are too large for their computing resources.
    • Can bowerbird help a user not make huge requests, or provide a warning? The beauty of this package is the ease with which you can download and maintain data. The danger is the same - it's very easy to kick off some huge calls that may fill a user's hard drive.
  4. In general I have a question that I'm sure everyone involved has grappled with: Do we want to encourage people to store data locally? Generally remote sensing data are so large that we may not even want to process them locally. Things are moving towards the cloud and we do want to avoid people starting to build their own mini DAACs on their machines. This is a general comment to consider. This package makes me think a lot about the behaviors I've seen with my students and colleagues at workshops accessing data.

Truly great effort, and tremendous work has obviously been put into this package! Please take this feedback as an attempt to help with the eye of a regular (non-power) user. :)

Also please let me know if I missed some key parts of this review. It is my first for rOpenSci!! :)
--Leah

maelle commented 6 years ago

Many thanks for your review @lwasser! 😀 Great feedback! Could you just go over the checklist at the beginning of the review template and tell me what was not checked because you didn't have time to do the check vs there was an issue?

@milesmcbain today is the deadline 😉

lwasser commented 6 years ago

Done!! @maelle, there were so many parts to this review and I didn't have time to get to each and every step. I spent a lot of time thinking about how someone would use it and testing various APIs. I tried to be more descriptive above.

maelle commented 6 years ago

I missed the comments because there was no :heavy_check_mark: :see_no_evil:. This is perfect, and no problem about not doing everything, since your review is very rich and useful. I will do the rest after Miles' review is in.

lwasser commented 6 years ago

Oh, so sorry @maelle, I didn't realize that was a check for me! I was thinking the ✔️ was me signing off on the fact that everything was there in the R package 🙈! :) My mistake. I did go through the list. I'll check it off now!!

maelle commented 6 years ago

No, you did it right: checking is when things are as good as you want them to be. Stupidly, I had missed the comments in the list - all my fault!

lwasser commented 6 years ago

<<unchecking in progress now 😂 ... >> I missed the conflict of interest ✔️ so I did that. Thanks @maelle

maelle commented 6 years ago

Thank you for your patience and work!

raymondben commented 6 years ago

Thanks @lwasser - plenty of good points in there!

lwasser commented 6 years ago

Thank you @raymondben . I really respect all of the work that you and @mdsumner put into this package and am excited to see it develop :)

MilesMcBain commented 6 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 15


Review Comments

Thanks for your hard work @mdsumner and @raymondben! I applaud the changes you have made since my initial feedback. My first experience with downloading the electoral data could not have been a bigger contrast to last time. It was one of those: "By what magic is this made possible?" moments. I now see bowerbird as a powerful tool for managing sources and syncing data from the web. You obviously get great mileage out of it yourselves and I think that is a solid foundation for a great package. The main question mark I have after my review is about how accessible this package will be to the wider community.

This concern arises from a number of areas: I find the documentation too minimal for many functions; there are still some unconventional R styles used that hindered my ability to understand the package; and a level of comfort with wget seems assumed in the package that did not hold for me, and I am not sure holds widely.

In my comments below I detail my concerns. Please take them for what they are: my attempts to improve the wider appeal of your valuable work.

Vignettes

In the self-titled vignette the use of the postprocess argument is only mentioned in passing, with respect to decompression. I'd argue it is major functionality, with enough use cases that it deserves its own subheading.

As I mention in the function documentation section, this package could benefit from a greater breadth of examples, with commentary that explains how features of the data source result in each choice of method flag and postprocessing function.

Improve function documentation

In general the function documentation is minimal to the point of being somewhat unhelpful. I learned the most from the examples and source code, but this involved jumping back and forth between many functions. Here are some comments by function:

Explain config first argument

Either the help or vignettes or both need to clarify the mysterious first argument of bb_handler_wget, the bb_decompress family, and bb_cleanup - the vignette examples and the built-in example data sources use these functions without supplying this first argument, even though it has no default. It was only after digging into the code of bb_sync that I saw why this argument exists and how it is populated.

To make this clear you could mention in the help for the arg that it will typically be populated by bb_sync and does not need to be specified in postprocessing or method.

On a related but more general note, you might also cover in a vignette that anyone needing to implement their own handler needs to write a function that must take the special arguments config, verbose, and local_dir_only in addition to any context-specific args.
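
A skeleton of what that contract implies (the argument names are from the point above; the body is purely illustrative):

my_handler <- function(config, verbose = FALSE, local_dir_only = FALSE, ...) {
  if (local_dir_only) {
    ## return the local directory this source syncs into, without downloading
    return(file.path(config$settings$local_file_root, "my-data-set"))
  }
  ## ... otherwise fetch the files described by config ...
  invisible(TRUE)
}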

Ditch dplyr or update to tidyeval

dplyr is used very little within this package, so I question its worth as a dependency, given the amount of installation time and the number of recursive dependencies it adds. I can imagine bowerbird being used with automated reproducible pipelines where this may be an issue.

Another issue is that the deprecated 'underscore' verbs (mutate_, group_by_, select_, rename_) are sprinkled around this package. For the sake of future package maintenance, these would need to be updated to the new 'tidyeval' style or removed. Given their very light usage I would recommend just replacing them with base/purrr, with a view to removing dplyr altogether.

In the case of bind_rows, reduce(., rbind) can be used guilt-free for the small data frames you are manipulating here.
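
For example (toy data; the base-R spelling shown alongside the purrr one):

dfs <- list(data.frame(x = 1), data.frame(x = 2))
Reduce(rbind, dfs)  # base equivalent of purrr::reduce(dfs, rbind)
# gives the same result as dplyr::bind_rows(dfs) for simple data frames like these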

Unusual usage of quote

At first I was confused as to why the vignette examples quote the functions supplied to the method and postprocess arguments.

After viewing the examples in various function help files, my understanding now is that you intend quote to be the mechanism with which users pass arguments with their specified functions. Is this correct? For me, this is quite unusual and not in keeping with the conventions I am familiar with in either base R or the tidyverse. I found a typo in the bb_cleanup example that supplied the arguments to postprocess in the style of the *apply family; I find this telling about how unexpected this pattern is.

I am far from an expert on R internals, but I am not sure that quote is a safe way to do this, since it may produce strange results when variable names are passed instead of literals, due to lexical scoping.
I think this mechanism needs to be reconsidered. If it is not changed to be more in line with conventional approaches, then it at least needs to be prominently flagged in the documentation.
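
A toy illustration of the scoping hazard (nothing to do with bowerbird's actual code):

postprocess_fun <- function(level) paste("cleaning at level", level)
n <- 1
ps <- quote(postprocess_fun(level = n))  # captures the symbol n, not its value
n <- 99                                  # a later rebinding changes the eventual call
eval(ps)                                 # "cleaning at level 99", not 1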

Make some wget method flags into arguments

In bb_config you did a really nice thing where you created a local function argument (clobber) to provide a nicer API to the wget clobber arguments. The help for this argument is also quite detailed giving me confidence I could use it correctly if need be. I think this pattern is the bar that the rest of the package should aspire to.

In the self-titled vignette you state:

You will probably want to specify --recursive as part of the method_flags, even if the data source doesn't require a recursive download.

before going on to explain the technical reason for this. So now --recursive is effectively boilerplate I will dutifully replicate in all my data sources until I get an error - because you said I should. This is not how you create a user-friendly API that inspires confidence. I would argue that a much clearer thing to do would be to create an argument like recursive = TRUE, with an explanation in the function documentation of the rare case where I need to think about changing it.

Another candidate would be --level=1, which seems to pop up frequently in the examples. If you made it a first-class argument of some bowerbird function you would have a prime opportunity to explain it in the documentation. Yet another candidate would be --accept - a VERY useful argument.

I am not saying you need to provide a mapping to all wget args. From what I gather there are probably fewer than half a dozen that will cover > 80% of use cases for your users. You could still have an extra-args parameter for additional arguments to wget if you want. But I think making some commonly used flags into first-class arguments would dramatically improve the appeal of this package. YOU are the experts on wget, and what you would be saying to us novices is: "Don't worry, I've understood all the wget arguments so you don't have to. These are the ones you care about and this is how you use them". You'd have a much more welcoming package that doesn't have to direct its users to a GNU manpage in the opening paras of its vignette!

The function for these args to live in is probably bb_handler_wget(). This might be a bit safer if you move away from the use of quote, as I suggest above.
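
Something along these lines, perhaps (a hypothetical signature sketched to show the idea, not the package's actual interface):

bb_wget_flags <- function(recursive = TRUE, level = 1, accept = NULL,
                          extra_flags = character()) {
  c(if (recursive) "--recursive",
    sprintf("--level=%d", level),
    if (!is.null(accept)) paste0("--accept=", accept),
    extra_flags)
}
bb_wget_flags(accept = "*.nc")  # c("--recursive", "--level=1", "--accept=*.nc")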

As things are, there is also some weirdness in the fact that some wget flags are handled automatically by the handler, e.g. --no-if-modified-since --restrict-file-names=windows --progress=dot:giga, while others are expected to be supplied. What happens if they clash?

Misc Comments

maelle commented 6 years ago

Thanks for your review @milesmcbain! 😀

@raymondben @mdsumner both reviews are now in. ☺

raymondben commented 6 years ago

Thanks @MilesMcBain and @lwasser both for your detailed and thoughtful input. Mike and I will have a pow-wow and plan our way forward from here.

mdsumner commented 6 years ago

Yes and thanks! There's a lot of really helpful feedback here, it's very good.

maelle commented 6 years ago

@raymondben @mdsumner any progress? Happy end of the year!

maelle commented 6 years ago

@raymondben @mdsumner any progress? Happy New Year!

raymondben commented 6 years ago

Hi @maelle - I think we've done most/all of the major code changes (see the postrev branch); it's now a matter of revising all documentation & vignettes. Unfortunately this isn't high-priority work right now, so it may not be a fast turnaround ...

maelle commented 6 years ago

Okay thanks, what's your timeline then?

@lwasser has just been recruited for another review by @karthik 😉

raymondben commented 6 years ago

@maelle actually probably not far off now; we have given it a good thrashing in the last few days. In the next week or so, with a bit of luck?

maelle commented 6 years ago

@raymondben @mdsumner just checking up on this submission, do you think you'll soon be able to give an update to reviewers? Thank you! 😺

raymondben commented 6 years ago

Response to reviews

@MilesMcBain @lwasser - thanks again for your reviews. Responses to each of your points appear below. Please note that the revised version of bowerbird is in its postrev branch. It will be merged into master once the revisions are complete and dependent packages have been updated to match.

Other notes: there is still not a lot of documentation to help power users who want to write their own handler functions (there is a brief section in the vignette). However, since this is likely to be a vanishingly small number of users we don't feel it's a showstopper right now. There is an issue for this though so that it doesn't drop off the radar (https://github.com/AustralianAntarcticDivision/bowerbird/issues/15).

Leah - Review Comments

I'd like to see several clear use cases for this package that include the user personas who are accessing the data - i.e. how much does the audience know for each use case, particularly as it relates to remote sensing data? I see many potential challenges for users who are less well versed in working with these data - particularly in the data-size realm and dealing with API calls and associated arguments.

Various changes have been made to address this, notably the "Users: level of usage and expected knowledge" section has been added to the README and vignette, along with an expanded set of examples and guidance on writing data source definitions.

I'd also like to see a bit more documentation and warnings associated with potentially large requests and the associated resource availability on a user's machine.

bb_sync now takes a confirm_downloads_larger_than parameter, which (during interactive use) will cause the process to pause for confirmation before downloading a source larger than the specified size (100MB by default). Re: "resource availability on a users machine" - see related comments below.
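
Usage looks roughly like this (illustrative; we assume here that the threshold units are such that the default corresponds to the 100MB mentioned above):

cf <- bb_config(local_file_root = "~/data")
bb_sync(cf, confirm_downloads_larger_than = 0.1)  # prompt before downloads over ~100MB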

I also think that all of the function documentation needs to be developed further.

Yup. All function documentation has been revised.

Minor, but I do suggest adding the library(bowerbird) call to the readme vignette

Done.

this URL doesn't work for me: http://oceancolor.gsfc.nasa.gov/

Oops, should have been https, sorry. Whatever browser you were using should have redirected, but obviously it didn't. (Changed)

Please follow the tidy style guide: http://style.tidyverse.org/. All vignettes, documentation, etc. are missing spaces around function-argument assignment operators and after commas.

Style is to some extent personal preference (indeed, the tidyverse style guide itself says that "All style guides are fundamentally opinionated ... Some decisions genuinely do make code easier to use ... but many decisions are arbitrary."). Unless it's mandatory for ropensci (we don't believe that it is?) we would prefer not to follow it.

Add more explicit instructions and discussion about wget(). Add instructions for both Mac and Windows and be explicit about which install path is for which OS. If you expect to only support advanced users, then they may be able to figure it out. Adjust error messages surrounding wget() to be Windows vs Mac specific so as not to mislead users - like me :) .

All the wget-related stuff has been revised with more platform-specific instructions, and clear notes on bb_install_wget to indicate that it only supports Windows (at this stage, anyway).

bb_config() sets up the data "warehouse" directory where data will be kept. Recommendation: Test whether the input directory exists in bb_config()

We do that check in bb_sync, because bb_config simply sets up a configuration object. There is no assumption that it will necessarily be used on the user's local machine, or run right now (maybe it will be run on a different machine, or at a later time when the directory might have been created by some other process). The function documentation is pretty clear about this behaviour (see the create_root parameter in help("bb_sync")). We haven't changed this.

When I first look at the example below, if I am a less experienced user, I may think that the reference is actually the source_url.

reference is now named doc_url

I also may be overwhelmed by the number of arguments that I need to parse through.

That's unfortunately the nature of the beast. For what it's worth, we don't particularly enjoy the complexity either. But in order to be a versatile, flexible tool that can cope with arbitrarily-structured repositories from different data providers, there are a lot of arguments to cover different situations.

Here are some suggestions to consider: Provide the most bare-bones of bare-bones vignettes that only use required arguments. Currently the vignettes use both required and non-required arguments.

The vignette now has four examples of writing a data source, stepping through various levels of detail and demonstrating all three of the packaged handler functions.

Consider breaking up this mega function into two parts. Maybe one is a list or vector of metadata items and two is a list of API items? Or perhaps there is another clever way to help a user generate this function call?

Not that we are aware of.

At a minimum, break up the function documentation to have two sections of arguments to make the API vs the metadata explicit.

As far as we are aware there is no way to have separate sections of arguments with roxygen. However, each mandatory function parameter here is clearly labelled as "required".

I think there is a function passed to an argument here: method=quote(bb_handler_wget). This will confuse a user because when you look up the documentation for this, you see additional arguments.

The use of quote has now been removed: these are now specified as lists that will be more familiar to users, i.e. method=list("function_name_as_a_string",argument1=value1,argument2=value2). Also the additional arguments have now been hidden from the user (see more detail about this in the response to Miles' comments below).
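
That is, the old and new styles compare roughly as follows (argument names and values illustrative):

old_method <- quote(bb_handler_wget)             # previously: a quoted call
new_method <- list("bb_handler_wget", level = 1) # now: function name as a string, plus its arguments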

Rework documentation: for instance, the earthdata function documentation page only provides a link to how to register for the website. I am left uncertain about how to actually use the API to get earth data - data. And I've used that website a good bit - granted, last year's version of it anyway. :) Vignettes that explain how to identify the source_url string would be helpful. It is not clear to me how I would set up access to an earth data download. For instance - what if I wanted to get snow cover from NSIDC? https://n5eil01u.ecs.nsidc.org/VIIRS/VNP10.001/ Or what if I wanted landsat vegetation indices? Vignettes that explain how to access specific datasets (perhaps ones identified in your use cases) would help significantly. On the earth data website it actually took me a while to find any URLs that lead to data. This isn't Bowerbird's problem - it is NASA's. BUT could this package guide a user through how to find an appropriate source_url for a dataset? Just a couple of examples...

Documentation has had a rewrite, as mentioned above. We feel that detailed tutorials on using data provider websites are beyond the scope of package documentation. Nevertheless, the vignette now includes a fuller example of writing an Earthdata data source. For the record, that URL you give above is your data source URL: it points directly to that data set.

I didn't notice this in the vignettes, and perhaps I missed it in the code: is there a summary function that gives you an overview table of all of the data in your "warehouse"? If there is, that would be useful - please add that to the vignette.

Yes, bb_summary does this. The data source list at the end of the vignette is produced using this. The vignette now draws more attention to this function.

Why is this called bowerbird? The name doesn't fully seem to relate to the library function, BUT maybe I need to read more about bowerbirds?!

Bowerbirds (real ones) collect shiny things to decorate their bowers with. Bowerbird (the package) lets you collect nice shiny data sets and add them to your collection.

How does the package handle data updates? For instance, last year USGS reprocessed all of the Landsat 8 data with the creation of collections. Thus, all of the data that I downloaded last year are different now: different naming conventions, some changes in files such as the cloud cover layers, etc. Will Bowerbird help a user identify this type of change and guide them through how they want to handle the change (i.e. re-download data, start a new library with the newly processed data)? How will it do that?

Bowerbird just downloads whatever the data provider makes available, following the data source's configuration parameters. As explained in the vignette, bowerbird doesn't know in detail about the data structures or anything else: if they change, the new version will be downloaded.

Many of these datasets are extremely large. Does bowerbird help a user track resource availability on their computers, especially as it's downloading zips and unzipping? Bowerbird could be a beautiful helper package to manage data. It also could allow less experienced users to make big mistakes in terms of trying to download data that are too large for their computing resources.

Each data source has a collection_size parameter that gives an indication of the disk space required. The bb_sync function will now also prompt for confirmation on large downloads. Beyond that, we do not attempt to manage a user's own disk - that's the user's responsibility.

Can bowerbird help a user not make huge requests, or provide a warning? The beauty of this package is the ease with which you can download and maintain data. The danger is the same: it's very easy to kick off some huge calls that may fill a user's hard drive.

A confirm_downloads_larger_than parameter has been added, which defaults to 100MB. If R is running in interactive mode, user confirmation will be sought before initiating downloads larger than this.
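In practice this might look something like the following (a sketch: the parameter name is as described above, but its placement on bb_config, and the assumption that its value is in MB given the 100 MB default, are ours):

```r
## prompt before any download estimated at more than 1000 MB;
## the default threshold is 100 MB (value assumed to be in MB)
cf <- bb_config(local_file_root = "~/data",
                confirm_downloads_larger_than = 1000)
```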

In general I have a question that I'm sure everyone involved has grappled with: do we want to encourage people to store data locally? Generally remote sensing data are so large that we may not even want to process them locally. Things are moving towards the cloud, and we do want to avoid people starting to build their own mini DAACs on their machines.

Yes, as a general statement it's probably reasonable to encourage users to avoid individually downloading large data sets (especially multiple individuals within the same institution downloading the same large data set). But at some point someone has to store data locally. For example, we routinely run analyses that require the entire satellite data record of some parameter (e.g. sea ice or SST). This can't be done by issuing remote data requests; it would be cripplingly slow. Yes, we can run these analyses "in the cloud", and we have virtual servers for exactly this purpose ... but then where does that virtual server get its data from? That data storage ideally needs to be on the same local network as the compute facility so that access is fast. Someone, somewhere, has to maintain that collection of data files. This is one use case for bowerbird. And having one data librarian maintain a collection of data sets on behalf of a number of local users will reduce the need for individuals to download their own copies.

I didn't see a contributing file or description of how contributing works for this package. I also don't see bugreports / maintainer

The bugreports URL is in the DESCRIPTION file. Maintainer is also there (but it's hidden! The "cre" (creator) role in Authors@R gets interpreted as maintainer). CONTRIBUTING.md and code of conduct files have been added.

Is this being submitted to JOSS? How do I know that?

No, it's not. (You can see that in the submission details in the onboarding issue - I don't know if as a reviewer you get any special notification of that though).

NEWS is missing

There is no NEWS file yet: to date, the package has been in an active development stage with everything liable to change, so it doesn't make sense to document all this via NEWS. We will add a NEWS file once the onboarding process is complete and things seem stable enough to warrant it.

Miles - Review Comments

In the self-titled vignette, the use of the postprocess argument is only mentioned in passing with respect to decompression. I'd argue it is major functionality with enough use cases that it deserves its own subheading.

It now has a subheading.

As I mention in the function documentation section, this package could benefit from a greater breadth of examples with commentary that explains how features of the data source result in each choice of method flag and postprocessing function.

See the response to Leah's review above - the README/vignette now has four examples of increasing complexity, demonstrating all three of the packaged handler functions.

Improve function documentation

General improvements all round have been made.

bb_settings - Documentation does not explain the use case for this function. It is not clear why it would be needed in addition to bb_config. The oceandata example made this clearer.

Function documentation now explains its role a little more clearly.

bb_validate - Documentation does not explain what this function does or is used for.

This function is now not exported - it is useful internally but probably not for the user.

bb_cleanup - The use of the bb_config argument is mystifying. The function is specified as part of the config?

bb_decompress - As above.

For these functions, as well as the handler functions (bb_handler_wget, bb_handler_oceandata, bb_handler_earthdata), there are several arguments that are passed automatically when bb_sync is run, and which don't need to be specified by the user. Previously these were exposed as regular function arguments, which was indeed confusing. These have now been masked from regular users by use of the dots argument. Note that developers who wish to write their own handler functions will need to know about these arguments: there is a section of the vignette that explains this. But regular users, who will almost certainly be the overwhelming majority of the user base, will find the new structure much less confusing.

The documentation for the handler and other functions that are called by bb_sync now also makes it clear that these functions are not intended to be called directly by the user, but specified as part of a data source definition.
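For developers who do want to write their own handler, a minimal sketch of the shape such a function might take, assuming the special arguments (config, verbose, local_dir_only) arrive via the dots mechanism described above. The names and return convention here follow this thread's discussion and are not a definitive template:

```r
## hypothetical custom handler; special arguments arrive via dots
my_handler <- function(...) {
    args <- list(...)
    config <- args$config              ## passed automatically by bb_sync
    verbose <- isTRUE(args$verbose)    ## passed automatically by bb_sync
    if (isTRUE(args$local_dir_only)) {
        ## caller only wants the local directory this source maps to
        return("/path/to/local/dir")   ## placeholder
    }
    if (verbose) cat("downloading ...\n")
    ## ... perform the actual download here ...
    invisible(TRUE)  ## success; the return convention is an assumption
}
```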

bb_fingerprint() - Unclear how this function is expected to be used. It gets a mention in the 'Data provenance' vignette, but that does not explain how the output should be used to detect changes.

Data provenance, and frameworks to help manage it, are evolving areas. The bb_fingerprint function was added to bowerbird following our presubmission enquiry, during which Noam asked for this sort of functionality to be added. Unfortunately, and as explained in the README, bowerbird doesn't know a great deal about the data sources that it is downloading. It generally just starts at the top-level URL and recurses downwards. So there is a real limit to how much it can help with data provenance. The bb_fingerprint function basically provides a snapshot of a data source's files. It's not clear to us how useful this is, and it's not intended to be a mature aspect of the package. As R packages for data provenance evolve, it may become clear how to better integrate the bb_fingerprint-style functionality with those frameworks, and we can work towards that. Alternatively, it might become clear that bb_fingerprint is basically useless, and we can drop it.
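For the curious, the intended usage pattern is roughly the following (a sketch: we assume bb_fingerprint takes the same configuration object as bb_sync, and the "filename" column is an assumption about the returned structure):

```r
## snapshot of the files belonging to the configured data sources
fp1 <- bb_fingerprint(cf)
## ... some time later, after another bb_sync ...
fp2 <- bb_fingerprint(cf)

## comparing the two snapshots shows files that have appeared;
## size/hash comparisons would show files that have changed
setdiff(fp2$filename, fp1$filename)
```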

bb_handler_wget - Unclear on the effect or valid use case for the local_dir_only arg. It did not appear to affect the operation of bb_sync.

It is essentially an internally-used argument, which (like the config and other automatically-passed arguments to handler functions) has now been hidden from the user.

bb_source - The documentation for this key function is not sufficient for me to be able to use it confidently. access_function sounds like it is important but does not appear in any examples, and the explanation is vague. The example uses some method flags that are not explained. I think this needs more examples and more commentary on said examples. Explain the nature of the web page and data, and how that resulted in the choices of each method_flag and postprocess step. Users would benefit from multiple examples of increasing complexity so they can get a sense of the use-space of bowerbird.

Function documentation has in general been improved, and the vignette examples give more detail.

Explain the config first argument: either the help or the vignettes (or both) need to clarify the mysterious first argument of bb_handler_wget, the bb_decompress family, and bb_cleanup. The vignette examples and the built-in example data sources use these functions without supplying this first argument, even though it has no default. It was only after digging into the code of bb_sync that I saw why this argument exists and how it is populated.

See above: now hidden from the user.

On a related but more general note, you might also cover in a vignette that anyone needing to implement their own handler needs to write a function that must take the special arguments config, verbose, and local_dir_only in addition to any context-specific args.

Ditch dplyr or update to tidyeval

dplyr has been ditched!

Unusual usage of quote

Now removed: the parts of data source definitions that need to specify functions (and possibly arguments) now do so via a list("function_name_as_string",argument1=value1,argument2=value2) type syntax.

Make some wget method flags into arguments: in bb_config you did a really nice thing where you created a local function argument (clobber) to provide a nicer API to the wget clobber arguments. The help for this argument is also quite detailed, giving me confidence I could use it correctly if need be. I think this pattern is the bar that the rest of the package should aspire to.

The bb_wget function now has explicit arguments for a number of common wget command line parameters. The function documentation for each of these explains their use more fully (generally, the underlying wget help text is copied or paraphrased, with additional bowerbird-specific explanation in some cases). This change (along with dropping quote) also makes the data source syntax more appealing, because these arguments can be passed as regular list elements.
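So a method specification can now carry wget options as ordinary named list elements. The particular argument names below are illustrative assumptions, standing in for wget's --level and --accept options:

```r
## wget options as named arguments rather than raw flag strings
## (these argument names are illustrative)
method <- list("bb_handler_wget",
               level = 3,         ## cf. wget --level: recursion depth
               accept = "*.nc")   ## cf. wget --accept: only keep .nc files
```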

What you would be saying to us novices is: "Don't worry, I've understood all the wget arguments so you don't have to. These are the ones you care about and this is how you use them."

We like your optimism! wget still gets us scratching our heads on occasion. But for all its faults and complexity, it is a powerful tool.

As things are, there is also some weirdness with the fact that some wget flags are handled automatically by the handler (e.g. --no-if-modified-since, --restrict-file-names=windows, --progress=dot:giga) while others are expected to be supplied. What happens if they clash?

There are a number of wget flags that can clash, and other than trying them all in combination we are not aware of any way of knowing how those clashes are resolved. That's really an issue with wget itself.

I did not find top-level documentation using ?bowerbird as per rOpenSci packaging guidelines.

Added.

Widespread usage of cat(sprintf(...)) for errors/warnings etc. Should replace with message()/stop() as per guidelines.

This is a little more complicated than it may seem. First, the majority of these cat usages are for progress information, not for errors or warnings. We don't think that using message for these is ideal: firstly, message (according to its help) is intended for "diagnostic messages which are neither warnings nor errors, but nevertheless represented as conditions". Vague though this definition is, we don't think progress information fits into this category. Secondly, and perhaps more importantly, bowerbird will be used in unattended/scheduled job mode, for which users will want to capture all output into a log file. Using cat(...) consistently means that a simple sink by the user will capture all of this output. The guidelines that suggest using message make this suggestion because it's easier for the user to suppress output if they want to. For bowerbird, specifying verbose=FALSE (which is now the default setting) will suppress all of this output.

There are, however, some instances where we use cat for what appear to be errors or warnings. The reasons for this are again because of the unattended usage model, and also because of the sequential nature of the sync process. bb_sync has a catch_errors parameter. If TRUE (which it is by default) then errors are caught and the immediate action is cancelled, with the error message issued via cat - but the overall sync process continues. Otherwise an error in one data source would prevent all others from running. Issuing a warning would tend to be unhelpful in this application as well, because the user would not receive the warning in the context of its accompanying progress information [with bb_sync(...,verbose=TRUE)] and it would be very difficult to use that information to track down problems. Even more annoyingly, some parts of the bb_sync process generate things that appear to be errors, but which are isolated failures in some part of the process. Stopping on these is not appropriate, because it would halt the entire sync process for that data source. An example of this is calling wget to do a recursive download. If there is a 404 server response to any request during that download, wget will return a non-zero status code. Often these 404s aren't errors that matter (e.g. broken external links), so we don't stop.

For functions outside of the bb_sync chain (which won't typically be run in unattended mode, and which aren't issuing "progress information" as above), we do use message, stop, and warning.

We've gone through all usages of cat and ensured that they are consistent with the above logic.
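The unattended-usage pattern we have in mind is plain base R along these lines (the log file name is arbitrary; cf is a configuration object as in the earlier examples):

```r
## e.g. from a scheduled job: capture all progress output into a log file.
## cat() writes to stdout, which sink() captures, so nothing is lost
sink("bowerbird_sync.log")
bb_sync(cf, verbose = TRUE)
sink()
```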

Test coverage is good overall but sticks mostly to 'happy path'. Untested error paths appear in sync.R, utils.R, wget.R, postprocess.R etc. I feel like a jerk for bringing this up. I know how unfun it is to write test cases for these.

We've added a few more error path tests, and will continue to add appropriate tests to the suite as development progresses.

No Maintainer in DESCRIPTION.

Actually there is - the "cre" (creator) role in Authors@R gets interpreted as maintainer.

maelle commented 6 years ago

Thanks! @lwasser @MilesMcBain could you please have a look at the answer?

Regarding the tidyverse style, adding white space will make the code much easier to read, and using styler it should be a single function call (styler::style_pkg()). 😉

lwasser commented 6 years ago

Hi @maelle, I think (without actually going through another review) that their feedback looks very thorough and patient considering all of the review comments :) I also didn't know about styler!!! So I'm excited to try that. I have another review that I'm supposed to do by Feb 10, so please consider that if there is another round with bowerbird! I'll likely be doing the MODIStsp review close to the 10th given my current schedule :)