
Presub enquiry: bowerbird #129

Closed: raymondben closed this issue 7 years ago

raymondben commented 7 years ago

See https://github.com/AustralianAntarcticDivision/bowerbird

This is a data retrieval package, but I'm enquiring as to whether this fits within your desired scope for such packages. Specifically, it facilitates downloading and maintaining a local data repository from one or more third-party data providers. It comes with (at last count) 50-odd predefined data sources, generally themed around marine and Antarctic environmental data. The basic usage is to define a configuration of one or more sources that you want mirrored locally, then run the synchronisation function at daily or other regular intervals to keep your local copy in sync with the remote masters. Most of the download functionality is achieved by passing calls to wget and asking it to mirror the remote source, although a few of the predefined sources have more elaborate handlers.
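
Roughly, in code, the workflow looks like this (a schematic sketch rather than one of the packaged definitions: the URLs below are placeholders, and the exact function and argument names may differ slightly from the released interface):

```r
library(bowerbird)

## configuration: where the local mirror lives
cfg <- bb_config(local_file_root = "/data/bowerbird")

## a (placeholder) data source definition; the package ships ~50 of these predefined
src <- bb_source(
  name = "Example remote dataset",
  id = "example-dataset-id",
  description = "Illustrative source definition",
  doc_url = "https://example.org/dataset/docs",
  source_url = "https://example.org/dataset/files/",
  citation = "Citation for the data provider",
  license = "Please cite the data provider",
  method = list("bb_handler_wget", recursive = TRUE, level = 1)
)

## add the source and synchronise; re-running the sync on a schedule
## (e.g. a daily cron job) keeps the local copy up to date
cfg <- bb_add(cfg, src)
bb_sync(cfg)
```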

It seems to me that this provides a nice complement to existing ropensci packages that provide on-demand retrieval of remote data, e.g. rerddap. However --- and hence this enquiry --- you might view it as being in the "too general" category described in the onboarding doc (i.e. it's something like a service rather than a source or topic).

It is intended to be relatively easy for users to write their own data source configurations (and there are no particular constraints on what those data sources could be --- anything that can be retrieved by wget). In that sense the package could certainly be viewed as a "general" data retriever. But by the same token, it could be viewed as a scaffold package upon which more themed/targeted data retrieval packages could be built, and so in that context I think it would be a useful addition to the ropensci packageverse.

Thoughts?

Update: @mdsumner reminded me of https://github.com/ropensci/rdataretriever, which is superficially similar. I don't think there's irresolvable overlap, though: rdataretriever seems to be angled towards biodiversity data sets in particular and creating sensible local database structures for them. Bowerbird has no particular thematic orientation (although the pre-packaged definitions are environmental/marine/Antarctic) and simply mirrors remote data locally, without attempting to restructure it. Bowerbird only requires wget (and will install it on Windows if you don't have it) whereas rdataretriever requires the (python) retriever package.

maelle commented 7 years ago

Thanks for your enquiry @raymondben! In order to help the discussion, could you provide us with some examples of scientific use cases?

raymondben commented 7 years ago

Sure. This underpins a bunch of science work through Australian and international Antarctic programs, and based on our experiences I'd envisage similar uses in many other science fields.

@mdsumner and I use this package (and its predecessor raadsync) to maintain a repository of Southern Ocean environmental data from satellite and model sources. This is used by a range of Southern Ocean researchers in our local community at IMAS, ACE-CRC, and the AAD. It's local and hosted on the same infrastructure as our compute fabric, so it's fast and can therefore be used for a range of science work that can't be done by pulling remote data on demand. It also means that researchers don't need to maintain their own copies of these data, which saves storage and bandwidth, fosters local science community expertise, and so on.

Some examples of the types of science that are enabled by this:

  1. We produce a range of derived summary layers (long-term means and variabilities, Southern Ocean-specific products; e.g. these and these). These are intended for use in species distribution modelling, regionalisation, ecosystem modelling, and similar. Production of these summaries requires full data collections (e.g. the entire satellite sea ice data record) stored locally, and so bowerbird takes care of the mechanics of maintaining that collection. Some example papers using these data.

  2. Animal tracking is a very common use case. Tracking of animals by GPS or similar tags is a cornerstone of marine ecosystem science, and their movement and behaviour is typically examined in the context of their environment using both localized (i.e. matched to the animal's track through time and space) and broader regional environmental data. These analyses typically require local data copies, both because on-demand extraction becomes inefficient quite quickly and because many of the environmental data are processed in some way to make them more biologically relevant, similar to the derived layers described above (e.g. the time since the sea ice last melted is a common example, which requires establishing a time history of sea ice data at every location of interest). (Here are some example papers).

  3. Assessment of climate change impacts, designing monitoring programs, conducting risk assessments, evaluating extreme events, and ecosystem modelling all typically require access to regional-scale (whole-ocean-basin) data across the full satellite/model era, and so can't be conducted without a local archive of such data.

  4. Packages such as RAADtools, tidync and R ocean colour are built around having direct access to data files. For these packages (and their research objectives) to be useful to a wider community, other researchers need to be able to build their own local repository of data files. Bowerbird provides a straightforward way for them to do that: a bowerbird configuration for a particular set of data files can be distributed to users so that they can easily create their own local copy (a sketch of how that might look is below).
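
To illustrate that last point schematically (`my_theme_sources()` is a hypothetical helper, not an existing function of RAADtools, tidync, or bowerbird): a themed package could export its required data sources as bowerbird source definitions, and its users would then need only a few lines to build and maintain their own mirror.

```r
## hypothetical helper exported by a themed package: returns bowerbird
## source definitions for the datasets that the package works with
my_theme_sources <- function() {
  list(
    bb_source(
      name = "Sea ice concentration (placeholder)",
      id = "example-seaice-id",
      description = "Illustrative source used by the themed package",
      doc_url = "https://example.org/seaice/docs",
      source_url = "https://example.org/seaice/data/",
      citation = "Data provider citation",
      license = "Please cite the data provider",
      method = list("bb_handler_wget", recursive = TRUE, level = 2)
    )
  )
}

## user side: build a local mirror of everything the themed package needs
library(bowerbird)
cfg <- bb_config(local_file_root = "/data/bowerbird")
for (src in my_theme_sources()) cfg <- bb_add(cfg, src)
bb_sync(cfg)
```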

@mdsumner, anything else to add?

mdsumner commented 7 years ago

I'd just add to 4) the "meta benefit" we've found in having these data available locally: reduced cognitive load upfront when a new application or research request comes along. If we already have the data, there's no obstacle to getting started immediately with the data extraction or processing task. If we don't have the data, we know we can configure bowerbird to go and get it, so we can have enough of the collection to try new ideas within a few hours. This has also improved our development of domain-specific code, because we automatically have a rich test bed for ensuring our read/extraction/processing schemes are efficient and scalable. We can provide a high-level front end that hides all the details, or give others access to the raw data so they can make their own decisions about any aspect.

noamross commented 7 years ago

If we separate out the central functionality from the data sources included, this package is closer to the "reproducibility" category, and in some ways similar to the "database wrapper" category. As such, it would probably have a stronger case if there were some mechanism for versioning/provenance, or at least compatibility with such frameworks. The general sense of the editors is that yes, we would send this for review, but we would assign reviewers with some expertise in versioning/provenance and expect them to provide suggestions on this front. If there is some of this in there, do highlight it in your documentation.

A couple of other thoughts:

raymondben commented 7 years ago

Thanks @noamross - could you perhaps expand a little on what you are thinking of in terms of provenance (I have my interpretation of that, but I'm not sure it's what you're thinking of). Which frameworks do you have in mind here?

noamross commented 7 years ago

Primarily: when one does an analysis using synced files, how are the original source, date, and version recorded and maintained? How can a version of an analysis be linked to a version of the data if the data are maintained separately from the analysis code?

Some possible solutions would be

This is an underdeveloped area and I really wouldn't expect your package to solve it. You might already have a workflow that addresses some of these things that could be expanded on in a vignette, or just formalized into helper functions in the package.
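
For instance, one helper function of the kind I mean might simply write a manifest of the synced files (path, modification time, content hash) so that an analysis can record exactly which state of the local repository it ran against. A rough sketch only, not an existing bowerbird function, assuming the digest package for hashing:

```r
library(digest)

## record a manifest of a local data collection: one row per file with a
## content hash and timestamps, so an analysis can cite the repository state
write_sync_manifest <- function(data_dir,
                                manifest_file = file.path(data_dir, "manifest.csv")) {
  files <- setdiff(list.files(data_dir, recursive = TRUE, full.names = TRUE),
                   manifest_file)
  manifest <- data.frame(
    file = files,
    md5 = vapply(files, digest, character(1), file = TRUE),
    modified = file.mtime(files),
    recorded = Sys.time(),
    stringsAsFactors = FALSE
  )
  write.csv(manifest, manifest_file, row.names = FALSE)
  invisible(manifest)
}
```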

raymondben commented 7 years ago

OK, thanks. We can go some way down this road (e.g. our data source definitions already use the source's DataCite DOI, if it has one). Certainly happy to discuss this in e.g. a vignette, since we've talked a lot about it amongst ourselves and colleagues.