etl: framework for medium data

beanumber commented 7 years ago

Summary

What does this package do? (explain in 50 words or less): Facilitates predictable and pipeable ETL (extract-transform-load) operations for publicly-accessible medium data sets
Paste the full DESCRIPTION file inside a code block below:

Package: etl
Type: Package
Title: Extract-Transform-Load Framework for Medium Data
Version: 0.3.6
Date: 2017-07-20
Authors@R: c(
    person("Ben", "Baumer", email = "ben.baumer@gmail.com",
      role = c("aut", "cre")),
    person("Carson", "Sievert", email = "cpsievert1@gmail.com", role = "ctb"))
Maintainer: Ben Baumer <ben.baumer@gmail.com>
Description: A predictable and pipeable framework for performing ETL 
    (extract-transform-load) operations on publicly-accessible medium-sized data 
    set. This package sets up the method structure and implements generic 
    functions. Packages that depend on this package download specific data sets 
    from the Internet, clean them up, and import them into a local or remote 
    relational database management system.
License: CC0
LazyData: TRUE
Imports:
    DBI,
    datasets,
    downloader,
    lubridate,
    methods,
    stringr,
    readr,
    utils
Depends:
    R (>= 2.10),
    dplyr
Suggests:
    airlines,
    dbplyr,
    knitr,
    RSQLite,
    RPostgreSQL,
    RMySQL,
    MonetDBLite,
    ggplot2,
    testthat,
    rmarkdown
URL: http://github.com/beanumber/etl
BugReports: https://github.com/beanumber/etl/issues
RoxygenNote: 6.0.1
VignetteBuilder: knitr

URL for the package (the development repository, not a stylized html page): http://github.com/beanumber/etl
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

- reproducibility, because the extensions of this package will lead to reproducible medium data sets used in research
- data retrieval, since the extensions of this package download data
- data munging, since the extensions of this package transform raw data into CSVs

Who is the target audience?
- R developers for the etl package itself
- R users for etl-dependent packages
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
No. This package depends heavily on dplyr and dbplyr, but it provides functionality specific to the ETL process that is not present in either.

Requirements

Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

[x] Do you intend for this package to go on CRAN?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:
- [ ] The package contains a paper.md with a high-level description in the package root or in inst/.
- [ ] The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)

Detail

[x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

maelle commented 7 years ago

Thanks a lot for your submission @beanumber! We (rOpenSci onboarding editors) discussed the fit of the package and don't think it's in scope. In particular, we couldn't see the scientific application of the package, or why it would lead to more reproducibility than other approaches. If you disagree with this decision, feel free to provide us with a more specific/descriptive explanation.

Don't hesitate to submit other packages in the future, potentially starting with a pre-submission enquiry in this same repo.

beanumber commented 7 years ago

@maelle Thanks for your response. I suppose it's sort of hard to see the value of this package on its own, since it's sort of a meta package. The idea is that the suite of packages that depend on this package will provide a consistent, robust user experience, instead of a collection of packages that all work in idiosyncratic ways.
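
To make the "meta package" idea concrete, here is a minimal base-R sketch of the pattern: a core package defines generic, pipeable steps with sensible defaults, and a data-specific package supplies only the methods that differ. The names `new_source`, `step_extract`, `step_transform`, and `step_load` are hypothetical and chosen for illustration; they are not etl's actual functions, and a named list stands in for the database.

```r
# Illustrative only: these generics are hypothetical, not the etl package's API.
# Requires R >= 4.1 for the native pipe |>.

# Constructor: the class vector lets a data-specific package override any step.
new_source <- function(name, dir = tempfile("etl_sketch_")) {
  dir.create(file.path(dir, "raw"), recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(dir, "load"), recursive = TRUE, showWarnings = FALSE)
  structure(
    list(name = name, dir = dir, tables = list()),
    class = c(paste0("src_", name), "src_generic")
  )
}

# The "core package" part: one generic per ETL phase.
step_extract   <- function(obj, ...) UseMethod("step_extract")
step_transform <- function(obj, ...) UseMethod("step_transform")
step_load      <- function(obj, ...) UseMethod("step_load")

# Default transform: copy raw files through unchanged.
step_transform.src_generic <- function(obj, ...) {
  raw <- list.files(file.path(obj$dir, "raw"), full.names = TRUE)
  file.copy(raw, file.path(obj$dir, "load"), overwrite = TRUE)
  invisible(obj)
}

# Default load: read every CSV into a named list (a stand-in for a database).
step_load.src_generic <- function(obj, ...) {
  csvs <- list.files(file.path(obj$dir, "load"),
                     pattern = "\\.csv$", full.names = TRUE)
  obj$tables <- lapply(csvs, utils::read.csv)
  names(obj$tables) <- tools::file_path_sans_ext(basename(csvs))
  invisible(obj)
}

# The "dependent package" part: only the data-specific step is written by hand.
step_extract.src_mtcars <- function(obj, ...) {
  utils::write.csv(datasets::mtcars, file.path(obj$dir, "raw", "mtcars.csv"))
  invisible(obj)
}

# Every source is driven through the same pipeline, whatever its data.
cars <- new_source("mtcars") |>
  step_extract() |>
  step_transform() |>
  step_load()
```

Because every step returns the object it was given, a user who has learned the pipeline for one data source can run it unchanged for any other source that defines its own methods.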

nicholasjhorton commented 7 years ago

For me, the etl package is attractive since it provides a way for people to share data in a principled fashion (even if the individual files being shared are larger than 50MB). This gets around the 5MB (a crazy low value) recommended package size on CRAN and also simplifies the use of a GitHub package install to create a specific environment.

maelle commented 7 years ago

Thanks both, but it's still unclear how this improves reproducibility compared to existing approaches such as this one for big datasets (data provenance, versioning, etc.).

Our saying the package is out-of-scope doesn't mean it's useless, of course!

beanumber commented 7 years ago

I don't know if this will change your position @maelle, but I've posted the long-form article for this on the arXiv. The manuscript explains the package itself and the purpose of the package in far greater detail.

maelle commented 7 years ago

Thanks @beanumber, the editors' position was because we couldn't see how the package helps reproducibility compared to existing approaches. Is there a part of the article dealing more specifically with this? In any case, good work.

Oops edited now that I see the title of the paper :man_facepalming:

beanumber commented 7 years ago

Section 2 addresses this. You might be interested in Section 4.2 to see an example of how this might work in practice.

maelle commented 7 years ago

@beanumber, thanks for providing us with the link to your manuscript.

After discussion within the editorial team, we still think that, albeit very useful, etl is out of scope for rOpenSci, and here is why:

- It is a general data manipulation tool, not specifically aimed at retrieving or extracting a data type or source.
- The reproducibility packages that are in scope as per our policies are "Tools that facilitate reproducible research. This includes packages that facilitate use of version control, provenance tracking, automated testing of data inputs and statistical outputs, citation of software and scientific literature.", which isn't the case for etl. etl doesn't help with tracking provenance or versions of a dataset.

We however encourage the development of your package and your communication efforts for making it better known. We suggest you submit it to the R Journal.

Thanks again, and don't hesitate to ask any question.
