ropensci / software-review

rOpenSci Software Peer Review.

etl: framework for medium data #140

Closed beanumber closed 7 years ago

beanumber commented 7 years ago

Summary

Package: etl
Type: Package
Title: Extract-Transform-Load Framework for Medium Data
Version: 0.3.6
Date: 2017-07-20
Authors@R: c(
    person("Ben", "Baumer", email = "ben.baumer@gmail.com",
      role = c("aut", "cre")),
    person("Carson", "Sievert", email = "cpsievert1@gmail.com", role = "ctb"))
Maintainer: Ben Baumer <ben.baumer@gmail.com>
Description: A predictable and pipeable framework for performing ETL 
    (extract-transform-load) operations on publicly-accessible medium-sized data 
    sets. This package sets up the method structure and implements generic 
    functions. Packages that depend on this package download specific data sets 
    from the Internet, clean them up, and import them into a local or remote 
    relational database management system.
License: CC0
LazyData: TRUE
Imports:
    DBI,
    datasets,
    downloader,
    lubridate,
    methods,
    stringr,
    readr,
    utils
Depends:
    R (>= 2.10),
    dplyr
Suggests:
    airlines,
    dbplyr,
    knitr,
    RSQLite,
    RPostgreSQL,
    RMySQL,
    MonetDBLite,
    ggplot2,
    testthat,
    rmarkdown
URL: http://github.com/beanumber/etl
BugReports: https://github.com/beanumber/etl/issues
RoxygenNote: 6.0.1
VignetteBuilder: knitr

reproducibility, because the extensions of this package will lead to reproducible medium data sets used in research

data retrieval, since the extensions of this package download data

data munging, since the extensions of this package transform raw data into CSVs

maelle commented 7 years ago

Thanks a lot for your submission @beanumber! We (rOpenSci onboarding editors) discussed the fit of the package and don't think it's in scope. In particular, we couldn't see the scientific application of the package, or why it would lead to more reproducibility than other approaches. If you disagree with this decision, feel free to provide us with a more specific/descriptive explanation.

Don't hesitate to submit other packages in the future, potentially starting with a pre-submission enquiry in this same repo.

beanumber commented 7 years ago

@maelle Thanks for your response. I suppose it's hard to see the value of this package on its own, since it's something of a meta-package. The idea is that the suite of packages that depend on this package will provide a consistent, robust user experience, instead of a collection of packages that all work in idiosyncratic ways.
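
To make the meta-package idea concrete, here is a toy sketch in base R. It is not the etl API: `new_pipeline`, `extract_step`, `transform_step`, `load_step`, and the `pipe_cars` class are all hypothetical names made up for illustration. The framework side defines the generics and a shared default. The extension side only fills in the steps that are specific to its data source.

```r
# Toy sketch in base R only. None of these names come from the etl package.

# --- "Framework" side: a constructor and three S3 generics -----------------
new_pipeline <- function(name) {
  structure(
    list(name = name, raw = NULL, clean = NULL, db = list()),
    class = c(paste0("pipe_", name), "pipe")
  )
}

extract_step   <- function(obj, ...) UseMethod("extract_step")
transform_step <- function(obj, ...) UseMethod("transform_step")
load_step      <- function(obj, ...) UseMethod("load_step")

# Shared default: every extension inherits the same load behaviour.
# A plain list stands in for a database connection here.
load_step.pipe <- function(obj, ...) {
  obj$db[[obj$name]] <- obj$clean
  obj
}

# --- "Extension" side: a hypothetical data package -------------------------
# It supplies only the source-specific extract and transform steps.
extract_step.pipe_cars <- function(obj, ...) {
  obj$raw <- datasets::mtcars
  obj
}

transform_step.pipe_cars <- function(obj, ...) {
  out <- obj$raw
  out$model <- rownames(out)  # move row names into a regular column
  rownames(out) <- NULL
  obj$clean <- out
  obj
}

# Every extension is driven by the same three calls, in the same order.
cars <- new_pipeline("cars")
cars <- load_step(transform_step(extract_step(cars)))
```

A second extension would define its own `extract_step` and `transform_step` methods and reuse everything else, so a user who has learned one data package already knows how to run the next.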

nicholasjhorton commented 7 years ago

For me, the etl package is attractive since it provides a way for people to share data in a principled fashion (even if the individual files being shared are larger than 50MB). This gets around the 5MB (a crazy low value) recommended package size on CRAN and also simplifies the use of a GitHub package install to create a specific environment.

maelle commented 7 years ago

Thanks both, but it's still unclear how this improves reproducibility (data provenance, versioning, etc.) compared to existing approaches, e.g. this one for big datasets.

Our saying the package is out-of-scope doesn't mean it's useless, of course!

beanumber commented 7 years ago

I don't know if this will change your position @maelle, but I've posted the long-form article for this on the arXiv. The manuscript explains the package itself and the purpose of the package in far greater detail.

maelle commented 7 years ago

Thanks @beanumber. The editors' position was that we couldn't see how the package helps reproducibility compared to existing approaches. Is there a part of the article dealing more specifically with this? In any case, good work.

Oops edited now that I see the title of the paper :man_facepalming:

beanumber commented 7 years ago

Section 2 addresses this. You might be interested in Section 4.2 to see an example of how this might work in practice.

maelle commented 7 years ago

@beanumber, thanks for providing us with the link to your manuscript.

After discussion within the editorial team, we still think that etl, albeit very useful, is out of scope for rOpenSci, and here is why:

which isn't the case for etl. etl doesn't help track the provenance or versions of a dataset.

We do, however, encourage the development of your package and your efforts to make it better known. We suggest you submit it to the R Journal.

Thanks again, and don't hesitate to ask any questions.