Closed beanumber closed 7 years ago
Thanks a lot for your submission @beanumber! We (rOpenSci onboarding editors) discussed the fit of the package and don't think it's in scope. In particular, we couldn't see the scientific application of the package, or why it would lead to more reproducibility than other approaches. If you disagree with this decision, feel free to provide us with a more specific/descriptive explanation.
Don't hesitate to submit other packages in the future, potentially starting with a pre-submission enquiry in this same repo.
@maelle Thanks for your response. I suppose it's sort of hard to see the value of this package on its own, since it's sort of a meta package. The idea is that the suite of packages that depend on this package will provide a consistent, robust user experience, instead of a collection of packages that all work in idiosyncratic ways.
For me, the etl package is attractive since it provides a way for people to share data in a principled fashion (even if the individual files being shared are larger than 50MB). This gets around the 5MB (a crazy low value) recommended package size on CRAN and also simplifies the use of a github package install to create a specific environment.
Thanks both, but it's still unclear how this improves reproducibility compared to existing approaches, such as this one for big datasets (data provenance, versioning, etc.).
Our saying the package is out-of-scope doesn't mean it's useless, of course!
I don't know if this will change your position @maelle, but I've posted the long-form article for this on the arXiv. The manuscript explains the package itself and the purpose of the package in far greater detail.
Thanks @beanumber, the editors' position was because we couldn't see how the package helps reproducibility compared to existing approaches. Is there a part of the article dealing more specifically with this? In any case, good work.
Oops edited now that I see the title of the paper :man_facepalming:
Section 2 addresses this. You might be interested in Section 4.2 to see an example of how this might work in practice.
@beanumber, thanks for providing us the link to your manuscript.
After discussion within the editorial team we still think that, albeit very useful, etl is out of scope for rOpenSci, and here is why:
It is a general data manipulation tool, not specifically aimed at retrieving or extracting a data type or source.
The reproducibility packages that are in scope as per our policies are "Tools that facilitate reproducible research. This includes packages that facilitate use of version control, provenance tracking, automated testing of data inputs and statistical outputs, citation of software and scientific literature.", which isn't the case for etl. etl doesn't help tracking provenance or versions of a dataset.
We however encourage the development of your package and your communication efforts for making it better known. We suggest you submit it to the R Journal.
Thanks again, and don't hesitate to ask any question.
Summary
What does this package do? (explain in 50 words or less): Facilitates predictable and pipeable ETL (extract-transform-load) operations for publicly-accessible medium data sets
Paste the full DESCRIPTION file inside a code block below:
URL for the package (the development repository, not a stylized html page): http://github.com/beanumber/etl
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
reproducibility, because the extensions of this package will lead to reproducible medium data set used in research
data retrieval, since the extensions of this package download data
data munging, since the extensions of this package transform raw data into CSVs
Who is the target audience?
R developers for the etl package itself
R users for etl-dependent packages
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
No. This package depends heavily on dplyr and dbplyr, but it provides functionality specific to the ETL process that is not present in either.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md with a high-level description in the package root or in inst/.
Detail
[x] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
[x] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names: