kbenoit / sophistication

R package associated with Benoit, Munger and Spirling (2017) paper(s)

Data Corpora Exceeding CRAN Size Limit #17

Open gmhurtado opened 4 years ago

gmhurtado commented 4 years ago

As far as I can tell, the data corpora currently included in Sophistication (e.g., data_corpus_fifthgrade.RData, data_corpus_partybroadcasts.RData, etc.) are larger than the CRAN limit of 5 MB.

I experimented with using a drat repository (the idea in this article https://journal.r-project.org/archive/2017/RJ-2017-026/index.html shared with me by @ArthurSpirling) as a possible solution. The basic idea is that a code package submitted to CRAN can interact with a larger data package in a drat repository hosted on GitHub. 

I implemented the first two steps outlined in the article and created a drat repository and posted a data package with the Sophistication corpora in it (here: https://github.com/gmhurtado/drat/tree/gh-pages), which all worked as expected. This implementation is by no means a ‘formal’ effort and only temporary, as I forked the original drat repository (https://github.com/eddelbuettel/drat) to my own account and did not include documentation.

The next step of this solution would be to update the source code’s DESCRIPTION file to reflect the new dependency by listing the new data package as a 'suggested' package, as well as adding the drat repository address to the file. Instructions for installing the data package might be useful as well, but are not necessary.
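As a rough sketch of the DESCRIPTION change described above, the relevant fields might look like the fragment below. The data package name `sophisticationdata` is hypothetical, and the repository URL assumes that the drat fork linked above is served through GitHub Pages at the usual `username.github.io/repo` address; neither is confirmed.

```
Suggests:
    sophisticationdata
Additional_repositories: https://gmhurtado.github.io/drat
```

`Suggests` keeps the data package optional, so the code package can still be installed and checked without it, and `Additional_repositories` tells CRAN's checks where the non-CRAN dependency can be found.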

Finally, the source code should be modified to customize behavior when the data package is loaded. In particular, the source code should check whether or not the data package has been installed upon loading. Additionally, if any of the source code's functions, tests, examples, or the like are conditional on data from the data package being loaded, the corresponding code should be updated to check for the package's installation.
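A minimal sketch of that check, assuming the same hypothetical data package name (`sophisticationdata`) and an assumed GitHub Pages address for the drat repository. The helper names `has_data_pkg()` and `require_data_pkg()` are made up for illustration and are not part of the existing package.

```r
# Hypothetical name for the drat-hosted data package (not an existing package).
data_pkg <- "sophisticationdata"

# TRUE if the named package is installed and loadable, without attaching it.
has_data_pkg <- function(pkg = data_pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Would live in R/zzz.R of the code package: when the package is attached,
# tell the user how to get the data package if it is missing.
# The repos URL is an assumption, not a confirmed address.
.onAttach <- function(libname, pkgname) {
  if (!has_data_pkg()) {
    packageStartupMessage(
      "Optional data package '", data_pkg, "' is not installed. Install it with:\n",
      "  install.packages('", data_pkg, "', ",
      "repos = 'https://gmhurtado.github.io/drat', type = 'source')"
    )
  }
}

# Guard for functions, tests, and examples that depend on the corpora.
require_data_pkg <- function() {
  if (!has_data_pkg()) {
    stop("This function needs the '", data_pkg, "' package.", call. = FALSE)
  }
  invisible(TRUE)
}
```

Functions that use the corpora could call `require_data_pkg()` first, and tests or examples could be wrapped in `if (has_data_pkg()) { ... }` so that they are skipped rather than failing when the data package is absent.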

If a drat repository is the best solution for solving the data size issue, then I can replicate the drat repository implementation formally, and write out the explicit changes needed for the source code package.

kbenoit commented 3 years ago

This is a good idea! A newer alternative that I've seen is to use the pins package.

We really only need the following in the core package:

It's really just the first two that are the problem.

gmhurtado commented 3 years ago

Thank you for the feedback and advice! I looked into the pins package and tried pinning some sources to a GitHub board. I came to the (possibly incorrect) conclusion that the pins I made could only be shared with individuals whom I had granted repo access to on GitHub. I also found that pins could be shared using RStudio Connect as opposed to a GitHub or Kaggle board, but I believe that this requires purchase.

I had a hard time finding examples of individuals who used pins to host data for a code package, so my understanding of pins and its usage may very well be flawed or incomplete. However, based on what I understand about pins so far, sharing the pinned datasets may either be cumbersome or come at a cost.

If my reasoning is incorrect, or if you have any resources you might be able to share about using pins for hosting data packages, I would really appreciate any guidance.

Thank you for your time!

kbenoit commented 3 years ago

I've never used pins, but it looks like it's possible to park the files on a website, register this as a "board", and then allow any user access. See https://pins.rstudio.com/articles/boards-websites.html.

If I'm wrong then maybe the drat method is better.