Closed kaneplusplus closed 3 years ago
Great point. I just pushed a new version that locally passes CRAN checks on my machine.
I am currently just using the (now) 1% sample for the NLP vignette. Happy to switch that out if you would like. I was thinking that it would be nice to have a final vignette with a more focused case study, maybe on the cancer studies? Or perhaps COVID vaccine trials?
Can't you just download the data in the vignette? If it's not open already you can put it on GitHub as a release...
We do have the full dataset on GitHub within a branch (a release would work too), but we wouldn't want to use that for the vignette anyway. The text analysis tools are only designed to be used on a curated subset of the clinical trials (on the order of thousands), not the full set of 300k. Even if they could handle the full set, it would take far too long to run.
Sure, but you can create the sample and publish it there for the vignette and thus not care about the size of the package.
Interesting. I guess I always thought it was good practice to have a sample data set small enough to be contained in the package for directly running examples, tests, etc.
Do you just think that's not important anymore given the ubiquity of fast internet connections and places to store medium-sized datasets?
Well, that depends - for tests/examples the data can be tiny and doesn't really have to be meaningful. For vignettes it is more common to try to have more realistic data, and it's ok to require internet access. So one way to put it - if you can provide meaningful data within the size limit then that is preferred, but if the size limit doesn't allow for more realistic and/or practically meaningful analyses then I would suggest downloads.
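As a rough sketch of the download approach, a vignette chunk could fetch the published sample and skip the analysis when the download fails. The URL and file name here are hypothetical placeholders, not the actual location of the data, and the sketch assumes the sample is stored as an `.rds` file.

```r
# Hypothetical location of the published sample (e.g. a release asset);
# replace with the real URL once the file is uploaded.
sample_url <- "https://example.com/ctgov-sample.rds"
dest <- file.path(tempdir(), "ctgov-sample.rds")

# Try the download; treat any error or warning (e.g. no internet) as a failure.
have_data <- tryCatch({
  utils::download.file(sample_url, dest, mode = "wb", quiet = TRUE)
  TRUE
}, error = function(e) FALSE, warning = function(w) FALSE)

if (have_data) {
  trials <- readRDS(dest)
} else {
  message("Sample data could not be downloaded; skipping the analysis.")
}
```

Later chunks could then use `eval = have_data` in their knitr chunk options so the vignette still builds when the machine running the checks is offline.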
The size of the 2.5% sample is too big for CRAN... especially after I added a 1.6 MB subset of recent cancer trials. Can we pare this down to 1.5%? Also, is there a better subset to grab for the NLP examples?