presagia-analytics / ctrialsgov

Query Data from ClinicalTrials.gov
https://presagia-analytics.github.io/ctrialsgov/

Total size is too big for CRAN #2

Closed by kaneplusplus 3 years ago

kaneplusplus commented 3 years ago

The size of the 2.5% sample is too big for CRAN, especially after I added a 1.6 MB subset of recent cancer trials. Can we pare this down to 1.5%? Also, is there a better subset to grab for the NLP examples?
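A candidate sampling fraction can be checked before committing it. The following is a minimal sketch in R, assuming the trials are held in a data frame; `sample_fraction` and `compressed_size_mb` are hypothetical helpers written for illustration, not functions from ctrialsgov.

```r
# Hypothetical helper: draw a reproducible random subset of rows.
# Note that set.seed() changes the global RNG state.
sample_fraction <- function(df, frac, seed = 1) {
  set.seed(seed)
  n <- ceiling(frac * nrow(df))
  df[sample(nrow(df), size = n), , drop = FALSE]
}

# Hypothetical helper: size in MB of an object once saved as a
# compressed .rds file, roughly what would ship inside a package.
compressed_size_mb <- function(obj) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(obj, path, compress = "xz")
  file.size(path) / 1024^2
}
```

For example, `compressed_size_mb(sample_fraction(trials, 0.015))` would report the on-disk size of a 1.5% sample, where `trials` stands in for the full data frame.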

statsmaths commented 3 years ago

Great point. I just pushed a new version that locally passes CRAN checks on my machine.

I am currently just using the (now) 1% sample for the NLP vignette. Happy to switch that out if you would like. I was thinking that it would be nice to have a final vignette with a more focused case study, maybe on the cancer studies? Or perhaps COVID vaccine trials?

s-u commented 3 years ago

Can't you just download the data in the vignette? If it's not open already you can put it on GitHub as a release...

statsmaths commented 3 years ago

We do have the full dataset on GitHub within a branch (a release would work too), but we wouldn't want to use that for the vignette anyway. The text analysis tools are only designed to be used on a curated subset of the clinical trials (on the order of thousands), not the full set of 300k. Even if we did use the full set, it would take far too long to run.

s-u commented 3 years ago

Sure, but you can create the sample and publish it there for the vignette and thus not care about the size of the package.
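A vignette could fetch such a published sample along these lines. This is only a sketch under stated assumptions: `fetch_sample` is a hypothetical helper, the URL in the usage note is a placeholder, and the cache location via `tools::R_user_dir()` (available in R 4.0 and later) is one possible choice rather than anything the package does today.

```r
# Hypothetical helper: download a sample file once and reuse the cached copy.
# `fetch` defaults to utils::download.file but can be swapped out for testing.
fetch_sample <- function(url,
                         cache_dir = tools::R_user_dir("ctrialsgov", "cache"),
                         fetch = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files such as .rds intact on Windows
    fetch(url, dest, mode = "wb")
  }
  dest
}
```

In a vignette chunk this might be used as `trials <- readRDS(fetch_sample("https://example.org/sample.rds"))`, with the placeholder URL replaced by the address of the released file. Because the file is cached, rebuilding the vignette does not download it again.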

statsmaths commented 3 years ago

Interesting. I guess I always thought it was good practice to have a sample data set small enough to be contained in the package for directly running examples, tests, etc.

Do you just think that's not important anymore given the ubiquity of fast internet connections and places to store medium-sized datasets?

s-u commented 3 years ago

Well, that depends - for tests/examples the data can be tiny and doesn't really have to be meaningful. For vignettes it is more common to try to have more realistic data, and it's ok to require internet access. So one way to put it - if you can provide meaningful data within the size limit then that is preferred, but if the size limit doesn't allow for more realistic and/or practically meaningful analyses then I would suggest downloads.