Amount of data accessible with this pkg?

sckott commented 5 years ago

👋 as part of preparing an rOpenSci annual report, we're trying to estimate the amount of data that the various pkgs in our suite provide access to.

Do you have a sense for how much data (e.g., in GB) one can access through this pkg? Or, if another metric is more relevant for this data (sequences, articles, taxa), a count in those terms?

dwinter commented 5 years ago

Hi @sckott,

Cool, I'll see if I can find estimates in GB from the NCBI. In the meantime, I just re-ran the first code block from the rentrez paper and got the following numbers of records for key databases:

The US National Center for Biotechnology Information (NCBI) is one of the world's largest and most important sources of biological data. At the time of writing, the NCBI PubMed database provided information on 30.2 million journal articles, including 5.8 million full-text records. The NCBI Nucleotide Database (including GenBank) had data for 412.7 million different sequences, and dbSNP described 686.6 million different genetic variants. Records from all of these databases can be cross-referenced with the 1.8 million species in the NCBI taxonomy, and PubMed entries can be searched for using a controlled vocabulary containing 279 thousand unique terms.

sckott commented 5 years ago

nice, thanks David. Those numbers are good enough even if you can't get a size estimate.

ropensci/rentrez, issue #142