ropensci / phylotaR

An automated pipeline for retrieving orthologous DNA sequences from GenBank in R
https://docs.ropensci.org/phylotaR

Amount of data accessible with this pkg? #42

Closed sckott closed 5 years ago

sckott commented 5 years ago

👋 As part of preparing an rOpenSci annual report, we're trying to estimate the amount of data that the various pkgs in our suite provide access to.

Do you have a sense for how much data (e.g., in GB) one can access through this pkg? Or in whatever metric is most relevant for this data (sequences, maybe)?

sckott commented 5 years ago

any ideas here @DomBennett ?

DomBennett commented 5 years ago

Hi @sckott,

Yup, sequences!

The package acts as a portal to NCBI GenBank, which, as of August 2019, hosts some 213,865,349 sequences. But the package likely makes use of the WGS information too, as it will pull out any relevant annotated sequence. So maybe your best data metric is number of bases: 366,733,917,629 + 5,585,922,333,160 = ~6e+12
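
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the first figure is the base count for the regular GenBank records and the second is the WGS base count, following the order they are mentioned in. The size estimate at the end is only a rough guess based on one byte per base, which is an assumption and not a number from NCBI.

```python
# Base counts quoted in this thread (GenBank, August 2019).
# Assumption: first figure = regular GenBank records, second = WGS records.
genbank_bases = 366_733_917_629
wgs_bases = 5_585_922_333_160

total_bases = genbank_bases + wgs_bases
print(f"{total_bases:,} bases (~{total_bases:.0e})")

# Assumption (not from the thread): 1 byte per base, uncompressed,
# ignoring headers and annotations. Real on-disk sizes will differ.
BYTES_PER_BASE = 1
approx_terabytes = total_bases * BYTES_PER_BASE / 1e12
print(f"~{approx_terabytes:.2f} TB at {BYTES_PER_BASE} byte per base")
```

Under that one-byte-per-base assumption the total comes out at roughly 6 TB of raw sequence, which may be a usable stand-in if the report needs a figure in GB.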

For the most part, the phylotaR package uses the rentrez package. So whatever stats you pull up for that package, with respect to GenBank, apply to this one too.

sckott commented 5 years ago

thanks for this! David did give me some numbers for rentrez, but that bases estimate is a nice one he didn't have.