Closed by devonorourke 5 years ago
What R version do you have, and what versions of deps? Can you just paste in what you get from devtools::session_info() or sessionInfo()?
Think I found a solution, trying it now...
Looks like there are two things going on here:

1. Arthropoda gives ~3.8 million records, and that data takes a long time to download. So just for your own sanity, it might be better to break that up into groups, as you described in your email to me.
2. The rawToChar() function. With multiple = TRUE for that function call, it can work - that gives each character as a separate element of a vector - then we can paste0(..., collapse = "") them together. I'll push this change soon.

However, per the first point, you'd probably want to do smaller data requests anyway, so that you aren't waiting for a massive request to finish (in which time the internet can go out, or you have to go home, or something) - AND massive requests are more likely to cause problems for the data provider. I'll see if there's anything I can do to warn users about data size limits...
Hi BOLD folks,
There must be a more computer-savvy approach to this, but I've essentially split up the work of downloading all COI-5P sequences into four bins, in an attempt to keep things under 1 million rows per file:
- download all animal sequences except arthropods
- download all arthropod sequences except insects
- download all insect sequences except lepidopterans
- download all the damn moths.
The non-computer-savvy way I get it to work is by going to BOLD's taxonomy page and copying and pasting all the animal taxa listed (by phyla), then repeating the process: clicking on the arthropod link and highlighting all those taxa; clicking on the insects and highlighting all those; clicking on the lepidopterans and highlighting all those. From that list of lists I create a series of lines that all follow the same format, using the bold R package's bold_seqspec command:
```r
Acanthopteroctetidae <- bold_seqspec(taxon = "Acanthopteroctetidae", marker = "COI-5P")
write.table(Acanthopteroctetidae, file = "/home/orourke/R/Acanthopteroctetidae_taxmap_COI.txt", sep = "\t")
rm(Acanthopteroctetidae)

Adelidae <- bold_seqspec(taxon = "Adelidae", marker = "COI-5P")
write.table(Adelidae, file = "/home/orourke/R/Adelidae_taxmap_COI.txt", sep = "\t")
rm(Adelidae)
```

...and repeat for every one of those species in the list of lists...
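The "more computer savvy approach" could just be a loop over a vector of taxon names; a sketch, where `taxa` stands in for the full copy-pasted list (only two names shown; this needs the bold package and network access to BOLD):

```r
# Sketch: loop over taxon names instead of one copy-pasted block per family.
library(bold)
taxa <- c("Acanthopteroctetidae", "Adelidae")  # ...the full list goes here
for (tx in taxa) {
  res <- bold_seqspec(taxon = tx, marker = "COI-5P")
  outfile <- file.path("/home/orourke/R", paste0(tx, "_taxmap_COI.txt"))
  write.table(res, file = outfile, sep = "\t")
  rm(res)
}
```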
You could break this up into the four big bins I've described above, or, as in my case, do it all together, because you want a master file of all sequences anyway.
The approach listed above failed on my first attempt because the BOLD folks have mistakes in a very small number of the hyperlinks that the bold R package uses to grab data from the BOLD servers. Specifically, there are incongruities between certain taxon names as listed in the taxonomy browser and the public-database URL from which you get the sequence info. So the API from which you draw sequences and stats fails in the middle of these individual downloads.
In other words, you'll always get taxonomy info - try clicking on the links directly and they always work. But click that "public data" link at the bottom of that page and the search is broken. This issue isn't confined to this one instance. It extends to many other public data links where the taxonomy browser contains a match for the term. This case plays out for several different organism groups listed below. Here's one example:
Prodidactidae exists in your taxonomy page (here) but does not exist when you click to pull public data from that page. There are just two species entries of which one sequence exists, but again, you can't pull it from the public data link (here).
See the following list for the same effect:
- Ratardidae
- Schistonoeidae
- Somabrachyidae
- Tridentaformidae
- Whalleyanidae
- Zoraptera
- Pentastomida
The solution to the problem was simply to exclude those taxa from the list. I haven't heard back from the BOLD people, but I'm guessing that if it's critical to include those sequences, we could always look at the taxonomy page for each member, grab its GenBank accession number, and enter it manually that way.
Once you exclude those members, the download proceeds as expected and takes less than an hour. Pretty good for millions of entries.
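In code form, that exclusion is just a set difference; `all_taxa` here is a short stand-in for the full copy-pasted list:

```r
# Sketch: drop the taxa whose "public data" links are broken before downloading.
broken <- c("Prodidactidae", "Ratardidae", "Schistonoeidae", "Somabrachyidae",
            "Tridentaformidae", "Whalleyanidae", "Zoraptera", "Pentastomida")
all_taxa <- c("Adelidae", "Ratardidae", "Zoraptera")  # stand-in for the real list
taxa_to_fetch <- setdiff(all_taxa, broken)            # "Adelidae"
```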
Finally, I'm not sure where I've made a mistake, but I was using some bash (awk, sed, paste, etc.) commands to concatenate and manipulate these files. The results have been getting a bit off, and I'm worried it's because the number of fields isn't uniform across these separate files. I can't confirm that yet, but I'm going to try using R commands in lieu of bash and see if the separate files can be merged successfully without issues downstream.
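A sketch of that merge done in R rather than bash, using temp files as stand-ins for the real per-taxon tables (in the real workflow `dir` would be the directory holding the *_taxmap_COI.txt files):

```r
# Sketch: write two tiny per-taxon tables, then read and row-bind them.
dir <- tempdir()
write.table(data.frame(processid = "A1", species_name = "x"),
            file.path(dir, "Adelidae_taxmap_COI.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(processid = "B1", species_name = "y"),
            file.path(dir, "Ratardidae_taxmap_COI.txt"), sep = "\t", row.names = FALSE)
files <- list.files(dir, pattern = "_taxmap_COI\\.txt$", full.names = TRUE)
tabs <- lapply(files, read.delim, sep = "\t", stringsAsFactors = FALSE)
merged <- do.call(rbind, tabs)  # rbind assumes identical columns in every file;
# if the column sets differ, something like dplyr::bind_rows(tabs) is more
# forgiving, padding missing columns with NA instead of erroring.
```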
Holding out hope still - thanks for the input and help as always.
Devon
Two other things that would be potentially useful:
@devonorourke Sorry for the very long delay. This slipped off my radar.
Just pushed that fix for long vectors, see commit above.
When you pull out all these millions of records from BOLD, there are multiple sequences representing the same species. While this is certainly useful in some contexts, I wonder if it would be possible to specify pulling just the longest nucleotide sequence for a given representative species (if named to that taxonomic level). Note that you'd need to count only nucleotide characters, not the gap/alignment character "-".
Possibly. I'll see how fast it can be, and if not fast, I'd leave it to the user to do.
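A user-side sketch of that selection, counting only non-gap characters (toy data; the column names follow the bold_seqspec output, but this is not a bold function):

```r
# Sketch: per species, keep the sequence with the most nucleotide characters,
# ignoring the gap/alignment character "-".
df <- data.frame(
  species_name = c("Aedes aegypti", "Aedes aegypti", "Culex pipiens"),
  nucleotides  = c("ACGT--ACGT", "ACGTACGTACGT", "AC--GT"),
  stringsAsFactors = FALSE
)
df$ntlength <- nchar(gsub("-", "", df$nucleotides, fixed = TRUE))
df <- df[order(-df$ntlength), ]            # longest ungapped sequence first
longest <- df[!duplicated(df$species_name), ]  # one row per species
```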
Related to the point above, I wonder if it's possible to specify not the species ID but rather the BOLD BIN. If we could sort all unique BIN entries and just pull the longest representative sequence from each, it might also help reduce the search space in subsequent OTU calling. Perhaps this sets a dangerous precedent and should be avoided in cases where species-level detection is a must; perhaps a way to avoid this would be for BOLD, as a database, to identify BINs that are uniquely assigned to a single species versus BINs that are grouped with multiple species. Or maybe that's wishful thinking?
There is the parameter bin - have you not seen it? I don't have much to say on what BOLD could do; you could/should ask them.
@devonorourke I assume this comment https://github.com/ropensci/bold/issues/29#issuecomment-220683712 is an email you sent to BOLD?
@devonorourke see fxn bold_filter - I was trying to add filtering by longest or shortest sequence to bold_seqspec, but there's already too much complexity in that fxn (that is, too many ways for it to fail, making it hard to maintain and predict all failure scenarios). Let me know what you think.
Yes, the #29 comment https://github.com/ropensci/bold/issues/29#issuecomment-220683712 is an email I sent to BOLD. Their response was less than inspiring: they basically stated that it wasn't a problem and that I was doing something wrong.
Devon O'Rourke
Graduate student in Molecular and Evolutionary Systems Biology
University of New Hampshire
Lab of Dr. Jeff Foster
fozlab.weebly.com/guano
I ended up making a 'full-BOLD' and a 'singlerep-BOLD' dataset with the following (note this likely has its own complications, but as a general start it was helpful in getting a species-only database). Note that the first file, "allinfo_arthropodonly.txt", is what I pulled using your bold R package's bold_seqspec command.
```r
#####################################################################################################
df_allinfo_arthropodonly <- read.delim(file = "/pathto/allinfo_arthropodonly.txt",
                                       sep = "\t", na.strings = " ",
                                       stringsAsFactors = FALSE)
df_allinfo_arthropodonly[is.na(df_allinfo_arthropodonly)] <- "UNDEFINED"
df_allinfo_arthropodonly$speciesname <- gsub(" ", "", df_allinfo_arthropodonly$species_name)

library(stringr)
# new column in df
df_allinfo_arthropodonly$gaplength <- str_count(df_allinfo_arthropodonly$nucleotides, "-")
df_allinfo_arthropodonly <- df_allinfo_arthropodonly[order(df_allinfo_arthropodonly$gaplength), ]
# with preference to smallest 'gaplength'
df_singlerep_arthonly <- df_allinfo_arthropodonly[!duplicated(df_allinfo_arthropodonly[c("bin_uri", "species_name")]), ]
#####################################################################################################
```
I would use that data frame to create the subsequent taxonomy file and fasta files of interest.
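A sketch of that fasta-writing step (toy data; the column names follow the bold_seqspec output, and the header format chosen here is an assumption, not part of the original workflow):

```r
# Sketch: write a minimal fasta from a data.frame with processid,
# species_name, and nucleotides columns.
df <- data.frame(
  processid    = c("AAA001", "BBB002"),
  species_name = c("Aedes aegypti", "Culex pipiens"),
  nucleotides  = c("ACGTACGT", "TTGGCCAA"),
  stringsAsFactors = FALSE
)
fasta <- paste0(">", df$processid, " ", df$species_name, "\n", df$nucleotides)
writeLines(fasta, file.path(tempdir(), "singlerep.fasta"))
```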
Sorry they weren't more helpful.
So the code in your comment above is what you'd imagined bold_filter would do?
I'm very excited to use the bold R package but have been running into trouble when trying to pull a large dataset together using the bold_seqspec function. Specifically, I'm trying to grab all COI-5P sequences from BOLD's database. That's a lot of data. I started by installing the package and ran the following script successfully:
This creates a 53-column, 58-line text file. I've performed this task both in RStudio and from the command line (running Linux 3.13.0-85-generic, R version 3.3.0) successfully for the above script.
However, I wanted to be able to modify the script above to include the biggest group in one chunk - Arthropods - by running these commands:
Unfortunately I get an error message:
I was under the impression that relatively recent releases of R support long vectors. Perhaps more to the point, I didn't think this was a particularly long vector in terms of columns, but perhaps it exceeds that 900,000 limit in terms of rows (there are certainly more than 900,000 rows in this dataset).
To that end, perhaps you can speak to the maximum number of entries that may be downloaded at once by these scripts (should one exist).
Thanks very much