devonorourke commented 8 years ago

I'm very excited to use the bold R-package but have been running into trouble when trying to pull a large dataset together using the bold_seqspec script. Specifically, I'm trying to grab all COI-5P sequences from BOLD's database. That's a lot of data.

I started by installing the package and ran the following script successfully:

install.packages("bold")
library(bold)
df_tiny <- bold_seqspec(taxon="Echiura", marker="COI-5P")
write.table(df_tiny, file = "/home/R/euthria_bold_seqspec.txt", sep = "\t")

This creates a 53-column, 58-line text file. I've performed this task in both R-studio and from the command line (running Linux 3.13.0-85-generic, R version 3.3.0) successfully for the above script.

However, I wanted to be able to modify the script above to include the biggest group in one chunk - Arthropods - by running these commands:

df_large <- bold_seqspec(taxon="Arthropoda", marker="COI-5P")
write.table(df_large, file = "/home/R/arthropoda_bold_seqspec.txt", sep = "\t").

Unfortunately I get an error message:

Error in rawToChar(content(out, encoding = "UTF-8")) : 
  long vectors not supported yet: raw.c:68

I was under the impression that the relatively recent releases of R have enabled long vectors to be supported. Perhaps more to the point, I wasn't thinking that this was a particularly long vector in terms of columns, but perhaps it does exceed that 900,000 limit in terms of rows (there certainly are more than 900,000 rows in this dataset).

To that end, perhaps you can speak to the maximum number of entries that may be downloaded at once by these scripts (should one exist).

Thanks very much

sckott commented 8 years ago

What R version do you have, and what version of deps. Can you just paste in what you get from devtools::session_info() or sessionInfo()

sckott commented 8 years ago

Think I found solution, trying it now...

sckott commented 8 years ago

Looks like there are two things going on here:

The data you request for the search Arthropoda gives ~3.8 million records, and that data takes a long time to download. So just for your own sanity, it might be better to break that up into groups as you describe din your email to me
The other is that very long strings were not supported by the rawToChar() function. However, with multiple = TRUE for that function call, it can work - which gives each character as a separate element of a vector - then we can paste0(..., collapse = "") them together - I'll push this change soon. However, in the above point, you'd probably want to do smaller data requests anyway, so that you aren't waiting for a massive request to finish (in which time internet can go out, or you have to go home, or something) - AND massive requests are more likely to cause problems for the data provider

sckott commented 8 years ago

I'll see if there's anything I can do to warn users about data size limits...

devonorourke commented 8 years ago

Hi BOLD folks,

There must be a more computer savvy approach to this, but I've essentially split up the work of downloading all COI-5P sequences into four bins in attempts to keep things under 1 million rows per file: -- download all animal sequences except arthropods -- download all arthropod sequences except insects -- download all insect sequences except lepidopterans -- download all the damn moths.

The non-computer savvy way I get it to work is by going to BOLD's taxonomy page and copying and pasting all the animal taxa listed (by phyla). Then repeat by clicking on the arthropod link, highlighting all those; clicking on the insects and highlighting all those; clicking on the lepidopterans and clicking all those. From that list of lists I create a series of lines that follow the same format whereby using the bold R.package command:

Acanthopteroctetidae <- bold_seqspec(taxon="Acanthopteroctetidae", marker="COI-5P") write.table(Acanthopteroctetidae, file = "/home/orourke/R/Acanthopteroctetidae_taxmap_COI.txt", sep = "\t") rm(Acanthopteroctetidae) Adelidae <- bold_seqspec(taxon="Adelidae", marker="COI-5P") write.table(Adelidae, file = "/home/orourke/R/Adelidae_taxmap_COI.txt", sep = "\t") rm(Adelidae) ...and repeat for every one of those species in the list of lists...

You could break this up by the four big bins I've describe above, or for my purposes you can do it all together because you want a master file of all sequences anyway.

The approach listed above failed on my first attempt because the BOLD folks have mistakes in a very small number of the hyperlinks that the bold R-package uses to grab from the BOLD servers. Specifically, there are incongruities between certain taxa level and the url listing that taxa name/description and the public database url where you get the sequence info from. So the API in which you draw sequence and stats fails in the middle of these individual downloads.

In other words, you'll always get taxonomy info - try clicking on the links directly and they always work. But click that "public data" link at the bottom of that page and the search is broken. This issue isn't confined to this one instance. It extends to many other public data links where the taxonomy browser contains a match for the term. This case plays out for several different organism groups listed below. Here's one example:

Prodidactidae exists in your taxonomy page (here) but does not exist when you click to pull public data from that page. There are just two species entries of which one sequence exists, but again, you can't pull it from the public data link (here).

See the following list for the same effect:

Ratardidae Schistonoeidae Somabrachyidae Tridentaformidae Whalleyanidae Zoraptera Pentastomida

The solution to the problem was simply to exclude those individuals from the list. I haven't heard back from the BOLD people, but I'm guessing that if it's critical to include those sequences we could always look at the taxonomy page for each member, grab it's Genbank accession number, and enter it manually that way.

Once you exclude those members, the download proceeds as expected and takes less than an hour. Pretty good for millions of entries.

Finally, I'm not sure where I've made a mistake, but I was using some bash (awk, sed, paste, etc.) commands to concatenate and manipulate these files. Things have been getting a bit off and I'm worried it's because there is non-uniformity in the number of fields present within each of these separate files. Can't confirm yet if it's true but am going to try using some R commands in lieu of bash commands and see if the separate files can be merged successfully without issues downstream.

Holding out hope still - thanks for the input and help as always.

Devon

devonorourke commented 8 years ago

Two other things that would be potentially useful:

When you pull out all these millions of records from BOLD, there are multiple sequences representing the same species. While this is certainly useful in some contexts, I wonder if it would be possible to specify to pull just the longest nucleotide sequence for a given representative species (if named to that taxa level). Note that you'd need to count only nucleotide characters not the gap/alignment character "-".
Related to the point above, I wonder if it's possible to specify not the species ID, but rather the BOLD BIN. If we could sort all unique BIN entries and just pull the longest representative sequence from there it might also be helpful in reducing the search space in subsequent OTU calling. Perhaps this sets a dangerous precedent and should be avoided in cases where species-level detection is a must; perhaps a way to avoid this would for BOLD as a database to identify BINs that are uniquely assigned to a single species and BINs that are grouped with multiple species. Or maybe that's wishful thinking?

sckott commented 7 years ago

@devonorourke Sorry for the very long delay. This slipped off my radar.

Just pushed that fix for long vectors, see commit above.

When you pull out all these millions of records from BOLD, there are multiple sequences representing the same species. While this is certainly useful in some contexts, I wonder if it would be possible to specify to pull just the longest nucleotide sequence for a given representative species (if named to that taxa level). Note that you'd need to count only nucleotide characters not the gap/alignment character "-".

Possibly, I'll see how fast it can be, and if not fast, would leave to user to do

Related to the point above, I wonder if it's possible to specify not the species ID, but rather the BOLD BIN. If we could sort all unique BIN entries and just pull the longest representative sequence from there it might also be helpful in reducing the search space in subsequent OTU calling. Perhaps this sets a dangerous precedent and should be avoided in cases where species-level detection is a must; perhaps a way to avoid this would for BOLD as a database to identify BINs that are uniquely assigned to a single species and BINs that are grouped with multiple species. Or maybe that's wishful thinking?

there is the parameter bin - have you not seen it? I don't have much to say on what BOLD could do. You could/should ask them

sckott commented 7 years ago

@devonorourke I assume this comment https://github.com/ropensci/bold/issues/29#issuecomment-220683712 is an email you sent to BOLD?

sckott commented 7 years ago

@devonorourke see fxn bold_filter - was trying to add filtering by longest or shortest sequence to bold_seqspec, but there's already too much complexity in that fxn (that is, too many ways for it to fail, making it hard to maintain and predict all failure scenarios) Let me know what you think

devonorourke commented 7 years ago

Yes to #29 comment https://github.com/ropensci/bold/issues/29#issuecomment-220683712 . Their response was less than inspiring. Basically stated that it wasn't a problem and I was doing something wrong.

On Fri, Sep 30, 2016 at 5:14 PM, Scott Chamberlain <notifications@github.com

wrote:

@devonorourke https://github.com/devonorourke I assume this comment #29 (comment) https://github.com/ropensci/bold/issues/29#issuecomment-220683712 is an email you sent to BOLD?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/bold/issues/29#issuecomment-250853896, or mute the thread https://github.com/notifications/unsubscribe-auth/AKqgXJTLQy3YcAzN_0c9llErespax8inks5qvXuogaJpZM4Ig1Oj .

Devon O'Rourke Graduate student in Molecular and Evolutionary Systems Biology University of New Hampshire Lab of Dr. Jeff Foster fozlab.weebly.com/guano

devonorourke commented 7 years ago

I ended up making a 'full-BOLD' and 'singlerep-BOLD' dataset by the following (note this likely has it's own complications but as a general start was helpful in getting a species-only database). Note that the first file, "allinfo_arthropodonly.txt" is what I would pull using your 'bold' R package's 'seqspec' command.

##################################################################################### #################

import the "allinfo_arthropodonly.txt" file

df_allinfo_arthropodonly <- read.delim(file = "/pathto/allinfo_arthropodonly.txt", sep = "\t", na.strings = " ", stringsAsFactors = FALSE)

replace all 'NA' values with "UNDEFINED"

df_allinfo_arthropodonly[is.na(df_allinfo_arthropodonly)] <- "UNDEFINED"

replace all the whitespace in the 'species_name' field with underscores

df_allinfo_arthropodonly$speciesname <- (gsub(" ", "", df_allinfo_arthropodonly$species_name))

load package 'stringr' to count number of gaps in each nucleotide string

library(stringr)

Count the number of '-' character in each element of string and print as

new column in df df_allinfo_arthropodonly$gaplength <- str_count(df_allinfo_arthropodonly$nucleotides, "-")

sort the df by the gaplength field:

df_allinfo_arthropodonly <- df_allinfo_arthropodonly[order(df_allinfo_arthropodonly$gaplength),]

subset the df by removing duplicates (by 'bin_uri' AND 'species_name')

with preference to smallest 'gaplength' df_singlerep_arthonly <- df_allinfo_arthropodonly[!duplicated(df_allinfo_arthropodonly[c("bin_uri", "species_name")]),]

######################################################################################################

I would use that data frame to create the subsequent taxonomy file and fasta files of interest.

On Fri, Sep 30, 2016 at 5:58 PM, Scott Chamberlain <notifications@github.com

wrote:

@devonorourke https://github.com/devonorourke see fxn bold_filter - was trying to add filtering by longest or shortest sequence to bold_seqspec, but there's already too much complexity in that fxn (that is, too many ways for it to fail, making it hard to maintain and predict all failure scenarios) Let me know what you think

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/bold/issues/29#issuecomment-250862207, or mute the thread https://github.com/notifications/unsubscribe-auth/AKqgXCgyLhmLsnii3rPaq7lSs2OUbjL3ks5qvYYQgaJpZM4Ig1Oj .

Devon O'Rourke Graduate student in Molecular and Evolutionary Systems Biology University of New Hampshire Lab of Dr. Jeff Foster fozlab.weebly.com/guano

sckott commented 7 years ago

Sorry that weren't more helpful.

So the code in your comment above is what you'd imagined the bold_filter would do?

ropensci / bold

long vectors not supported yet? #29

import the "allinfo_arthropodonly.txt" file

replace all 'NA' values with "UNDEFINED"

replace all the whitespace in the 'species_name' field with underscores

load package 'stringr' to count number of gaps in each nucleotide string

Count the number of '-' character in each element of string and print as

sort the df by the gaplength field:

subset the df by removing duplicates (by 'bin_uri' AND 'species_name')