Closed devonorourke closed 5 years ago
one thing you could try is async requests, but i imagine they don't have amazing server resources, so it may be too much for their system
Thanks Scott,
There must be a way to break up the data.frame into a list of smaller data.frames, then iteratively request records, right?
Or perhaps that list of data.frames can't be directly accessed with the bold_identify command?
Maybe I need to break up the fasta into smaller chunks directly and run the R script as part of a loop.
Do you have a sense of what the largest number of fasta records you've collected with bold_identify is?
Thanks for the consideration
install from remotes::install_github("ropensci/bold@async"), reload R, and try e.g.,
library(tidyverse)
library(bold)
df <- read_delim(file = "https://github.com/devonorourke/tidybug/raw/master/data/qiime/dada2.arthASVs.txt.gz",
delim = "\t", col_names = FALSE)
colnames(df) <- c("id", "seq")
system.time(out_async2 <- bold_identify_async(df$seq[1:60], db = "COX1"))
#> user system elapsed
#> 10.842 0.219 92.336
so 1.5 min for 60 sequences.
it's entirely possible if you throw 13K sequences using async requests at their servers they may not be able to handle that, so i'd still break up the sequences into chunks, e.g., you could do 100 at a time
i may or may not keep this async function
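A minimal sketch of the "100 at a time" idea, in base R. The helper names chunk_sequences and identify_in_chunks are hypothetical (they are not part of bold), and the lookup itself is passed in as a function so the batching logic can be tried without hitting the BOLD servers. The pause between batches is an assumption about what might be polite to their system, not a documented limit.

```r
# Split a character vector of sequences into batches of at most `chunk_size`.
chunk_sequences <- function(seqs, chunk_size = 100) {
  stopifnot(chunk_size >= 1)
  if (length(seqs) == 0) return(list())
  # unname() drops the "1", "2", ... batch labels that split() adds
  unname(split(seqs, ceiling(seq_along(seqs) / chunk_size)))
}

# Run `identify_fn` on each batch in turn, sleeping `pause` seconds between
# batches. `identify_fn` is assumed to return a list with one element per
# input sequence.
identify_in_chunks <- function(seqs, identify_fn, chunk_size = 100, pause = 1) {
  chunks <- chunk_sequences(seqs, chunk_size)
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[i] <- list(identify_fn(chunks[[i]]))
    if (i < length(chunks)) Sys.sleep(pause)
  }
  # Flatten the per-batch lists back into one list, in the original order
  do.call(c, results)
}

# Possible usage with the real lookup (requires the bold package and network):
# out <- identify_in_chunks(df$seq, function(x) bold::bold_identify(x, db = "COX1"))
```

Saving each batch's result to disk inside the loop (e.g. with saveRDS) would also make it possible to resume after a failure, at the cost of a little extra bookkeeping.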
Thanks Scott, Giving the new script a shot now.
Update for you @sckott - I haven't figured out why yet, but the issue apparently isn't with the batch size of query sequences being submitted. It keeps stalling at a single fasta record: number 1191.
For example, if you try:
out_list_nope1 <- bold_identify_async(df$seq[1190:1192], db = "COX1")
That won't work, but you can generate data frames on either side of the 1191th row:
out_list_yep1 <- bold_identify_async(df$seq[1189:1190], db = "COX1")
out_list_yep2 <- bold_identify_async(df$seq[1192:1193], db = "COX1")
And sure enough, if you try just that single bad record, you get the same error message as the very start of this thread:
out_list_nope2 <- bold_identify_async(df$seq[1191:1191], db = "COX1")
which throws this error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
EntityRef: expecting ';' [23]
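The narrowing-down done by hand above can be sketched as a small helper that tries each record on its own and collects the ones that error. find_failing_records is a hypothetical name, and the lookup is passed in as a function, so this is only an illustration of the approach, not something bold provides. It makes one request per sequence, so it is best pointed at a suspect batch rather than the whole dataset.

```r
# Try each sequence individually and report which ones raise an error.
# Returns a data.frame with the position of each failing record and the
# error message it produced.
find_failing_records <- function(seqs, identify_fn) {
  msgs <- vapply(seq_along(seqs), function(i) {
    tryCatch({
      identify_fn(seqs[i])
      NA_character_                       # NA marks a successful lookup
    }, error = function(e) conditionMessage(e))
  }, character(1))
  bad <- which(!is.na(msgs))
  data.frame(index = bad, message = msgs[bad], stringsAsFactors = FALSE)
}

# Possible usage on the suspect range (requires the bold package and network):
# find_failing_records(df$seq[1185:1195],
#                      function(x) bold::bold_identify(x, db = "COX1"))
```

Note that the reported index is relative to the vector passed in, so for a slice like 1185:1195 an index of 7 would correspond to row 1191.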
What exactly is that error message indicating? What I have read about online suggests that it's indicating that something in the XML parsing is going wrong, and that a ; character might need to be escaped somewhere in the code.
I looked at the record that's breaking things, and there is nothing strange about the reference sequence (just ATCG characters, no white space, no weird delimiters or non-ATCG alphabet... perfectly normal). There isn't anything strange in the sequence ID either, but it's not like the bold_identify function cares about that, right? Here's the sequence:
>prob
AACCCTATACTTTTTATTTGGAATTTGAGCGGGTATAGTAGGTACTAGCTTAAGTATATTAATTCGTCTAGAGCTAGGACAACCCGGTGTATTTTTAGAAGATGACCAAACCTATAACGTTATTGTAACAGCCCACGCTTTTATTATAATTTTCTTCATAATTATACCAATCATAATTGGA
I pasted that sequence in BOLD's online identification page and it returned a typical list of matches which makes me further think that there isn't a problem on my end in generating this error (but I could be (and usually am) wrong 😄!)...
Is there any possibility that the species names being returned in that list aren't being properly parsed, and that's what's throwing the error? For instance, the species names in that list include things like:
sp. 2KJ&EM
sp. 2KJEM
Could it be that the & symbol isn't being escaped in the bold_identify script and it's what's breaking the program?
Thanks again for your continued support
thanks for the info. looks like it's down to those ampersands. if we just replace any & with &amp; instead, we don't get that error anymore. reinstall, reload then try again. let me know if it works
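For illustration, here is a rough sketch of what escaping a bare & as the XML entity &amp; can look like in base R. This is not the code from the package; escape_bare_ampersands is a hypothetical helper, and unlike a plain replace-all it skips ampersands that already start an entity, which is an extra assumption about what the response text might contain.

```r
# Replace any "&" that does not already begin an XML entity with "&amp;".
# The negative lookahead (perl = TRUE) leaves &amp; &lt; &gt; &quot; &apos;
# and numeric references such as &#38; or &#x26; untouched.
escape_bare_ampersands <- function(x) {
  gsub("&(?!(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;", x,
       perl = TRUE)
}

escape_bare_ampersands("sp. 2KJ&EM")
#> [1] "sp. 2KJ&amp;EM"

# The escaped text can then be handed to an XML parser, e.g. (needs xml2):
# xml2::read_xml(escape_bare_ampersands("<name>sp. 2KJ&EM</name>"))
```

An unescaped & makes the parser expect an entity name followed by ";", which is what the "EntityRef: expecting ';'" message is complaining about.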
Great, thanks @sckott .
To reinstall, do I use the remotes::install_github("ropensci/bold@async") command you suggested before? I tried doing that but it still threw the same error as before.
Thanks for the help!
i added that fix https://github.com/ropensci/bold/commit/9f7000171fabebdab20a41c9a85087a95dbf0996 to master branch, so just ropensci/bold without the async
okay, that fix is on async branch, so you can install from async to get both the fix and async
got it - thanks very much. It looks like the update to the master branch worked perfectly. When I reinstalled bold like this:
remotes::install_github("ropensci/bold")
I can see the following output:
Downloading GitHub repo ropensci/bold@master
Running `R CMD build`...
* checking for file ‘/tmp/Rtmpk2NGfC/remotes1e2e97767ceee/ropensci-bold-9f70001/DESCRIPTION’ ... OK
...
* building ‘bold_0.8.6.9139.tar.gz’
I can execute this command and avoid the typical error (it completes properly):
out_list1 <- bold_identify(df$seq[1190:1192], db = "COX1")
However, if I try installing the async branch...
remotes::install_github("ropensci/bold@async")
Downloading GitHub repo ropensci/bold@async
* checking for file ‘/tmp/RtmpcMys3b/remotes1ee1948e6731d/ropensci-bold-3c6f95d/DESCRIPTION’ ... OK
...
* building ‘bold_0.8.6.9139.tar.gz’
... it seems like things fail:
out_list1 <- bold_identify_async(df$seq[1190:1192], db = "COX1")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
EntityRef: expecting ';' [23]
So it seems like the master branch version worked, but the alternate branch didn't? Not sure what I am doing wrong. Thanks for all your help!
did you restart R after installing from the async branch? that should have worked
Yep; quit the R session and restarted.
I'll give it another shot.
Multiple attempts at installing and restarting R, and I can't get the @async branch to cooperate. Tried both on local machine and on compute cluster without any success - the master branch is working fine though.
One related thought: You already addressed a problem here I was having... when the user starts with a data.frame and uses bold_identify, the output doesn't include the query names as an output variable and they aren't part of the row.names. Your solution of having the user convert their data.frame to a named list worked for me just fine, but I think it would be very helpful (if possible!) for the default bold_identify function to include both the query and the match as distinct fields in the output.
Thanks as always!
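A small sketch of the named-list workaround mentioned above, plus one way to get the query name into the output as its own column. Both helper names (as_named_sequences, bind_with_query) are hypothetical, and the second one assumes the results come back as a list named after the queries, with one data.frame of matches per query (or NULL when there is no match).

```r
# Turn a two-column data.frame (id, seq) into a named list of sequences.
as_named_sequences <- function(df, id_col = "id", seq_col = "seq") {
  stats::setNames(as.list(df[[seq_col]]), df[[id_col]])
}

# Stack a named list of per-query data.frames into one data.frame,
# adding a `query` column taken from the list names.
bind_with_query <- function(results) {
  # Drop queries with no matches (NULL or zero-row results)
  keep <- vapply(results, function(d) is.data.frame(d) && nrow(d) > 0, logical(1))
  results <- results[keep]
  if (length(results) == 0) return(data.frame())
  rows <- Map(function(d, nm) {
    data.frame(query = nm, d, stringsAsFactors = FALSE, check.names = FALSE)
  }, results, names(results))
  out <- do.call(rbind, unname(rows))
  rownames(out) <- NULL
  out
}

# Possible usage (requires the bold package and network):
# hits <- bold::bold_identify(as_named_sequences(df), db = "COX1")
# flat <- bind_with_query(hits)
```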
fixed issue, didn't have ampersand replacement in the async specific parser
Hi Scott, Not sure what to make of this error:
Generated from these data:
For what it's worth, I can get a small sample of the same dataset to return values:
Does it make sense to try something iteratively when you have > 10,000 sequences like I do?
Thanks