issue with `bold_identify`

devonorourke commented 5 years ago

Hi Scott, Not sure what to make of this error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  EntityRef: expecting ';' [23]

Generated from these data:

library(tidyverse)
library(bold)

## import data
df <- read_delim(file = "https://github.com/devonorourke/tidybug/raw/master/data/qiime/dada2.arthASVs.txt.gz", delim = "\t", col_names = FALSE)
colnames(df) <- c("id", "seq")

## collect taxonomy information
out_list <- bold_identify(df$seq, db = "COX1", response = FALSE)

For what it's worth, I can get a small sample of the same dataset to return values:

## This returns a list of successfully aligned sequences
tiny.df <- head(df,10)
tiny.out_list <- bold_identify(tiny.df$seq, db = "COX1", response = FALSE)

Does it make sense to try something iteratively when you have > 10,000 sequences like I do?

Thanks

sckott commented 5 years ago

one thing you could try is async requests, but i imagine they don't have amazing server resources, so it may be too much for their system

devonorourke commented 5 years ago

Thanks Scott, There must be a way to break up the data.frame into a list of smaller data.frames, then iteratively request records, right? Or perhaps that list of data.frames can't be directly accessed with the bold_identify command? Maybe I need to break up the fasta into smaller chunks directly and run the R script as part of a loop.

Do you have a sense of what the largest number of fasta records you've collected with bold_identify is?

Thanks for the consideration

sckott commented 5 years ago

install from remotes::install_github("ropensci/bold@async"), reload R, and try e.g.,

library(tidyverse)
library(bold)
df <- read_delim(file = "https://github.com/devonorourke/tidybug/raw/master/data/qiime/dada2.arthASVs.txt.gz",
    delim = "\t", col_names = FALSE)
colnames(df) <- c("id", "seq")
system.time(out_async2 <- bold_identify_async(df$seq[1:60], db = "COX1"))
 #>   user  system elapsed
 #>  10.842   0.219  92.336

so 1.5 min for 60 sequences.

it's entirely possible that if you throw 13K sequences at their servers using async requests they may not be able to handle that, so i'd still break up the sequences into chunks, e.g., you could do 100 at a time
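The chunking suggested above could be sketched roughly as follows. This is only an illustration: `identify_in_chunks`, `chunk_size`, `pause`, and `identify_fn` are hypothetical names that do not come from the bold package, and the sketch assumes the identify function returns a list with one element per query sequence.

```r
# Sketch only: split sequences into fixed-size batches and query each in turn.
# `identify_in_chunks`, `chunk_size`, and `pause` are hypothetical names;
# `identify_fn` defaults to bold::bold_identify, but any function that takes
# a character vector and returns a list (one element per sequence) will do.
identify_in_chunks <- function(seqs, chunk_size = 100, pause = 1,
                               identify_fn = bold::bold_identify, ...) {
  # Group indices 1..n into consecutive batches of at most `chunk_size`
  batches <- split(seq_along(seqs), ceiling(seq_along(seqs) / chunk_size))
  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    results[[i]] <- identify_fn(seqs[batches[[i]]], ...)
    # Wait between batches so the server is not hit continuously
    if (i < length(batches)) Sys.sleep(pause)
  }
  # Flatten the per-batch lists into one list, in the original order
  do.call(c, results)
}
```

With the data from this thread, a call might look like `out_list <- identify_in_chunks(df$seq, chunk_size = 100, db = "COX1")`; extra arguments such as `db` are passed through to the identify function.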

i may or may not keep this async function

devonorourke commented 5 years ago

Thanks Scott, Giving the new script a shot now.

devonorourke commented 5 years ago

Update for you @sckott - I haven't figured out why yet, but the issue apparently isn't with the batch size of query sequences being submitted. It keeps stalling at a single fasta record: number 1191.

For example, if you try:

out_list_nope1 <- bold_identify_async(df$seq[1190:1192], db = "COX1")

That won't work, but you can generate data frames on either side of the 1191th row:

out_list_yep1 <- bold_identify_async(df$seq[1189:1190], db = "COX1")
out_list_yep2 <- bold_identify_async(df$seq[1192:1193], db = "COX1")

And sure enough, if you try just that single bad record, you get the same error message as the very start of this thread:

out_list_nope2 <- bold_identify_async(df$seq[1191:1191], db = "COX1")

which throws this error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  EntityRef: expecting ';' [23]
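The manual search for the bad record above could also be automated. A minimal sketch, assuming each sequence can be queried on its own: `find_failing` is a hypothetical helper, and `identify_fn` stands in for `bold_identify` or the async variant.

```r
# Sketch only: run each sequence on its own and report which ones error.
# `find_failing` is a hypothetical helper, not part of the bold package.
find_failing <- function(seqs, identify_fn = bold::bold_identify, ...) {
  failed <- vapply(seqs, function(s) {
    # Capture the error object instead of letting it stop the loop
    res <- tryCatch(identify_fn(s, ...), error = function(e) e)
    inherits(res, "error")
  }, logical(1), USE.NAMES = FALSE)
  which(failed)  # positions of the sequences that raised an error
}
```

Because this sends one request per sequence, it is probably best applied only to a batch that is already known to fail, e.g. `find_failing(df$seq[1100:1200], db = "COX1")`, which returns positions relative to that subset.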

What exactly is that error message indicating? From what I have read online, it suggests that something is going wrong in the XML parsing, and that a ; character might need to be escaped somewhere in the code.

I looked at the record that's breaking things, and there is nothing strange about the reference sequence (just ATCG characters, no white space, no weird delimiters or non-ATCG alphabet... perfectly normal). There isn't anything strange in the sequence ID either, but it's not like the bold_identify function cares about that, right? Here's the sequence:

>prob
AACCCTATACTTTTTATTTGGAATTTGAGCGGGTATAGTAGGTACTAGCTTAAGTATATTAATTCGTCTAGAGCTAGGACAACCCGGTGTATTTTTAGAAGATGACCAAACCTATAACGTTATTGTAACAGCCCACGCTTTTATTATAATTTTCTTCATAATTATACCAATCATAATTGGA

I pasted that sequence in BOLD's online identification page and it returned a typical list of matches which makes me further think that there isn't a problem on my end in generating this error (but I could be (and usually am) wrong 😄!)...

Is there any possibility that the species names being returned in that list aren't being properly parsed, and that's what's throwing the error? For instance, the species names in that list include things like:

sp. 2KJ&EM
sp. 2KJEM

Could it be that the & symbol isn't being escaped in the bold_identify script and it's what's breaking the program?

Thanks again for your continued support

sckott commented 5 years ago

thanks for the info. looks like it's down to those ampersands. if we just replace any & with &amp; instead, we don't get that error anymore. reinstall, reload, then try again. let me know if it works
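For readers wondering why a bare ampersand breaks parsing: in XML, `&` starts an entity reference that must end in `;`, so a taxon name like `sp. 2KJ&EM` is invalid until the `&` is written as `&amp;`. The sketch below is not the code from the actual commit; `escape_bare_ampersands` is a hypothetical helper, and it is a slightly more defensive variant that leaves already-escaped entities untouched.

```r
# Sketch only: escape ampersands that are not already part of an entity.
# `escape_bare_ampersands` is a hypothetical helper, not a function from bold.
escape_bare_ampersands <- function(x) {
  # The negative lookahead leaves existing entities (&amp;, &#38;, ...) alone
  gsub("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)", "&amp;", x,
       perl = TRUE)
}

# Made-up XML fragment using one of the taxon names from this thread
raw_xml <- "<match><taxon>sp. 2KJ&EM</taxon></match>"
fixed_xml <- escape_bare_ampersands(raw_xml)
fixed_xml
#> [1] "<match><taxon>sp. 2KJ&amp;EM</taxon></match>"
```

After this step, a parser such as `xml2::read_xml(fixed_xml)` should accept the string, whereas the unescaped version would be expected to fail with an `EntityRef` error like the one reported here.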

devonorourke commented 5 years ago

Great, thanks @sckott .

To reinstall, do I use the remotes::install_github("ropensci/bold@async") command you suggested before? I tried doing that but it still threw the same error as before.

Thanks for the help!

sckott commented 5 years ago

i added that fix https://github.com/ropensci/bold/commit/9f7000171fabebdab20a41c9a85087a95dbf0996 to master branch, so just ropensci/bold without the async

sckott commented 5 years ago

okay, that fix is on async branch, so you can install from async to get both the fix and async

devonorourke commented 5 years ago

Got it, thanks very much. It looks like the update to the master branch worked perfectly. When I reinstalled bold like this:

remotes::install_github("ropensci/bold")

I can see the following output:

Downloading GitHub repo ropensci/bold@master
Running `R CMD build`...
* checking for file ‘/tmp/Rtmpk2NGfC/remotes1e2e97767ceee/ropensci-bold-9f70001/DESCRIPTION’ ... OK
...
* building ‘bold_0.8.6.9139.tar.gz’

I can execute this command and avoid the typical error (it completes properly):

out_list1 <- bold_identify(df$seq[1190:1192], db = "COX1")

However, if I try installing the async branch...

remotes::install_github("ropensci/bold@async")
Downloading GitHub repo ropensci/bold@async
* checking for file ‘/tmp/RtmpcMys3b/remotes1ee1948e6731d/ropensci-bold-3c6f95d/DESCRIPTION’ ... OK
...
* building ‘bold_0.8.6.9139.tar.gz’

... it seems like things fail:

out_list1 <- bold_identify_async(df$seq[1190:1192], db = "COX1")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  EntityRef: expecting ';' [23]

So it seems like the master branch version worked, but the alternate branch didn't? Not sure what I am doing wrong. Thanks for all your help!

sckott commented 5 years ago

did you restart R after installing from the async branch? that should have worked

devonorourke commented 5 years ago

Yep; quit the R session and restarted.

I'll give it another shot.

devonorourke commented 5 years ago

After multiple attempts at installing and restarting R, I still can't get the @async branch to cooperate. I tried both on my local machine and on a compute cluster without any success; the master branch is working fine, though.

One related thought: You already addressed a problem here I was having... when the user starts with a data.frame and uses bold_identify, the output doesn't include the query names as an output variable and they aren't part of the row.names. Your solution of having the user convert their data.frame to a named list worked for me just fine, but I think it would be very helpful (if possible!) for the default bold_identify function to include both the query and the match as distinct fields in the output.
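The named-list workaround mentioned above might look something like this. It is a sketch under assumptions: the column names `id` and `seq` follow the earlier code in this thread, and `as_named_sequences` is a hypothetical helper name.

```r
# Sketch only: turn an id/seq data.frame into a named list of sequences.
# Column names `id` and `seq` follow the thread; the helper name is hypothetical.
as_named_sequences <- function(df) {
  stats::setNames(as.list(df$seq), df$id)
}

# Toy data standing in for the real table
toy <- data.frame(id = c("asv1", "asv2"), seq = c("ACGT", "TTGA"),
                  stringsAsFactors = FALSE)
named_seqs <- as_named_sequences(toy)
names(named_seqs)
#> [1] "asv1" "asv2"
```

Passing the result to the query, e.g. `bold_identify(as_named_sequences(df), db = "COX1")`, is what the thread reports as carrying the query names through to the output list; that behaviour is taken from the discussion here rather than verified independently.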

Thanks as always!

sckott commented 5 years ago

opened https://github.com/ropensci/bold/issues/63

sckott commented 5 years ago

fixed issue, didn't have ampersand replacement in the async specific parser

issue with `bold_identify` #62