Error: cannot open connection with FindLongestSeq with tens of thousands of accessions

sborstein / AnnotationBustR

Error: cannot open connection with FindLongestSeq with tens of thousands of accessions #17

Closed markscherz closed 4 years ago

markscherz commented 5 years ago

I am hitting a wall with FindLongestSeq when the number of accessions is in the tens of thousands (possibly hundreds for a single species, though I don't know the exact figure).

Error in file(file, "r") : cannot open the connection to 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=c("MK720938&rettype=fasta&retmode=text'
In addition: Warning message:
In file(file, "r") : cannot open URL 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=c("MK720938&rettype=fasta&retmode=text': HTTP status was '400 Bad Request'

I guess NCBI is throttling access to the function at a certain number of requests per second. Any suggestions for how to overcome this? Also, this package seems great 👍

sborstein commented 5 years ago

Hi @markscherz,

Can you send me the accessions and the code you are running so I can try to replicate and troubleshoot it? It is possible that NCBI is throttling, although I have run this function on thousands of sequences previously.

Best, -Sam

markscherz commented 5 years ago

Hey Sam,

library(AnnotationBustR)

library(reutils)

search <- esearch(term="txid8292[Organism:exp] 16S[title]",db = 'nuccore', usehistory = TRUE)

accessions <- efetch(search, rettype = "acc",retmode = "text",outfile = "amphibians16S.txt") #this returns about 56,000 accession numbers

accessions <- read.table("amphibians16S.txt",header = FALSE,stringsAsFactors = FALSE)

longphibs <- FindLongestSeq(accessions)

Here's the amphibians16S.txt file so you don't have to generate it yourself (it takes a while): amphibians16S.txt

Cheers, Mark

P.S. I do not know if you are aware, but your whole conversation in the chatroom for this project is publicly visible. You may want to make it private, if possible.

sborstein commented 5 years ago

Hi @markscherz ,

I was not able to replicate it. It ran correctly for me and returned 7760 sequences, which it found to be the longest for each unique species. I have attached the R code and data object as well as a CSV of the output.

That error message isn't from AnnotationBustR directly; it comes from the ape dependency failing to find the accession. I believe I know what the problem is, and it is a minor issue with your code. Although you are reading your table in and correctly indicating that the data have no header, R still assigns column names, because the data are read in as a table (a data frame). The function takes a vector, so you need to specify the particular column of the data frame. As there are no column names in the file, the column should default to V1, so you would need to use accessions$V1. I'll clarify this in the vignette so that it is clearer to those using reutils to find their sequences how to proceed. Let me know if this fixes your issue, and I will close this after I update the vignette.
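
For readers following along, here is a minimal base-R sketch of the data frame versus vector distinction described above. It is an illustration under stated assumptions, not code from the package or the attached zip: the temporary file stands in for amphibians16S.txt, and the second accession is a made-up placeholder (only MK720938 appears in the thread). The FindLongestSeq call is left commented out because it needs AnnotationBustR and network access.

```r
# Stand-in for amphibians16S.txt: one accession per line.
# "XX000000" is a hypothetical placeholder, not a real accession.
tmp <- tempfile(fileext = ".txt")
writeLines(c("MK720938", "XX000000"), tmp)

accessions <- read.table(tmp, header = FALSE, stringsAsFactors = FALSE)

# read.table returns a data frame and names the unnamed column itself.
print(class(accessions))
print(names(accessions))

# Extract the column to get the character vector the function expects.
acc_vector <- accessions$V1
print(is.character(acc_vector))

# Pasting the whole data frame deparses the column into a c("...") string,
# which is consistent with the id=c("MK720938 fragment in the failing URL.
as_pasted <- paste(accessions, collapse = ",")
print(as_pasted)

# With AnnotationBustR loaded, the corrected call would be (not run here):
# longphibs <- FindLongestSeq(accessions$V1)
```

Running str(accessions) before passing the object on is a quick way to check whether it is a data frame or a plain vector.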

MarkAmphibians.zip

markscherz commented 5 years ago

Ah, excellent @sborstein! Thanks a lot for taking a look at this, and thanks for clarifying. It worked for me too.