Closed · markscherz closed this issue 4 years ago
Hi @markscherz,
Can you send me the accessions and the code you are running so I can attempt to replicate and troubleshoot it? It could be throttling, although I have run this function on thousands of sequences before.
Best, -Sam
Hey Sam,
library(AnnotationBustR)
library(reutils)
search <- esearch(term = "txid8292[Organism:exp] 16S[title]", db = "nuccore", usehistory = TRUE)
accessions <- efetch(search, rettype = "acc", retmode = "text", outfile = "amphibians16S.txt") # this returns about 56,000 accession numbers
accessions <- read.table("amphibians16S.txt", header = FALSE, stringsAsFactors = FALSE)
longphibs <- FindLongestSeq(accessions)
Here's the amphibians16S.txt file so you don't have to generate it yourself (it takes a while): amphibians16S.txt
Cheers, Mark

P.S. I do not know if you are aware, but the whole conversation on the chatroom for this project is publicly visible. You may want to make it private, if possible.
Hi @markscherz,
I was not able to replicate it. It ran correctly for me, returning 7760 records, the longest sequence it found for each unique species. I have attached the R code and data object, as well as a CSV of the output.
That error message isn't from AnnotationBustR directly, but from the ape dependency failing to find the accession. I believe I know what the problem is, and it is a minor issue with your code. Although you read your table in correctly and properly indicate that it has no header, R still assigns column names to it as it is read in as a data frame. Because the function takes a vector, you need to specify the particular column of the data frame. With no header, the column name defaults to V1, so you would need to pass accessions$V1. I'll clarify this in the vignette so it is clearer how to proceed for those using reutils to find their sequences. Let me know if this fixes your issue, and I will close this after I update the vignette.
Ah, excellent @sborstein! Thanks a lot for taking a look at this, and thanks for clarifying. It worked for me too.
I am hitting a wall with FindLongestSeq when the number of accessions is in the tens of thousands (possibly hundreds for a single species, but currently unknown):

Error in file(file, "r") : cannot open the connection to 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=c("MK720938&rettype=fasta&retmode=text'
In addition: Warning message:
In file(file, "r") : cannot open URL 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=c("MK720938&rettype=fasta&retmode=text': HTTP status was '400 Bad Request'
I guess NCBI is throttling access to the function beyond a certain number of requests per second. Any suggestions for how to overcome this? Also, this package seems great 👍