sborstein / AnnotationBustR

Error in seqinr #19

Open ThomasLemarcis opened 3 years ago

ThomasLemarcis commented 3 years ago

Hi,

First of all thank you for this R package, it's a really useful tool.

Nevertheless, I run into an issue when I use the following script to retrieve my sequences from GenBank:

install.packages("AnnotationBustR")
library(AnnotationBustR)

# File with all the accessions:

ncbi.accessions<-c("MT374078","MN983150", "MN983149", "NC_049091", "NC_048962", "MT240815", "MT240814", "MT240813", "MT240812", "MT240811", "MT240810", "MT240809", "MT240808", "MT240807", "MT240806", "MT240805", "MT240804", "MT043269", "MN583349", "MW044625", "MW376482", "MW316790", "MW316791", "MW316792", "MW316795", "MW316796", "MW316798", "MT408027", "MT232845", "MT415926", "MN871953", "MT755650", "MT755651", "MT762151", "MT762153", "MT762152", "MT762154", "MT762155", "MT768039", "MT768040", "MT768043", "MT768044", "MT755652", "MT755854", "MT762157", "MT768042", "MT755653", "MT762156", "MT768041", "MT768045", "MW194096", "MW768045", "MT755649", "MW548267", "MH140432")

# Loads the mitochondrial DNA search terms for metazoans

data(mtDNAterms)

# Run AnnotationBust

my.seqs<-AnnotationBust(Accessions=ncbi.accessions, Terms=mtDNAterms, DuplicateSpecies=TRUE, Prefix="Demo", TidyAccessions=TRUE)

Then I have this error message:

Error in seqinr::query(paste("SUB", paste0("AC=", new.access), "AND T=CDS", : Unable to get any answer from socket after 10 trials.

Surprisingly, I obtain this error at different steps, sometimes with the first GenBank accession number but other times with the second or the third one. I know about the "NC" sequences problem with seqinr but the first sequences in my list are not "NC" sequences.

Is there any problem with my script?

Thanks a lot in advance for your answer.

Best Regards.

Thomas Lemarcis.

PeteCowman commented 3 years ago

Hi,

I am also having the same error. As with Thomas, it can happen at different accession numbers. There are no "NC_" accessions in my list.

I think it might have something to do with connecting to pbil.univ-lyon1.fr. I have had this issue intermittently, and it has been reported before as an issue with the seqinr dependency.

Error in seqinr::query(paste("SUB", paste0("AC=", new.access), "AND T=CDS", : Unable to get any answer from socket after 10 trials.
In addition: Warning message:
In browseURL(paste0("http://127.0.0.1:", port, "/library/", pkgname, : closing unused connection 3 (->pbil.univ-lyon1.fr:5558)

Cheers,

Pete

sborstein commented 3 years ago

Hi Thomas and Pete,

Pete is correct: this is an issue with seqinr, a dependency we use to connect to the ACNUC server, which is used to extract subsequences. It is not an issue with AnnotationBustR itself. It could be due to a couple of things. One, if your internet connection is unreliable, it might struggle to connect and/or keep a connection. The other, more likely reason is that their server is down or having issues. They did just update ACNUC with the new GenBank release, and I've noticed in the past that there are occasionally issues connecting to their server after these updates and that it takes a while to become stable again. Unfortunately, I don't have a great solution for this, as I'm not affiliated with either seqinr or the ACNUC team. I will look into it more, including better connection stability when connecting to ACNUC via seqinr within AnnotationBustR, but again, this appears to be a seqinr-specific issue.

As far as solutions go, it is difficult, as AnnotationBustR requires a connection to ACNUC through seqinr. I can recommend a few things. First, only extract what you need. For example, if you only need coding sequences, subset those out of mtDNAterms by doing something like cds.terms<-mtDNAterms[mtDNAterms$Type=="CDS",]. This will greatly speed up extraction, as extracting multiple types of sequences increases run time (especially when extracting the mitochondrial D-Loop), and, more importantly, it will limit your time on their server. Obviously, this may not be applicable if you are interested in extracting all components of a mitochondrial genome (CDS, rRNA, tRNA, D-Loop), but it should help if you only need a few loci.
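The subsetting suggested above can be sketched without any server connection. This is only an illustration: `toy.terms` is a hypothetical stand-in for `mtDNAterms`, and its `Locus` and `Name` columns and the non-CDS `Type` values are assumptions; only the `Type` column and its `"CDS"` value come from the comment above. Check `head(mtDNAterms)` for the real layout before adapting it.

```r
# Hypothetical stand-in for mtDNAterms. Only the Type column and the "CDS"
# value are taken from the thread; everything else here is assumed.
toy.terms <- data.frame(
  Locus = c("COI", "CYTB", "rRNA_12S", "tRNA_Phe", "D_loop"),
  Type  = c("CDS", "CDS", "rRNA", "tRNA", "D-loop"),
  Name  = c("COX1", "cytochrome b", "12S ribosomal RNA", "tRNA-Phe",
            "control region"),
  stringsAsFactors = FALSE
)

# See which sequence types are present before deciding what to keep
print(table(toy.terms$Type))

# Keep only the coding sequences, as in the suggestion above
cds.terms <- toy.terms[toy.terms$Type == "CDS", ]

# Or keep just a handful of loci by name (assumes a Locus column exists)
wanted.loci <- c("COI", "CYTB")
few.terms <- toy.terms[toy.terms$Locus %in% wanted.loci, ]

print(cds.terms)
```

With the real package, the reduced table would then be passed as the `Terms` argument of `AnnotationBust` in place of the full `mtDNAterms`.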

The second, though more tedious, option is to pick up where your connection stopped: restart on whatever accession it stopped on and give the run a different name with the Prefix argument. This is obviously not ideal, as it will require you to combine FASTA files in the end, but it is one way around faulty connections to the ACNUC server. You could also try batching your accessions with this method (say, 5-10 at a time) with different prefixes, as connection errors seem most likely to occur when retrieving sequences for lots of accessions.
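The batching idea above could look roughly like the sketch below. The helpers `batch_accessions` and `run_batches` are hypothetical names, not part of AnnotationBustR, and the extraction step is passed in as a function so the logic can be tried with a mock before touching the server. Failed batches are recorded rather than aborting the whole run, so they can be re-submitted later under a new prefix.

```r
# Split a vector of accessions into consecutive batches of at most `size`.
batch_accessions <- function(accessions, size = 5) {
  split(accessions, ceiling(seq_along(accessions) / size))
}

# Call extract_fun(batch, prefix) once per batch, giving each batch its own
# prefix. Errors are caught so one failed batch does not stop the others.
run_batches <- function(accessions, extract_fun, size = 5, prefix = "Batch") {
  batches <- batch_accessions(accessions, size)
  results <- vector("list", length(batches))
  failed <- list()
  for (i in seq_along(batches)) {
    batch.prefix <- sprintf("%s%02d", prefix, i)
    res <- tryCatch(
      list(ok = TRUE, value = extract_fun(batches[[i]], batch.prefix)),
      error = function(e) {
        message("Batch ", i, " failed: ", conditionMessage(e))
        list(ok = FALSE, value = NULL)
      }
    )
    if (res$ok) {
      results[i] <- list(res$value)
    } else {
      failed[[batch.prefix]] <- batches[[i]]
    }
  }
  list(results = results, failed = failed)
}

# Mock extraction step that imitates a dropped connection for one accession.
mock.extract <- function(batch, prefix) {
  if ("MN983149" %in% batch) {
    stop("Unable to get any answer from socket after 10 trials.")
  }
  data.frame(Accession = batch, Prefix = prefix, stringsAsFactors = FALSE)
}

demo <- run_batches(
  c("MT374078", "MN983150", "MN983149", "MT240815", "MT240814"),
  mock.extract, size = 2
)
print(demo$failed)

# With the real package (needs a working ACNUC connection), the extraction
# step could be written with the arguments used earlier in this thread:
# real.extract <- function(batch, prefix) {
#   AnnotationBust(Accessions = batch, Terms = mtDNAterms,
#                  DuplicateSpecies = TRUE, Prefix = prefix,
#                  TidyAccessions = TRUE)
# }
# out <- run_batches(ncbi.accessions, real.extract, size = 5, prefix = "Demo")
```

Anything listed in `demo$failed` can be passed back to `run_batches` with a different `prefix`. The per-batch FASTA files still have to be combined afterwards.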

Best, -Sam

ThomasLemarcis commented 3 years ago

Hi Sam and Pete,

Thanks for your answers, they're really helpful.

I tried different strategies to find out if I can solve the problem:

None of these strategies worked...

But I have a new problem: my script now returns an error on the first sample. I don't know why I can no longer do anything with my script; it's really strange. Do you have any idea what might have changed since my first tries?

Thanks again for your answers.

Cheers,

Thomas.

sborstein commented 3 years ago

Hi Thomas,

Do you mean whether I know if anything has changed with the ACNUC server it connects to? The latest seqinr release notes are dated 2018, so it doesn't seem like they have changed the functions used to connect to ACNUC, and my guess is that this is related to connecting to the ACNUC server itself. As I'm not affiliated with ACNUC and the only access we have to it in R is through seqinr, I'm not sure if something on their end has changed. That being said, if you are continuing to get an error that is something along the lines of Error in seqinr::query(paste("SUB", paste0("AC=", new.access), "AND T=CDS", : Unable to get any answer from socket after 10 trials. this indicates that, for whatever reason, it is failing to properly connect or is losing its connection to the remote ACNUC database via seqinr, whether on the first accession or a later one.

I just re-ran the accessions you sent and was able to extract several of them successfully before getting a connection error. The ACNUC server can be temperamental. There have been times in the past when it was down for a while and then went back to normal. I'm looking into this and will try to see if there are ways on my end to improve ACNUC connection stability, but I can't provide a timeline for when a push to GitHub/CRAN might occur.
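Because the server seems to recover on its own, one user-side workaround is to wait and retry when the socket error appears. The sketch below is not something AnnotationBustR or seqinr provides: `retry_on_socket_error` is a hypothetical helper, and it recognizes the failure only by matching the error text quoted in this thread. Other errors are re-raised unchanged.

```r
# Call fun() up to max.tries times, pausing between attempts, but only when
# the error message matches the socket failure quoted in this thread.
retry_on_socket_error <- function(fun, max.tries = 3, wait.seconds = 30,
                                  pattern = "Unable to get any answer from socket") {
  msg <- ""
  for (attempt in seq_len(max.tries)) {
    res <- tryCatch(
      list(ok = TRUE, value = fun()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (res$ok) {
      return(res$value)
    }
    msg <- conditionMessage(res$error)
    # Anything that is not the socket error is re-raised as-is
    if (!grepl(pattern, msg, fixed = TRUE)) {
      stop(res$error)
    }
    if (attempt < max.tries) {
      message("Attempt ", attempt, " failed; waiting ", wait.seconds, " s")
      Sys.sleep(wait.seconds)
    }
  }
  stop("Still failing after ", max.tries, " attempts: ", msg)
}

# Mock call that fails twice with the socket error and then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) {
    stop("Unable to get any answer from socket after 10 trials.")
  }
  "done"
}

answer <- retry_on_socket_error(flaky, max.tries = 5, wait.seconds = 0)
print(answer)

# Real use would wrap the extraction call (needs a working connection):
# my.seqs <- retry_on_socket_error(function() {
#   AnnotationBust(Accessions = ncbi.accessions, Terms = mtDNAterms,
#                  DuplicateSpecies = TRUE, Prefix = "Demo",
#                  TidyAccessions = TRUE)
# }, max.tries = 3, wait.seconds = 60)
```

How `AnnotationBust` treats files left over from a failed attempt is not covered in this thread, so using a fresh `Prefix` for each retry, as suggested earlier, may be the safer choice.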

Best, -Sam