ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Timeout Error only in loop #198

Open chemoton opened 5 days ago

chemoton commented 5 days ago

Hi,

I am trying to get the lineage data for a set of taxIDs. A single call to entrez_fetch() works fine:

    tax_rec <- entrez_fetch(db="taxonomy", id=coi1[1,1], rettype="xml", parsed=TRUE)

Here coi1 is a data frame whose first column holds the taxIDs. However, when I loop through the IDs in all rows, I always get the following error:

[image: screenshot of the error message]

The error appears at a random index: for example, the loop got through 46 records the first time and 89 the second. I have tried playing around with the httr::GET config as suggested in another issue (#87), but it did not help. I doubt that I even used it appropriately, as I could not find usage examples. The code ran, but it still produced the error above at a random index.

I have found the following info in the wiki, under "Slowing rentrez down when you hit the rate-limit": rentrez won't let you send requests to the NCBI at a rate higher than the rate-limit, but it is sometimes possible that they will arrive too close together and produce errors. If you are using rentrez functions in a for loop and find rate-limiting errors are occurring, you may consider adding a call to Sys.sleep(0.1) before each message sent to the NCBI. This will ensure you stay below the rate limit.

So I included it in my loop, but it did not solve the issue either. As the individual requests always work, I highly doubt it is a network or NCBI issue.

Full code for looping through IDs:

    y <- list()
    for (i in 1:nrow(coi1)){
      if (coi1[i,1] == 1) {
        tax_rec <- entrez_fetch(db="taxonomy", id=coi1[i,1], rettype="xml", parsed=TRUE)
        tax_list <- XML::xmlToList(tax_rec)
        y[[i]] <- tax_list$Taxon$ScientificName
        Sys.sleep(0.1)
      } else {
        tax_rec <- entrez_fetch(db="taxonomy", id=coi1[i,1], rettype="xml", parsed=TRUE)
        tax_list <- XML::xmlToList(tax_rec)
        y[[i]] <- tax_list$Taxon$Lineage
        Sys.sleep(0.1)
      }
    }

Any input/help appreciated

allenbaron commented 5 days ago

> I have found the following info in the wiki, under "Slowing rentrez down when you hit the rate-limit": rentrez won't let you send requests to the NCBI at a rate higher than the rate-limit, but it is sometimes possible that they will arrive too close together and produce errors. If you are using rentrez functions in a for loop and find rate-limiting errors are occurring, you may consider adding a call to Sys.sleep(0.1) before each message sent to the NCBI. This will ensure you stay below the rate limit.

I think you probably still have rate-limiting issues. With the for loop, you are currently circumventing the rate control mechanisms of entrez_fetch(). Instead of sending many requests with one ID each, it would be better to make a few requests (or possibly a single request, depending on how many IDs you have) with many IDs each. entrez_fetch() is vectorized: its id argument accepts a vector of IDs.
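A minimal sketch of the batched approach, assuming the rentrez and XML packages are installed and that the taxonomy response has the usual TaxaSet/Taxon layout. The helper names (chunk_ids, parse_taxa, fetch_lineages) and the batch size of 50 are illustrative assumptions, not part of rentrez:

```r
# Hypothetical helper: split a vector of IDs into batches of at most `batch_size`.
chunk_ids <- function(ids, batch_size = 50) {
  unname(split(ids, ceiling(seq_along(ids) / batch_size)))
}

# Hypothetical helper: extract TaxId, ScientificName and Lineage for every
# top-level <Taxon> in one parsed response. Assumes a <TaxaSet><Taxon>... layout.
parse_taxa <- function(doc) {
  taxa <- XML::getNodeSet(doc, "/TaxaSet/Taxon")
  field <- function(node, name) {
    value <- XML::xpathSApply(node, paste0("./", name), XML::xmlValue)
    # A missing element (e.g. no Lineage for the root taxon) becomes NA.
    if (length(value) == 0) NA_character_ else value[[1]]
  }
  data.frame(
    tax_id  = vapply(taxa, field, character(1), name = "TaxId"),
    name    = vapply(taxa, field, character(1), name = "ScientificName"),
    lineage = vapply(taxa, field, character(1), name = "Lineage"),
    stringsAsFactors = FALSE
  )
}

# One request per batch of IDs instead of one request per ID.
fetch_lineages <- function(ids, batch_size = 50) {
  results <- lapply(chunk_ids(ids, batch_size), function(batch) {
    doc <- rentrez::entrez_fetch(db = "taxonomy", id = batch,
                                 rettype = "xml", parsed = TRUE)
    parse_taxa(doc)
  })
  do.call(rbind, results)
}

# Usage with the data frame from the question (makes network requests):
# lineages <- fetch_lineages(coi1[, 1])
```

Returning the TaxId alongside each lineage means the results can be matched back to the rows of coi1 without relying on the order of the response.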

Also, do you have an API key (https://www.biostars.org/p/299812/)? Some rate and API information can be found at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/.
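If occasional timeouts persist even with a key and batched requests, one option that goes beyond what is suggested above is to retry a failed call after a short pause. The following is a rough sketch only: with_retry is a hypothetical helper, not a rentrez function, and the key string in the usage comment is a placeholder.

```r
# Hypothetical helper: call `fn()` and, on error, retry up to `max_tries` times,
# waiting a little longer before each new attempt.
with_retry <- function(fn, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait * attempt)
  }
  stop("all ", max_tries, " attempts failed; last error: ",
       conditionMessage(result))
}

# Usage (requires rentrez and network access; `ids` is a vector of taxIDs):
# rentrez::set_entrez_key("YOUR_API_KEY")  # placeholder key
# doc <- with_retry(function() {
#   rentrez::entrez_fetch(db = "taxonomy", id = ids, rettype = "xml", parsed = TRUE)
# })
```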