ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Unexpected behaviour with entrez_summary() version default setting #105

Closed · rossmounce closed this issue 7 years ago

rossmounce commented 7 years ago

Default settings: unexpected zero

nseq <- entrez_search(db = "nuccore", term = "Piperales[ORGN]",
                      retmax = 9999, use_history = TRUE)
length(nseq$ids)
[1] 7110
x_summ <- entrez_summary(db = "nuccore",
                         web_history = nseq$web_history,
                         rettype = "xml")
length(x_summ)
[1] 0

Yet when version "1.0" is specified, I get the expected results:

nseq <- entrez_search(db = "nuccore", term = "Piperales[ORGN]",
                      retmax = 9999, use_history = TRUE)
x_summ <- entrez_summary(db = "nuccore",
                         version = "1.0",
                         web_history = nseq$web_history,
                         rettype = "xml")
# x_summ <- entrez_summary(db = "nuccore", id = nseq$ids)
length(x_summ)
[1] 7110

For completeness' sake:

x_summ <- entrez_summary(db = "nuccore",
                         version = "2.0",
                         web_history = nseq$web_history,
                         rettype = "xml")
# x_summ <- entrez_summary(db = "nuccore", id = nseq$ids)
length(x_summ)
[1] 0

In the documentation for entrez_summary() I note it says: "Existing scripts which relied on the structure and naming of the "Version 1.0" summary files can be updated by setting the new version argument to "1.0"." Perhaps that hints at this issue?

Either way, I don't understand this behaviour; at the very least it might need documenting.

dwinter commented 7 years ago

Hi @rossmounce ,

Thanks for this. Digging into it, it seems two things are happening here:

(a) rentrez is failing to parse out an error message in the JSON that gets returned:

[1] "{\n    \"header\": {\n        \"type\": \"esummary\",\n        \"version\": \"0.3\"\n    },\n    \"error\": \"Too many UIDs in request. Maximum number of UIDs is 500 for JSON format output.\"\n}\n\n"

(b) rentrez is failing to warn you (or error out) that version 2.0 records are only available as JSON, so your setting of rettype is being silently overridden.

It seems you can either use the XML records, or try to "chunk" the JSON records if those are easier to handle. Chunking may make more sense, as the esummary objects follow the naming style used at NCBI, and the JSON and XML records differ in this regard.
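
(For concreteness, here is a minimal sketch of that chunking, in pages of 500 to match the JSON limit in the error above. It assumes retstart and retmax are forwarded to esummary through entrez_summary()'s ... argument, as with the chunked entrez_fetch() calls in the rentrez web-history vignette; treat it as a sketch, not a tested recipe.)

library(rentrez)
nseq <- entrez_search(db = "nuccore", term = "Piperales[ORGN]",
                      retmax = 9999, use_history = TRUE)
chunk_size <- 500                 # JSON limit reported in the error above
all_summs <- list()
for (start in seq(0, nseq$count - 1, by = chunk_size)) {
    recs <- entrez_summary(db = "nuccore",
                           web_history = nseq$web_history,
                           version = "2.0",
                           retstart = start, retmax = chunk_size)
    all_summs <- c(all_summs, recs)
    Sys.sleep(0.4)                # stay under NCBI's request-rate limits
}
length(all_summs)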

I'll leave this open until entrez_summary can throw errors/warnings for these cases.

rossmounce commented 7 years ago

@dwinter thanks for the explanation.

I'd prefer to take larger chunks, say 10,000 records at a time, so I can get entire orders like Piperales[ORGN] in one chunk, so I'll stick with version 1.0 and XML in the meantime.

What do you think would be an 'optimum' or recommended per-chunk size for returning a large number of records beyond the 100,000 limit of retmax (using retstart)? I know/fear that if I set my per-chunk size too large, the process becomes more liable to failure and connection issues.
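
(To illustrate, a rough sketch of the retstart paging in question; the chunk size is a placeholder to tune, and it assumes retstart is passed through entrez_search()'s ... argument to esearch. For a term like Piperales[ORGN] a single page would do; the loop only matters for terms with more hits than one page can hold.)

count_only <- entrez_search(db = "nuccore", term = "Piperales[ORGN]",
                            retmax = 0)     # retrieve only the hit count
page <- 10000                               # per-chunk size; tune down if requests fail
ids <- character(0)
for (start in seq(0, count_only$count - 1, by = page)) {
    chunk <- entrez_search(db = "nuccore", term = "Piperales[ORGN]",
                           retstart = start, retmax = page)
    ids <- c(ids, chunk$ids)
}
length(ids)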

dwinter commented 7 years ago

Hi @rossmounce, I suspect that if you ask NCBI they are going to say 500 is the optimum.

Obviously you can get more than that with XML records. It is definitely the case that larger requests are more likely to fail (and when they fail they just fail, so you get no intermediate results), but it is not really possible to give exact limits. In my experience, connection issues vary day to day and by time of day, presumably as a function of the load on NCBI during US working hours and whatever is going on server-side at NCBI.
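
(If you do go with large chunks, one way to soften "when they fail they just fail" is to wrap each request in tryCatch() and retry after a pause, so a single bad connection doesn't cost the whole run. This helper is purely illustrative, not part of rentrez.)

fetch_with_retry <- function(web_history, start, size, tries = 3) {
    for (i in seq_len(tries)) {
        res <- tryCatch(
            entrez_summary(db = "nuccore", web_history = web_history,
                           version = "1.0", retstart = start, retmax = size),
            error = function(e) NULL)       # swallow this attempt's error
        if (!is.null(res)) return(res)
        Sys.sleep(5 * i)                    # back off a little more each time
    }
    stop("chunk starting at ", start, " failed after ", tries, " tries")
}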

Sorry I can't provide more helpful advice on that front.

rossmounce commented 7 years ago

Thanks. I'll try 5,000 first, but if I get too high an error rate I'll reduce it.

dwinter commented 7 years ago

OK, I think this should be dealt with in the develop branch. I'll get it merged into master and onto CRAN in the next few days.