ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez
Other
194 stars 38 forks source link

Using web history instead of ids returns more results than expected #193

Open stitam opened 6 months ago

stitam commented 6 months ago

When I run entrez_search() with use_history and retmax = 5, the function will respect this and return 5 UIDs:

biosample_uid <- entrez_search(db = "biosample", term = "Microthrix parvicella", use_history = TRUE, retmax = 5)
length(biosample_uid$ids)
5

When I convert these to assembly UID without using web history, the entrez_link() will return 5 UIDs, as expected:

assembly_uid <- entrez_link(dbfrom = "biosample", id = biosample_uid$ids, db = "assembly")
length(assembly_uid$links$biosample_assembly)
5

However, when I use the web history, the function will return more than 5 UIDs:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
14

Even if I set retmax:

assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly", retmax = 5)
length(assembly_uid$links$biosample_assembly)
14

It seems this is not an entrez_link() thing, it also occurs e.g. with entrez_fetch():

meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
16

Again, ids instead of web history work fine:

meta <- entrez_fetch(db = "biosample", id = biosample_uid$ids, rettype = "full", retmode = "xml")
xmllist <- xml2::as_list(xml2::read_xml(meta))[[1]]
length(xmllist)
5

Any idea what is going on here? Many thanks.

allenbaron commented 6 months ago

It looks like entrez_link() doesn't support the retmax parameter (see The E-utilities In-Depth: Parameters, Syntax and More). The web history contains all the search hits (not affected by retmax), so the subsequent entrez_link() call using the web history returns them all as well.

entrez_fetch() does support the retmax parameter, so meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml", retmax = 5) returns the only 5 results.

https://www.nlm.nih.gov/dataguide/eutilities/history.html#storing-results-to-the-history-server-using-esearch may also have relevant information.

rentrez does have entrez_post() that can be used to modify web history. Usually, the application is adding IDs to a web history. I can't speak to removing or limiting the IDs in a web history. I guess it might be possible to create a new web history from the existing one but I can't say how off the top of my head. You'd have to dig into the documentation.

stitam commented 6 months ago

Many thanks @allenbaron for pointing out that entrez_fetch() also supports retmax I think this will solve my issue.

stitam commented 6 months ago

I can confirm that entrez_fetch() works perfectly with histories; if a run entrez_search() with history and receive e.g. dozens of thousands of ids, I can use a for loop and iteratively adjust the retstart argument to get all the data in chunks, using the history. If I want, I can also adjust the retmax argument to control the size of the chunks. Thanks @allenbaron for pointing this out!

However, entrez_link() behaviour is strange. What I want to do is run entrez_search() to get many many ids, then run entrez_link() to convert these ids and finally entrez_fetch() to get the data; and I want to use histories for this flow. If I break down my entrez_search() using retstart and retmax, then I will have multiple webhistory objects, but entrez_link() will not respect them, which risks that I will have duplicates in the end. However, if entrez_search() (with a huge retmax value) returns all ids in one go, entrez_link() will not convert all of them:

biosample_uid <- entrez_search(db = "biosample", term = "Escherichia coli", use_history = TRUE, retmax = 20000)
length(biosample_uid$ids)
20000
assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
294

So in this case I lose a large number of ids. My questions:

Any idea how I can efficently e.g. convert all E.coli BioSample IDs to Assembly IDs and then fetch the assembly summaries?

vtorda commented 5 months ago

Hi, I have recently played with rentrez, and I have noticed that not necessarily all linked ID will have data in the target database. I searched for taxon ids based on a term that I linked with the nuccore database. I noticed that even though there is a taxon ID linked with nuccore, it won't fetch you nuccore data. Usually, these IDs stand for general higher-level taxa or unclassified terms. Could be similar in your case? Also, why would you expect to get as many assemblies and as many biosample IDs you have?

stitam commented 4 months ago

Thanks @vtorda, I suspect my issue is different. I would not expect all biosamples to have a corresponding assembly but for e.g. Escherichia coli I would expect to find many biosamples and also many assemblies. If I query the NCBI Assembly database directly:

assembly_uid_no_link <- entrez_search(db = "assembly", term = "Escherichia coli", retmax = 20000)
length(assembly_uid_no_link$ids)
20000

So if I look for E. coli assemblies directly, I get 20k of them (FYI, there are currently >250k assemblies overall on NCBI), however, if I first look for E. coli biosamples and then link them to assemblies, I get 294. Since all assemblies have corresponting biosamples, I would expect to get the same number of assemblies from 1. requesting biosample ids and converting to assembly ids and 2. requesting assembly ids directly.