Open stitam opened 6 months ago
It looks like entrez_link()
doesn't support the retmax
parameter (see The E-utilities In-Depth: Parameters, Syntax and More). The web history contains all the search hits (not affected by retmax
), so the subsequent entrez_link()
call using the web history returns them all as well.
entrez_fetch()
does support the retmax
parameter, so meta <- entrez_fetch(db = "biosample", web_history = biosample_uid$web_history, rettype = "full", retmode = "xml", retmax = 5)
returns the only 5 results.
https://www.nlm.nih.gov/dataguide/eutilities/history.html#storing-results-to-the-history-server-using-esearch may also have relevant information.
rentrez does have entrez_post()
that can be used to modify web history. Usually, the application is adding IDs to a web history. I can't speak to removing or limiting the IDs in a web history. I guess it might be possible to create a new web history from the existing one but I can't say how off the top of my head. You'd have to dig into the documentation.
Many thanks @allenbaron for pointing out that entrez_fetch() also supports retmax
I think this will solve my issue.
I can confirm that entrez_fetch()
works perfectly with histories; if a run entrez_search()
with history and receive e.g. dozens of thousands of ids, I can use a for
loop and iteratively adjust the retstart
argument to get all the data in chunks, using the history. If I want, I can also adjust the retmax
argument to control the size of the chunks. Thanks @allenbaron for pointing this out!
However, entrez_link()
behaviour is strange. What I want to do is run entrez_search()
to get many many ids, then run entrez_link()
to convert these ids and finally entrez_fetch()
to get the data; and I want to use histories for this flow. If I break down my entrez_search()
using retstart
and retmax
, then I will have multiple webhistory objects, but entrez_link()
will not respect them, which risks that I will have duplicates in the end. However, if entrez_search()
(with a huge retmax
value) returns all ids in one go, entrez_link()
will not convert all of them:
biosample_uid <- entrez_search(db = "biosample", term = "Escherichia coli", use_history = TRUE, retmax = 20000)
length(biosample_uid$ids)
20000
assembly_uid <- entrez_link(dbfrom = "biosample", web_history = biosample_uid$web_history, db = "assembly")
length(assembly_uid$links$biosample_assembly)
294
So in this case I lose a large number of ids. My questions:
entrez_link()
behaviour?Any idea how I can efficently e.g. convert all E.coli BioSample IDs to Assembly IDs and then fetch the assembly summaries?
Hi, I have recently played with rentrez, and I have noticed that not necessarily all linked ID will have data in the target database. I searched for taxon ids based on a term that I linked with the nuccore database. I noticed that even though there is a taxon ID linked with nuccore, it won't fetch you nuccore data. Usually, these IDs stand for general higher-level taxa or unclassified terms. Could be similar in your case? Also, why would you expect to get as many assemblies and as many biosample IDs you have?
Thanks @vtorda, I suspect my issue is different. I would not expect all biosamples to have a corresponding assembly but for e.g. Escherichia coli I would expect to find many biosamples and also many assemblies. If I query the NCBI Assembly database directly:
assembly_uid_no_link <- entrez_search(db = "assembly", term = "Escherichia coli", retmax = 20000)
length(assembly_uid_no_link$ids)
20000
So if I look for E. coli assemblies directly, I get 20k of them (FYI, there are currently >250k assemblies overall on NCBI), however, if I first look for E. coli biosamples and then link them to assemblies, I get 294. Since all assemblies have corresponting biosamples, I would expect to get the same number of assemblies from 1. requesting biosample ids and converting to assembly ids and 2. requesting assembly ids directly.
When I run
entrez_search()
withuse_history
andretmax = 5
, the function will respect this and return 5 UIDs:When I convert these to assembly UID without using web history, the
entrez_link()
will return 5 UIDs, as expected:However, when I use the web history, the function will return more than 5 UIDs:
Even if I set
retmax
:It seems this is not an
entrez_link()
thing, it also occurs e.g. withentrez_fetch()
:Again, ids instead of web history work fine:
Any idea what is going on here? Many thanks.