ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Lifespan of web_history object #110

Closed pschloss closed 6 years ago

pschloss commented 7 years ago

I have a big search where I'm trying to extract the doi from each record. The search hits about 3.5 million records and using the web_history object I pull them down 10000 at a time. When I run this, it consistently dies 46 hrs (~2.4 million records) into the job. I save the web_history object to an rdata file. When I read this back in a subsequent session, it's unable to start over, which makes me think that the web_history object has vaporized.

I can divide my work into smaller jobs to get the full download, but before refactoring my code, I was wondering whether the 46 hr window is a known "feature" of the web_history approach. For reproducibility purposes it might be nice to be able to store the object, but if not, then so be it.

Thanks, Pat

dwinter commented 7 years ago

Hi @pschloss,

web_history objects (or really the things they point to) definitely seem to have a finite life on the NCBI's servers, but I'm not sure this is something that is explicitly covered in the NCBI docs. Let me get in touch with NCBI support and see if I can provide any details (and include the same in rentrez's docs).

dwinter commented 7 years ago

Hi @pschloss ,

Word from on high is that WebHistory objects definitely have a short lifetime. There is no explicit time at which one is deleted on the server side, but 8 hrs is "a good rule of thumb".
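Given that rule of thumb, one option is to record when a search was run and re-run it once the web_history is likely to have expired. This is only a rough sketch: `timed_search`, `is_stale` and `refresh_if_stale` are made-up helper names (not part of rentrez), and the 8-hour default is just the informal figure quoted above, not a documented limit.

```r
# Hypothetical helper: run a search with use_history = TRUE and remember
# when it was created, so its age can be checked later.
timed_search <- function(db, term) {
  s <- rentrez::entrez_search(db = db, term = term, use_history = TRUE)
  list(search = s, db = db, term = term, created = Sys.time())
}

# TRUE if the search is at least max_age_hours old.
# The 8-hour default is an assumption based on the rule of thumb above.
is_stale <- function(created, max_age_hours = 8, now = Sys.time()) {
  as.numeric(difftime(now, created, units = "hours")) >= max_age_hours
}

# Re-run the search if the stored web_history is probably gone,
# otherwise return the existing object unchanged.
refresh_if_stale <- function(ts, max_age_hours = 8) {
  if (is_stale(ts$created, max_age_hours)) {
    timed_search(ts$db, ts$term)
  } else {
    ts
  }
}
```

Bear in mind that a re-run search can return a different record count if the database has been updated in the meantime, so any retstart offsets saved from the earlier search may no longer line up exactly.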

I will keep this issue open until I've made a point of this in the vignette.

Not sure if there is much I can suggest for the more general problem -- if there is a way to batch up the search (using [PDAT] to process one year at a time?) I would do that. It is also possible to get paper URLs (but not necessarily DOIs...) using entrez_link:

library(rentrez)

# Search PubMed, then ask for the LinkOut URLs of the returned IDs
s <- entrez_search(db = "pubmed", term = "Ascomycetes[ORGN]", retmax = 20)
lo <- entrez_link(dbfrom = "pubmed", cmd = "llinks", id = s$ids)
linkout_urls(elink = lo)
#> $ID_28783439
#> [1] "http://www.tandfonline.com/doi/full/10.1080/21505594.2017.1342920"
#>
#> $ID_28782804
#> [1] "http://dx.doi.org/10.1111/nph.14714"
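
For the year-at-a-time idea mentioned above, something along these lines might work. Treat it as a sketch only: `year_terms`, `chunk_starts` and `fetch_year` are hypothetical helpers, and the chunk size of 500 is an assumption rather than a tested limit. Each year gets its own fresh web_history, so no single history has to survive for the whole job.

```r
# Build one search term per publication year by adding a [PDAT] filter.
year_terms <- function(base_term, years) {
  sprintf("(%s) AND %d[PDAT]", base_term, as.integer(years))
}

# Zero-based retstart offsets needed to page through `count` records.
chunk_starts <- function(count, chunk_size) {
  if (count < 1) return(numeric(0))
  seq(0, count - 1, by = chunk_size)
}

# Hypothetical helper: search one year's term with its own web_history
# and pull the summaries down in chunks. chunk_size = 500 is an assumption.
fetch_year <- function(term, chunk_size = 500) {
  s <- rentrez::entrez_search(db = "pubmed", term = term, use_history = TRUE)
  lapply(chunk_starts(s$count, chunk_size), function(start) {
    rentrez::entrez_summary(db = "pubmed", web_history = s$web_history,
                            retstart = start, retmax = chunk_size)
  })
}
```

Usage would be something like `lapply(year_terms("Ascomycetes[ORGN]", 2015:2017), fetch_year)`, saving each year's results to disk as it finishes so a failed job can pick up from the last completed year.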

Sorry I can't be more help