ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Unable to use entrez_post and web_history to get large amounts of data #97

Closed xhan85 closed 7 years ago

xhan85 commented 7 years ago

I am trying to get the XML records for about 1,300 publications on PubMed, and I already have a list of all of the IDs for the publications I am interested in. I haven't had any issues using entrez_fetch for several hundred records, but beyond that I get

Error in entrez_check(response): HTTP failure 414, the request is too large. 
For large requests, try using web history as described in the rentrez tutorial

No problem. I follow the tutorial and have the following:

upload <- entrez_post(db="pubmed", id=my.list$IDs)
trial1 <- entrez_fetch(db="pubmed", rettype="xml", parsed=TRUE, web_history=upload)

But I can't get past the first line of code, and I get the exact same error as above. I am very stumped. I thought entrez_post allowed me to post a large quantity of IDs to the NCBI server for later use, but it just keeps saying that the request is too large and to try using web history (which I can't, since it won't even let me post my query...)

dwinter commented 7 years ago

Hi @xhan85, this is happening because the vector of IDs (my.list$IDs) is too long, and unfortunately entrez_post won't help with that, since posting the IDs runs into the same request-size limit.

You can use the web history feature of entrez_search to skip the need for this large vector in the first place: rather than fetching the IDs that match a search, you can store those IDs on the NCBI server as a web history object. Can we make the error message or the tutorial clearer on this point?

If you got your list of IDs from another source, then the web history feature won't help; you will have to send the IDs in "batches" of a few hundred at a time.
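A minimal sketch of the search-based approach (the search term here is just a placeholder for illustration, not a query from this thread):

```r
library(rentrez)

# With use_history=TRUE the matching IDs are stored on the NCBI server
# rather than returned as a (possibly very long) vector of IDs
res <- entrez_search(db="pubmed",
                     term="Tetrahymena thermophila[ORGN]",
                     use_history=TRUE)

# Fetch the full records by referring to the stored IDs;
# no long list of IDs ever travels in the request URL
recs <- entrez_fetch(db="pubmed", web_history=res$web_history, rettype="xml")
```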

xhan85 commented 7 years ago

Hi @dwinter, thank you so much for the explanation. I think the confusion for me was that I have this large vector of IDs because they correspond to specific grant numbers that I am interested in, and there's no easy search term that I could use with entrez_search that would get me the same list of IDs since the publications span all fields.

dwinter commented 7 years ago

Hi @xhan85, I don't know if it will help you, but it is possible to search for papers by grant ID. I think it will only work for NIH grants.

papers_by_grant <- entrez_search(db="pubmed", term="GM101352[GRNT]")
papers_summ <- entrez_summary(db="pubmed", id=papers_by_grant$ids)
cat(extract_from_esummary(papers_summ, "title"), sep="\n")
Hidden genetic variation in the germline genome of Tetrahymena thermophila.
Whole Genome Sequencing of Field Isolates Reveals Extensive Genetic Diversity in Plasmodium vivax from Colombia.
Neutral Models of Microbiome Evolution.
A composite genome approach to identify phylogenetically informative data from next-generation sequencing.
Mutational robustness of morphological traits in the ciliate Tetrahymena thermophila.
Accumulation of spontaneous mutations in the ciliate Tetrahymena thermophila.

But if you just have a large number of PMIDs, I'm afraid you'll have to "chunk" them into a size that can be sent to NCBI. Something like this should generate groups of at most chunk_size IDs:

ids <- sample(1e5, 1200)
chunk_size <- 300
ids_chunked <- split(ids, ceiling(seq_along(ids)/chunk_size))
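The chunks can then be fetched one at a time. A sketch, assuming the same chunking as above; the Sys.sleep pause is my addition to stay polite toward NCBI's rate limits, not something prescribed in this thread:

```r
library(rentrez)

ids <- sample(1e5, 1200)
chunk_size <- 300
ids_chunked <- split(ids, ceiling(seq_along(ids)/chunk_size))

# Fetch each chunk separately, pausing briefly between requests
recs <- lapply(ids_chunked, function(chunk) {
    Sys.sleep(0.4)
    entrez_fetch(db="pubmed", id=chunk, rettype="xml")
})

# recs is now a list of XML strings, one per chunk
```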

I will leave this open until I update the tutorial to make this clear.

xhan85 commented 7 years ago

Thanks @dwinter! I ended up doing it in 3 batches, so it all worked out, and at least now I know it wasn't something I was missing on my end!

sunitj commented 6 years ago

FWIW, I ran into the same issue and ended up taking the chunking approach. Not sure if this bit made it to the documentation. Thanks for the package, though! :)