ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Problems parsing large downloads #70

Closed pschloss closed 8 years ago

pschloss commented 8 years ago

I'm trying to download the PubMed records that correspond to a search query that has 16,000 hits. When I try to parse the result with parse_pubmed_xml, it fails with a "Segmentation fault: 11" error. Here's my code...

library(rentrez)

query <- "(microbiome OR microbiota) NOT review[ptyp]"
pubmed_search <- entrez_search(db="pubmed", term=query, use_history = TRUE)
pubmed_records <- entrez_fetch(db="pubmed", web_history=pubmed_search$web_history,
                               retmax=pubmed_search$count, rettype="xml", parsed=TRUE)
pubmed_parsed <- parse_pubmed_xml(pubmed_records)

I also tried using XML::xmlToList, but it only seems to get 10,000 of the records. As a workaround, I can extract the PMIDs with pubmed_search$ids and then make a query for each of them (roughly as sketched below), but I suspect that would probably get me in trouble with PubMed. Alternatively, I could break my search up by year, hoping to get under the threshold for parse_pubmed_xml (is there an explicit threshold?). What is the best way of doing what I'm trying to do?
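For reference, the per-PMID workaround I have in mind is roughly this (just an untested sketch; note that pubmed_search$ids only contains the first retmax hits unless retmax is raised):

## rough, untested sketch of the one-query-per-PMID workaround --
## 16,000 separate requests is probably too many to be polite to NCBI
ids <- pubmed_search$ids  # only the first `retmax` hits unless retmax is raised
records <- lapply(ids, function(pmid) {
    rec <- entrez_fetch(db="pubmed", id=pmid, rettype="xml", parsed=TRUE)
    parse_pubmed_xml(rec)
})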

dwinter commented 8 years ago

Thanks for reporting this @pschloss,

There are no special limits in parse_pubmed_xml, so I guess you are running up against a limit on the size of an XML file that can easily be parsed with XPath expressions (trying to run your example I get a different error, about the C stack being full, which I guess means my laptop is failing to keep track of all the elements and relationships in the file).

I think the best workaround will probably be to take the records in chunks, and only process one chunk at a time. So we can set up a function that will download 1000 records starting from an arbitrary place in the ID list:

fetch_and_parse <- function(start){
    cat(start, "\r")  # let the user know where we are up to
    pubmed_records <- entrez_fetch(db="pubmed", web_history=pubmed_search$web_history,
                                   retstart=start, retmax=1000, rettype="xml", parsed=TRUE)
    parse_pubmed_xml(pubmed_records)
}

pubmed_search <- entrez_search(db="pubmed", term=query, use_history = TRUE)
rec_starts <- seq(0, pubmed_search$count - 1, by = 1000)  # 0-based offset for each chunk of 1000
pubmed_parsed <- lapply(rec_starts, fetch_and_parse)

(You will get some warnings about Books, which are now included in PubMed but not yet handled properly by rentrez.)

At this stage pubmed_parsed is a list of lists, so you can de-nest it:

one_list <- do.call(c, pubmed_parsed)

Now it's one big long list of pubmed records. You can give it a pretty print method by changing its class:

class(one_list) <- c("multi_pubmed_record", "list")
one_list
List of 7000 pubmed records

 Grass, Gregor et al. (2015).  Forensic science international. 259:32-35 
 Fernández-Santoscoy, María et al. (2015).  Frontiers in cellular and infection microbiology. 5:93 

Just playing around with this now, the web history object sent by the NCBI seems to go stale even in the short time during which the records are being processed. The speed at which the web histories break seems to vary a lot, but if you hit the same problem you could try downloading all of the PubMed IDs and "chunking" through them (see the sketch after the next code block). You can actually use entrez_fetch to get just the IDs:

pmids <- entrez_fetch(db="pubmed", web_history=pubmed_search$web_history, rettype="uilist")
pmids <- strsplit(pmids, "\n")[[1]]
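Then something like this untested sketch should chunk through the IDs by fetching each batch with the id argument (the batch size of 200 is an arbitrary choice to keep individual requests small):

## untested sketch: split the PMIDs into batches, fetch each batch by ID and parse it
## (the batch size of 200 is arbitrary)
id_chunks <- split(pmids, ceiling(seq_along(pmids) / 200))
parsed_chunks <- lapply(id_chunks, function(ids) {
    recs <- entrez_fetch(db="pubmed", id=ids, rettype="xml", parsed=TRUE)
    parse_pubmed_xml(recs)
})
one_list <- do.call(c, parsed_chunks)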

pschloss commented 8 years ago

Many thanks for the help with this. It turned out that I needed to use your final option; entrez_fetch limited me to 500 PMIDs at a time.

dwinter commented 8 years ago

Glad you got it sorted!