Closed pschloss closed 8 years ago
Thanks for reporting this @pschloss.
There are no special limits for parse_pubmed_xml, so I guess you are hitting some limit on the size of an XML file that can easily be parsed by XPath expressions (trying to run the example I get a different error about the C stack being full, which I guess means my laptop is failing to keep track of all the elements and relationships in the file).
I think the best work-around will be to take the records in chunks and process one chunk at a time. We can set up a function that downloads 1000 records starting from an arbitrary position in the ID list:
fetch_and_parse <- function(start){
    cat(start, "\r")  # let the user know where we are up to
    pubmed_records <- entrez_fetch(db="pubmed", web_history=pubmed_search$web_history,
                                   retstart=start, retmax=1000, rettype="xml", parsed=TRUE)
    parse_pubmed_xml(pubmed_records)
}
pubmed_search <- entrez_search(db="pubmed", term=query, use_history=TRUE)
pubmed_parsed <- lapply(rec_starts, fetch_and_parse)
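Here `rec_starts` holds the chunk offsets. One way to build it is from the total hit count (a minimal sketch; `total_hits` stands in for `pubmed_search$count`, and note that `retstart` is 0-based):

```r
# Build offsets 0, 1000, 2000, ... covering every hit in chunks of 1000.
# In a live session, total_hits would be pubmed_search$count.
total_hits <- 16000
rec_starts <- seq(0, total_hits - 1, by = 1000)
```

With 16,000 hits this gives 16 offsets, so `lapply(rec_starts, fetch_and_parse)` makes 16 requests of 1000 records each.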
(You will get some warnings about Books, which are now included in PubMed but not properly handled by rentrez yet.)
At this stage pubmed_parsed is a list of lists, so you can de-nest it:
one_list <- do.call(c, pubmed_parsed)
Now it's one big list of pubmed records. You can give it a pretty print method by changing its class:
class(one_list) <- c("multi_pubmed_record", "list")
one_list
List of 7000 pubmed records
Grass, Gregor et al. (2015). Forensic science international. 259:32-35
Fernández-Santoscoy, María et al. (2015). Frontiers in cellular and infection microbiology. 5:93
Just playing around with this now, the web_history object sent by the NCBI seems to go stale even in the short time during which the records are being processed. The speed at which web histories break seems to vary greatly, but if you hit the same problem you could try downloading all of the PubMed IDs and "chunking" through them. You can actually use entrez_fetch to get just the IDs:
pmids <- entrez_fetch(db="pubmed", web_history=pubmed_search$web_history, rettype="uilist")
pmids <- strsplit(pmids, "\n")[[1]]
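From there you can fetch the records in chunks of IDs rather than via the web history. A sketch of what that might look like (the chunk size and the pause between requests are assumptions, not NCBI-mandated values; `entrez_fetch` accepts an `id` argument):

```r
# Split the PMID vector into chunks of 200 and fetch each chunk by ID,
# pausing briefly between requests to stay polite to the NCBI servers.
id_chunks <- split(pmids, ceiling(seq_along(pmids) / 200))
pubmed_parsed <- lapply(id_chunks, function(ids){
    recs <- entrez_fetch(db = "pubmed", id = ids, rettype = "xml", parsed = TRUE)
    Sys.sleep(0.4)
    parse_pubmed_xml(recs)
})
```

Because each request names its IDs explicitly, a stale web history can't break the download partway through.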
Many thanks for the help with this. It turned out that I needed to use your final option; entrez_fetch limited me to 500 PMIDs at a time.
Glad you got it sorted!
I'm trying to download the PubMed records that correspond to a search query that has 16,000 hits. When I try to parse this with parse_pubmed_xml it gives a "Segmentation fault: 11" error. Here's my code...
I also tried using XML::xmlToList, but it only seems to get 10,000 of the records. As a work-around, I can extract the PMIDs with pubmed_search$id and then make a query for each of those, but I suspect that would probably get me in trouble with PubMed. Alternatively, I could break my search up by year, hoping to get under the threshold for parse_pubmed_xml (is there an explicit threshold?). What is the best way of doing what I'm trying to do?