ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

Get abstract of a given PMID #100

Closed vojtechhuser closed 7 years ago

vojtechhuser commented 7 years ago

For the PubMed db, which formats does entrez_fetch support? Only XML?

What is the best way to retrieve the abstract of an article?

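library(rentrez)
# JAMA articles published in 2016, excluding comments
s <- 'JAMA [jour] AND ("2016/01/01"[PDat] : "2016/12/31"[PDat]) not "comment"[ptyp] '
res <- entrez_search(db="pubmed", term=s, retmax=80000)
# pick a single PMID from the search results
pd <- res$ids[[444]]

# summary records for the first 400 hits
r <- entrez_summary(db="pubmed", id=head(res$ids, 400))

r[[2]]$pubtype
# fetch the full record for one PMID as parsed XML
f <- entrez_fetch(db="pubmed", id=pd, rettype='xml', parsed=TRUE)
f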
vojtechhuser commented 7 years ago

A solution using RISmed (though I would prefer to use rentrez) is:

library(RISmed)
# s is the search string defined in the previous comment
search_topic <- s
search_query <- EUtilsSummary(search_topic, retmax=400 )
summary(search_query)

# see the ids of our returned query
QueryId(search_query)

# get actual data from PubMed
records<- EUtilsGet(search_query)
class(records)

# store it
pubmed_data <- data.frame('Title'=ArticleTitle(records),'Abstract'=AbstractText(records))
head(pubmed_data,1)
dwinter commented 7 years ago

Hi @vojtechhuser , thanks for your question.

entrez_fetch can retrieve any format listed in this table, but it can only parse XML documents 'on the fly'. As it happens, the NCBI provides an "abstract" format for PubMed, which is XML.

So, to get the records you could do something like the following (the extra bit in the search field should remove any papers that are indexed but have no abstract):

 s <-'JAMA [jour] AND ("2016/01/01"[PDat] : "2016/12/31"[PDat]) not "comment"[ptyp] and  "has abstract"[Filter] '
 pubmed_s <- entrez_search(db="pubmed", term=s, retmax=8)
 raw_abs <- entrez_fetch(db="pubmed", id=pubmed_s$ids, rettype="abstract")
 cat(substr(raw_abs, 1, 400), "\n")
<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2017//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_170101.dtd">
<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">28027373</PMID>
        <DateCreated>
            <Year>2016</Year>
            <Month>12</Month>
            <Day>27</D 

Now you can use XML or xml2 to get what you want from the records. xml2 is generally considered to be more straightforward, but I only know the older XML well. So here is how I would extract the abstract text from this object.

# parse the raw XML string into an internal document
parsed_abs <- XML::xmlTreeParse(raw_abs, useInternalNodes=TRUE)
# extract the text of every <Abstract> node, one element per article
just_the_abstracts <- XML::xpathSApply(parsed_abs, "//Abstract", XML::xmlValue)
just_the_abstracts
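
For comparison, here is a rough sketch of the same extraction with xml2, pairing each abstract with its PMID. It is not taken from rentrez itself. The `sample_xml` string is a hand-written, hypothetical stand-in for `raw_abs` (real PubMed records carry many more fields), and the PMIDs in it are made up, so treat this as an illustration under those assumptions rather than a tested recipe for live data.

```r
library(xml2)

# Hypothetical stand-in for the string returned by entrez_fetch();
# the second article deliberately has no <Abstract> element.
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000001</PMID>
      <Article>
        <Abstract><AbstractText>First abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000002</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(sample_xml)

# One node per article, so PMIDs and abstracts stay aligned
articles <- xml_find_all(doc, "//PubmedArticle")

# xml_find_first() returns one result per article; a missing node becomes NA
abstracts <- data.frame(
  pmid = xml_text(xml_find_first(articles, ".//PMID")),
  abstract = xml_text(xml_find_first(articles, ".//Abstract")),
  stringsAsFactors = FALSE
)
abstracts
```

Working per article rather than with a single "//Abstract" query keeps the rows aligned even if a record without an abstract slips through the search filter. With real data you would pass `raw_abs` to `read_xml()` in place of `sample_xml`.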