Closed gadepallivs closed 8 years ago
Hi @Monty9 -- looks like the NCBI just doesn't have this information. You can check out the XML in your browser
But should all these articles have these fields? This is used when another article directly refers to this one (as in a "News and Views" piece in the same issue of the journal, an article in reply, or a retraction notice).
Closing this now, but check out plyr::rbind.fill for creating tables from lists with possibly missing elements.
Is it rbind_fill ?
I thought it was too -- but it turns out that's your version of the function in fulltext!
But on second thought I'm not even sure you need this @Monty9 -- the elements are already in each list, it's just that they have a NULL value.
Ha. Woops
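To illustrate the point about NULL values, here is a minimal base R sketch. It assumes each record is a plain list in which absent fields are NULL; the `records` list and the `null_to_na` helper are made up for illustration and are not part of rentrez or plyr.

```r
# Hypothetical records: every field is present, but some hold NULL
records <- list(
  list(uid = "1", title = "A", volume = NULL),
  list(uid = "2", title = NULL, volume = "12")
)

# Replace NULL with NA so each field contributes exactly one value
null_to_na <- function(x) if (is.null(x)) NA_character_ else as.character(x)

# Build a one-row data frame per record, then stack the rows
rows <- lapply(records, function(rec) {
  as.data.frame(lapply(rec, null_to_na), stringsAsFactors = FALSE)
})
tab <- do.call(rbind, rows)
print(tab)
```

Because every record already carries the same field names, plain `rbind` is enough here. Something like plyr::rbind.fill would only be needed if some records lacked a field entirely.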
Hi David,
Sorry for the very late response. I am coming back to the issue with NULL records. When I first started using the rentrez package, I used the entrez_fetch function:
data_pubmed = entrez_fetch(db = "pubmed", id = "25506969", rettype = "xml")
parse.records = parse_pubmed_xml(data_pubmed)
In (function (paper) :
Pubmed record 25506969 is of type 'PubmedBookArticle' which rentrez doesn't know how to parse. Returning empty record
Hence, I switched to entrez_summary and extract_from_esummary. As we discussed above in this issue, some records are NULL because NCBI has no information in the XML file. But I have started noticing a pattern: for both functions, it is book articles, reports, and maybe letters as well that return NULL fields. Why are these article types not well recorded in NCBI?
However, when we search for the PMID on the NCBI website, it does give us some information. Is there a workaround to get information for these kinds of articles? I am trying to build an R Shiny application where the user can see a more curated version of the NCBI search results, as well as additional information from other NCBI databases for each PMID. Losing information is setting me back, and I am unaware of any workaround.
# All these PMIDs have NULL fields.
PM.ID <- c("25834895", "25506969", "25032371", "24983039", "24983034", "24983032", "24983031")
pub.summary <- entrez_summary(db = "pubmed", id = PM.ID,
                              always_return_list = TRUE)
pubrecord.extract <- extract_from_esummary(esummaries = pub.summary,
                                           elements = c("uid", "title",
                                                        "fulljournalname",
                                                        "pubtype", "volume",
                                                        "issue", "pages",
                                                        "lastauthor",
                                                        "pmcrefcount",
                                                        "issn", "pubdate"),
                                           simplify = TRUE)
Hi @Monty9,
Looks like "version 1.0" esummary objects have more information for these records (see the help for entrez_summary to learn the difference).
If that's not enough, I think you will need a function that retrieves information for PubmedBookArticles from PubMed XML files.
If you look at the structure of the XML record returned by
recs = entrez_fetch(db="pubmed", id=PM.ID, rettype = "xml", parsed=TRUE)
You could probably work out how to use xpath queries to get information for each record? Then you could parse the journal articles and the books separately based on their record type (check out the source for parse_one_pubmed and parse_pubmed_xml to get an idea of how to do that).
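As a rough illustration of splitting records by type with xpath, the sketch below uses the XML package on a small hand-written document. The toy XML (IDs included) is invented and only mimics the top-level layout of a PubMed fetch result; with real data, `recs` would be the object returned by entrez_fetch with parsed=TRUE.

```r
library(XML)

# Toy stand-in for a parsed PubMed result; the IDs are invented
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><MedlineCitation><PMID>111</PMID></MedlineCitation></PubmedArticle>",
  "<PubmedBookArticle><BookDocument><PMID>222</PMID></BookDocument></PubmedBookArticle>",
  "<PubmedArticle><MedlineCitation><PMID>333</PMID></MedlineCitation></PubmedArticle>",
  "</PubmedArticleSet>"
)
recs <- xmlParse(toy_xml, asText = TRUE)

# The root of the document is the whole set, not an individual record
root_name <- xmlName(xmlRoot(recs))

# One name per record: this is where the record type shows up
record_types <- unname(xpathSApply(recs, "/PubmedArticleSet/*", xmlName))

# Separate node sets for journal articles and books
articles <- getNodeSet(recs, "/PubmedArticleSet/PubmedArticle")
books <- getNodeSet(recs, "/PubmedArticleSet/PubmedBookArticle")

# Book-specific query, using the assumed BookDocument/PMID layout
book_pmids <- unname(xpathSApply(
  recs, "/PubmedArticleSet/PubmedBookArticle/BookDocument/PMID", xmlValue
))

print(table(record_types))
```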
Hi David,
I went through the source code for the above functions. In parse_one_pubmed, xmlName(paper) retrieves the name of the child node from the given path. As I read the code, this function runs on an object paper, which is basically a record, which in turn is a parsed XML file obtained from entrez_fetch? But when I do the same with xmlNames(recs), on an object obtained from
recs = entrez_fetch(db="pubmed", id=PM.ID, rettype = "xml", parsed=TRUE)
I did not get the expected information to check whether it is a PubmedArticle or a PubmedBookArticle. I posted a question on SO; if you have any input, please help me out. Thank you http://stackoverflow.com/questions/33484988/how-to-access-values-of-sub-nodes-child-with-different-names-in-xml-file
Hi @Monty9,
It can get a bit confusing when functions are designed to be called from the apply family of functions.
In this case paper is a single record in the XML document, one of the nodes matched by /PubmedArticleSet/*. xmlName(paper) tells you what kind of "paper" you are looking at. So if I were you I'd try something like
parse_one_pubmed <- function(paper){
    # The node name identifies the record type
    atype <- xmlName(paper)
    if(atype == "PubmedArticle"){
        return( parse_pubmed_article(paper) )
    }
    if(atype == "PubmedBookArticle"){
        return( parse_pubmed_book(paper) )
    }
    warning("Encountered unknown record type '", atype, "', returning empty object")
    NULL
}
Then you just need to write functions for the book and article types to extract the information you want (you could mostly copy the article). Did you check out the "version 1.0" summaries? They include some of the information you want, I think?
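A sketch of what the two type-specific functions might look like, run on a toy document. Everything here is an assumption for illustration: `parse_pubmed_article`, `parse_pubmed_book`, and `first_value` are hypothetical helpers (not rentrez functions), the XML is hand-written, and the element paths follow the layout PubMed records appear to use, so they should be checked against a real record.

```r
library(XML)

# Return the first match of a relative xpath, or NA when nothing matches
first_value <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# Hypothetical parser for journal articles
parse_pubmed_article <- function(paper) {
  data.frame(type  = "article",
             pmid  = first_value(paper, "./MedlineCitation/PMID"),
             title = first_value(paper, "./MedlineCitation/Article/ArticleTitle"),
             stringsAsFactors = FALSE)
}

# Hypothetical parser for book records
parse_pubmed_book <- function(paper) {
  data.frame(type  = "book",
             pmid  = first_value(paper, "./BookDocument/PMID"),
             title = first_value(paper, "./BookDocument/ArticleTitle"),
             stringsAsFactors = FALSE)
}

# Dispatch on the node name; unknown record types yield NULL
parse_one <- function(paper) {
  switch(xmlName(paper),
         PubmedArticle     = parse_pubmed_article(paper),
         PubmedBookArticle = parse_pubmed_book(paper),
         NULL)
}

# Invented two-record document
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><MedlineCitation><PMID>111</PMID>",
  "<Article><ArticleTitle>A journal paper</ArticleTitle></Article>",
  "</MedlineCitation></PubmedArticle>",
  "<PubmedBookArticle><BookDocument><PMID>222</PMID>",
  "<ArticleTitle>A book chapter</ArticleTitle>",
  "</BookDocument></PubmedBookArticle>",
  "</PubmedArticleSet>"
)
doc <- xmlParse(toy_xml, asText = TRUE)

# rbind silently drops the NULLs returned for unknown record types
parsed <- do.call(rbind, lapply(getNodeSet(doc, "/PubmedArticleSet/*"), parse_one))
print(parsed)
```

Returning one-row data frames with the same columns from both parsers keeps the combined table rectangular even when a field is missing for one record type.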
Hi David,
Thank you for the response. I was trying to do the same thing you suggested, but I am failing to get the names. I will try to get the path and work around it. I tried version 1.0 for entrez_summary and learnt about the differences documented in the R help. It still returns empty values for the fields I need. Extracting each field with xpath seems to be a good approach, and your source code was very helpful for getting started. Thank you.
Hi David,
I thought I would let you know how I fixed the issue of handling PubmedArticle, PubmedBookArticle, and articles with no abstract. The solution is below, and it worked for me. More details on SO: https://stackoverflow.com/questions/33484988/how-to-access-values-of-sub-nodes-child-with-different-names-in-xml-file
PM.ID <- c("25506969", "25032371", "24983039", "24983034", "24983032", "24983031",
           "26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473")
# rentrez function to retrieve the XML file for the above PMIDs
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = TRUE)
abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
                                         '//PubmedBookArticle//Abstract'), function(x) {
    xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts, stringsAsFactors = FALSE)
dim(abstracts)
rownames(abstracts) <- PM.ID
As an extension to this, I am trying to use web_history instead of passing the PMIDs. When I do that, I am not sure how to extract all the PMIDs for the web_history object. It throws an error, invalid 'row.names' length, at rownames(abstracts), because I don't have all the PMIDs.
query = "BRAF[Title/Abstract] AND Lung[Title/Abstract] AND Cancer[Title] AND (2000[PDAT] :2010[PDAT])" # Search results in 21 hits
pubmed_search <- entrez_search(
    db = "pubmed", term = query,
    use_history = TRUE
)
abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
                                         '//PubmedBookArticle//Abstract'), function(x) {
    xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts, stringsAsFactors = FALSE)
dim(abstracts)
rownames(abstracts) <- pubmed_search$ids # has only 20 IDs, 1 ID is missing
This works fine with the web_history object. But I am not sure why pubmed_search$ids misses one PMID, thereby throwing an error.
Is there a workaround?
My goals:
1) For a given query, let's say we get 10,000 hits. I would like to extract information from PubMed (view count, citation count, title, abstract, links, etc.) for all 10,000 hits and rank those PMIDs on the fly.
2) Also, I am trying to understand how the web history object works. Say I run a query that gives me 100 PMID hits (for simplicity), and I pass the web_history object to extract the PubMed information. Then say I run a second query that returns 50 PMIDs. My questions are:
-- Will the web_history object from the 1st search be replaced by the 2nd search? (I would prefer that it appends and remembers my searches.)
-- Of the 50 IDs in my 2nd search, say 20 overlap with the 1st search. Will using the web_history object retrieve the PubMed information for only the 30 new IDs? (Considering I already extracted the information for the 20 matching IDs in the 1st search.)
Any input and suggestions are appreciated.
Thank you
Hi Monty,
When you use web_history, all the IDs that match are stored on the NCBI's server, while at most retmax IDs are in the returned object.
There are a few ways to get around this. Sometimes you have to use entrez_fetch (with rettype="uilist") to get just the IDs. In this case you can retrieve the IDs from the XML:
pmids <- xpathSApply(fetch.pubmed, "//ArticleId[@IdType='pubmed']", xmlValue)
For your other point (2): if you do a second query you will get a second web_history object, which has the IDs for your second query. I don't think it's possible to append one set of IDs to an existing web history. Given this, I think you can see you would end up getting duplicates if the IDs overlapped.
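One way to avoid the invalid 'row.names' length error is to pull the ID and the abstract out of each record together, so the two can never get out of step. The sketch below shows this on a hand-written toy document; the XML, the IDs, and the `first_value` helper are invented for illustration. It also assumes that real PubMed XML can list ArticleId elements for cited references, in which case a document-wide `//ArticleId` query may return more IDs than there are records.

```r
library(XML)

# Toy document: record 111 has an abstract and one cited reference,
# record 333 has no abstract (all values invented)
toy_xml <- paste0(
  '<PubmedArticleSet>',
  '<PubmedArticle>',
  '<MedlineCitation><PMID>111</PMID><Article>',
  '<Abstract><AbstractText>First abstract.</AbstractText></Abstract>',
  '</Article></MedlineCitation>',
  '<PubmedData>',
  '<ArticleIdList><ArticleId IdType="pubmed">111</ArticleId></ArticleIdList>',
  '<ReferenceList><Reference><ArticleIdList>',
  '<ArticleId IdType="pubmed">999</ArticleId>',
  '</ArticleIdList></Reference></ReferenceList>',
  '</PubmedData>',
  '</PubmedArticle>',
  '<PubmedArticle>',
  '<MedlineCitation><PMID>333</PMID><Article></Article></MedlineCitation>',
  '<PubmedData>',
  '<ArticleIdList><ArticleId IdType="pubmed">333</ArticleId></ArticleIdList>',
  '</PubmedData>',
  '</PubmedArticle>',
  '</PubmedArticleSet>'
)
doc <- xmlParse(toy_xml, asText = TRUE)

# Document-wide query: also picks up the cited reference's ID
all_ids <- unname(xpathSApply(doc, "//ArticleId[@IdType='pubmed']", xmlValue))

# First match of a relative xpath within one record, or NA
first_value <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# One row per record, with ID and abstract taken from the same node
records <- getNodeSet(doc, "/PubmedArticleSet/*")
per_record <- data.frame(
  pmid = vapply(records, first_value, character(1),
                path = "./*/ArticleIdList/ArticleId[@IdType='pubmed']"),
  abstract = vapply(records, first_value, character(1),
                    path = ".//Abstract"),
  stringsAsFactors = FALSE
)
print(per_record)
```

Records without an abstract get NA instead of being dropped, so the table always has as many rows as there are records.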
Hi David,
I noticed that the default retmax is 20. But if the web_history object depends on retmax and there is a limit on how many PMIDs I can search in NCBI, how do I retrieve the information (reference count, view count, abstract, title, link, journal type, etc.) for all the hits? Should I stick to parsing the XML? And what is the benefit of web_history if it does not remember my past searches?
retmax only determines how many IDs are in the object returned by entrez_search; all the hits are stored in the web history.
You can fetch every ID using entrez_fetch:
pmids = entrez_fetch(db="pubmed", web_history=pubmed_search$web_history, rettype="uilist")
strsplit(pmids, "\n")
[[1]]
[1] "21227397" "21102258" "20802351" "20526349" "20043261" "19956384"
[7] "19850405" "19472407" "19353596" "19238210" "19010912" "21479466"
[13] "18594528" "17891251" "17510423" "17075123" "17001163" "16376942"
[19] "16166444" "14601056" "12460918"
But since they are already in your XML file you might as well save yourself another call to the NCBI and parse them out.
The webhistory feature does remember your past searches -- you get a new webhistory object every time you use it. The advantages are described in the vignette.
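For a search with thousands of hits, one option is to fetch the records from the web history in batches. The sketch below is an untested outline: `batch_starts` and `fetch_all_xml` are hypothetical helpers, the batch size of 500 is an arbitrary choice, and it assumes that `retstart` (zero-based) and `retmax` are passed through entrez_fetch to the NCBI, as in the batching example in the rentrez vignette. Only the offset arithmetic is exercised here; the fetch function is defined but not called, since it needs network access.

```r
# Zero-based start offsets for fetching `total` records in batches
batch_starts <- function(total, batch_size = 500) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = batch_size)
}

# Hypothetical wrapper: `search` is the result of
# entrez_search(..., use_history = TRUE)
fetch_all_xml <- function(search, batch_size = 500) {
  lapply(batch_starts(search$count, batch_size), function(start) {
    rentrez::entrez_fetch(db = "pubmed",
                          web_history = search$web_history,
                          rettype = "xml", parsed = TRUE,
                          retstart = start, retmax = batch_size)
  })
}

# With 21 hits and batches of 10, three requests would be made
starts <- batch_starts(21, 10)
print(starts)
```

Each element of the returned list would be one parsed XML document, which can then be processed with the same xpath queries used for a single fetch.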
Hi David,
Thank you for the reply. As you suggested, I went with parsing the XML for the IDs.
Hi @Monty9,
Can you open a new issue for this? It's a persistent error, but because I can never catch it in the act I don't know what to check for to give users a more useful error message.
@dwinter @sckott, hi David. I noticed to my surprise today that view counts (article views) are no longer included for each article. Previously, I was able to extract the esummaries for each article, and they included viewcount as well. But I do not see that now. Can you help me understand this change? Thank you
PM.ID <- c("25834895", "25506969", "25032371", "24983039", "24983034", "24983032", "24983031")
pub.summary <- entrez_summary(db = "pubmed", id = PM.ID,
                              always_return_list = TRUE)
pubrecord.extract <- extract_from_esummary(esummaries = pub.summary,
                                           elements = c("uid", "title",
                                                        "fulljournalname",
                                                        "pubtype", "volume",
                                                        "issue", "pages",
                                                        "lastauthor",
                                                        "pmcrefcount", "viewcount",
                                                        "issn", "pubdate"),
                                           simplify = TRUE)
When I print pub.summary, I don't see view count among the fields, although I did successfully extract it in the past. Is there any other way I can get the view count of each article?
Update: I contacted NCBI (pubmedcentral@ncbi.nlm.nih.gov). They replied with the following: "The field viewcount was an outdated internal field. The data within this field was incorrect and had not been updated in several years. Because we had no mechanism to replace or fix this we removed the field so that erroneous data would not be propagated."
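Since fields can disappear from the summaries like this, it may help to compare the requested element names against what a record actually contains before extracting. A minimal base R sketch, using a plain list with invented values as a stand-in for one esummary record (with real data, something like names(pub.summary[[1]]) should give the available fields):

```r
# Stand-in for a single esummary record; the values are invented
record <- list(uid = "25834895", title = "Example title", pmcrefcount = 3)

# Fields we would like to extract
wanted <- c("uid", "title", "pmcrefcount", "viewcount")

# Which requested fields are absent, and which can safely be extracted
missing_fields <- setdiff(wanted, names(record))
available <- intersect(wanted, names(record))

if (length(missing_fields) > 0) {
  message("Not present in this record: ",
          paste(missing_fields, collapse = ", "))
}
```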
Hi David, I am trying to extract the refsource of a journal article using entrez_summary and extract_from_esummary. However, entrez_summary is unable to retrieve the reference information for all the PMIDs; it misses a few. Please find the example below. I am not sure if it is an NCBI issue per se. Is there an alternative way you can suggest to get the refsource? I am trying to create a data frame with PMID and refsource.