ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_summary() unable to retrieve references/refsource for certain PMIDs #67

Closed gadepallivs closed 8 years ago

gadepallivs commented 8 years ago

Hi David, I am trying to extract the refsource of a journal article using entrez_summary and extract_from_esummary. However, entrez_summary is unable to retrieve reference information for all of the PMIDs; it misses a few. Please find an example below. I am not sure whether it is an NCBI issue per se. Is there an alternative way you can suggest to get the refsource? I am trying to create a data frame with PMID and refsource.

PMIDs <- c("26287849", "25979833", "25667274", "25430497", "24968756", "24846037")

pub.summary <- entrez_summary(db = "pubmed", id = PMIDs, always_return_list = TRUE)

> pub.summary$`26287849`$references$refsource
[1] "N Engl J Med. 2015 Aug 20;373(8):691-3"
> pub.summary$`25979833`$references$refsource
NULL
> pub.summary$`25667274`$references$refsource
[1] "J Clin Oncol. 2015 Mar 20;33(9):975-7"
> pub.summary$`25430497`$references$refsource
NULL
> pub.summary$`24968756`$references$refsource
NULL
> pub.summary$`24846037`$references$refsource
[1] "JAMA. 2014 May 21;311(19):1975-6"
dwinter commented 8 years ago

Hi @Monty9 -- looks like the NCBI just doesn't have this information. You can check out the XML in your browser.
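
For instance, the raw version 2.0 summary record for one of the PMIDs that comes back NULL can be viewed at a URL like this (the standard E-utilities esummary endpoint):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=25979833&version=2.0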

But should all these articles have these fields? The refsource field is used when another article directly refers to this one (as in a "News and Views" piece in the same issue of the journal, a reply, or a retraction notice).

dwinter commented 8 years ago

Closing this now, but check out plyr::rbind.fill for creating tables from lists with possibly missing elements.
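
For instance, a minimal sketch along those lines, reusing the pub.summary object from above -- records with no refsource simply omit that column, and rbind.fill() pads it with NA:

library(plyr)  # for rbind.fill()

rows <- lapply(pub.summary, function(rec) {
  row <- data.frame(pmid = rec$uid, stringsAsFactors = FALSE)
  if (!is.null(rec$references$refsource)) {
    row$refsource <- paste(rec$references$refsource, collapse = "; ")
  }
  row
})
ref.table <- rbind.fill(rows)  # refsource is NA where NCBI has no reference info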

sckott commented 8 years ago

Is it rbind_fill?

dwinter commented 8 years ago

I thought it was too -- but it turns out that's your version of the function in fulltext!

But on second thought I'm not even sure you need this @Monty9 -- the elements are already in each list, it's just that they have a NULL value.
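
Something like this should be enough (a small sketch of that idea, no plyr needed -- just swap NULL for NA before building the data frame):

refsource <- vapply(pub.summary, function(rec) {
  rs <- rec$references$refsource
  if (is.null(rs)) NA_character_ else paste(rs, collapse = "; ")
}, character(1))
data.frame(pmid = names(pub.summary), refsource = refsource,
           stringsAsFactors = FALSE)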

sckott commented 8 years ago

Ha. Woops

gadepallivs commented 8 years ago

Hi David, sorry for a very, very late response. Revisiting the issue with NULL records: when I started using the rentrez package, I first implemented the entrez_fetch function.

data_pubmed = entrez_fetch(db = "pubmed", id = "25506969", rettype = "xml")
parse.records = parse_pubmed_xml(data_pubmed)
Warning message:
In (function (paper) :
Pubmed record 25506969 is of type 'PubmedBookArticle' which rentrez doesn't know how to parse. Returning empty record

Hence, I switched to entrez_summary and extract_from_esummary. As we discussed above in this issue, some records are NULL because NCBI has no info in the XML file. But I started noticing a pattern: for both functions it is book articles, reports, and maybe letters as well that return NULL fields. Why are these article types not well recorded in NCBI?

However, when we search the PMID in NCBI it does give us some info. Is there a workaround to get information for these kinds of articles? I am trying to make an R Shiny application where the user can see a more curated version of the NCBI search, as well as additional info from other NCBI databases for each PMID. Losing info is setting me back, and I am unaware of any workaround.

# All these PMID have NULL fields.
PM.ID <- c("25834895","25506969"," 25032371"," 24983039","24983034","24983032","24983031") 
pub.summary <- entrez_summary(db = "pubmed", id = PM.ID ,
                              always_return_list = TRUE)
pubrecord.extract <- extract_from_esummary(esummaries = pub.summary ,
                                           elements = c("uid","title",
                                                        "fulljournalname",
                                                        "pubtype", "volume",
                                                        "issue", "pages",
                                                        "lastauthor",
                                                        "pmcrefcount",
                                                        "issn", "pubdate" ),
                                           simplify = T)
dwinter commented 8 years ago

Hi @Monty9 ,

Looks like "version 1.0" esummary objects have more information for these files (see the help for entrez_summary to learn the difference).
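
For example, something like this (a sketch reusing the PM.ID vector from your comment above; the field names in version 1.0 records differ from the 2.0 ones, so check names() first):

pub.summary.v1 <- entrez_summary(db = "pubmed", id = PM.ID,
                                 version = "1.0", always_return_list = TRUE)
names(pub.summary.v1$`25506969`)  # see which fields this version exposes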

If that's not enough, I think you will need a function that retrieves information from PubmedBookArticle records in the pubmed XML files.

If you look at the structure of the XML record returned by

recs = entrez_fetch(db="pubmed", id=PM.ID, rettype = "xml", parsed=TRUE)

You could probably work out how to use XPath queries to get information for each record. You could then parse the journal articles and the books separately based on their record type (check out the source for parse_one_pubmed and parse_pubmed_xml to get an idea of how to do that).
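
For instance, something like this lists the top-level type of each record, which is what you would dispatch on (a sketch; xpathSApply and xmlName come from the XML package):

library(XML)
# recs is the parsed document from the entrez_fetch() call above
xpathSApply(recs, "/PubmedArticleSet/*", xmlName)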

gadepallivs commented 8 years ago

Hi David, I went through the source code for the above functions. In parse_one_pubmed, xmlName(paper) retrieves the child name from the given path. As I read the code, this function runs on an object paper, which is basically a record, which in turn is a parsed XML file obtained from entrez_fetch? But when I do the same with xmlNames(recs), where recs is the object obtained from

recs = entrez_fetch(db="pubmed", id=PM.ID, rettype = "xml", parsed=TRUE)

I did not get the expected info to check whether a record is a PubmedArticle or a PubmedBookArticle. I posted a question on SO; if you have any input please help me out. Thank you. http://stackoverflow.com/questions/33484988/how-to-access-values-of-sub-nodes-child-with-different-names-in-xml-file

dwinter commented 8 years ago

Hi @Monty9,

It can get a bit confusing when functions are designed to be called from apply-family functions.

In this case paper is a single record in the XML document (one of the nodes matching /PubmedArticleSet/*). xmlName(paper) tells you what kind of "paper" you are looking at. So if I were you I'd try something like

# dispatch on the record type; parse_pubmed_article() and parse_pubmed_book()
# are the two parsers you would write for each type
parse_one_pubmed <- function(paper){
  atype <- xmlName(paper)
  if(atype == "PubmedArticle"){
    return( parse_pubmed_article(paper) )
  }
  if(atype == "PubmedBookArticle"){
    return( parse_pubmed_book(paper) )
  }
  warning("Encountered unknown record type '", atype, "', returning empty object")
  NULL
}

Then you just need to write functions for the book and article types to extract the information you want (you could mostly copy the existing article parser). Did you check out the "version 1.0" summaries? They include some of the information you want, I think?
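
A rough shape for the book branch might look like this (just a sketch: the XPath locations are assumptions based on the PubmedBookArticle/BookDocument layout, so check them against the raw XML for one of these PMIDs):

parse_pubmed_book <- function(paper){
  # 'paper' is a single PubmedBookArticle node
  list(
    title    = xpathSApply(paper, ".//BookDocument/ArticleTitle", xmlValue),
    abstract = xpathSApply(paper, ".//BookDocument/Abstract/AbstractText", xmlValue),
    year     = xpathSApply(paper, ".//BookDocument/Book/PubDate/Year", xmlValue)
  )
}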

gadepallivs commented 8 years ago

Hi David, thank you for the response. I was trying to do the same thing you suggested, but I am failing to get the names; I will try to get the path and work around it. I tried version 1.0 for entrez_summary and learned about the differences documented in the R help, but it still returns empty values for the fields I needed. Extracting each field with XPath seems to be a good approach, and your source code was very helpful to get started. Thank you.

gadepallivs commented 8 years ago

Hi David, I thought I would let you know how I fixed the issue of encountering PubmedArticle, PubmedBookArticle, and articles with no abstract. Below is the solution, and it worked for me. More details on SO: https://stackoverflow.com/questions/33484988/how-to-access-values-of-sub-nodes-child-with-different-names-in-xml-file

PM.ID <- c("25506969"," 25032371","   24983039","24983034","24983032","24983031",
"26386083","26273372","26066373","25837167",
 "25466451","25013473")
# rentrez function to retrieve XMl file for above PIMD
fetch.pubmed <- entrez_fetch(db = "pubmed", id = PM.ID,
                             rettype = "xml", parsed = T)
abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
     '//PubmedBookArticle//Abstract'), function(x) {
  xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts,stringsAsFactors = F)
dim(abstracts)
rownames(abstracts) <- PM.ID
gadepallivs commented 8 years ago

As an extension to this, I am trying to use web_history instead of passing the PMIDs. When I do that, I am not sure how to extract all of the PMIDs for the web_history object. It throws an invalid 'row.names' length error at rownames(abstracts), the reason being that I don't have all the PMIDs.

query = "BRAF[Title/Abstract] AND Lung[Title/Abstract] AND Cancer[Title] AND (2000[PDAT] :2010[PDAT])" # Search results in 21 hits
pubmed_search <- entrez_search(
  db = "pubmed", term = query,
  use_history = TRUE
)

# fetch step implied by the text above: retrieve the records via the search's web_history
fetch.pubmed <- entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
                             rettype = "xml", parsed = TRUE)
abstracts <- xpathSApply(fetch.pubmed, c('//PubmedArticle//Article',
                                         '//PubmedBookArticle//Abstract'), function(x) {
  xmlValue(xmlChildren(x)$Abstract) })
abstracts <- data.frame(abstracts,stringsAsFactors = F)
dim(abstracts)
rownames(abstracts) <- pubmed_search$ids # has only 20 IDs, 1 ID is missing

This works fine with the web_history object. But I am not sure why pubmed_search$ids misses one PMID, thereby throwing an error. Is there a workaround? My goals:

1) For a given query, let's say we get 10,000 hits. I would like to extract info from PubMed (view count, citation count, title, abstract, links, etc.) for all 10K hits and rank those 10K PMIDs on the fly.

2) I am also trying to understand how the web history object works. If I do a query search and it gives me 100 PMID hits (for simplification), I input it as a web_history object for extraction of PubMed info. Then let's say I do a 2nd query search and it results in 50 PMIDs. My questions are:
-- Will the 1st search's web_history objects be replaced by the 2nd search? (I would prefer it to append and remember my searches.)
-- Now, of the 50 IDs in my 2nd search, let's say 20 IDs overlap with the 1st search. Will using the web_history objects retrieve the PubMed info for only the remaining 30 IDs (considering I already extracted the info for the 20 matching IDs in the 1st search)?

Any inputs and suggestions are appreciated, thank you.

dwinter commented 8 years ago

Hi Monty,

When you use web_history, all of the matching IDs are stored on the NCBI's servers, while at most retmax IDs are included in the returned object.

There are a few ways to get around this. Sometimes you have to use entrez_fetch (with rettype = "uilist") to get just the IDs. In this case you can retrieve the IDs from the XML:

pmids <- xpathSApply(fetch.pubmed, "//ArticleId[@IdType='pubmed']", xmlValue)

On your other point (2): if you do a second query you will get a second web_history object, which has the IDs for that second query. I don't think it's possible to append one set of IDs to an existing web history. Given this, I think you can see you would end up getting duplicates if the IDs overlapped.

gadepallivs commented 8 years ago

Hi David, I noticed that the default retmax is 20. But if the web_history object depends on the retmax setting, and there is a limit on how many PMIDs I can search in NCBI, how do I retrieve the information (ref count, view count, abstract, title, link, journal type, etc.) for all the hits? Should I stick to parsing the XML? And what is the benefit of web_history if it does not remember my past searches?

dwinter commented 8 years ago

retmax only determines how many IDs are in the object returned by entrez_search; all of the hits are stored in the web history.

You can fetch every ID using entrez_fetch:

pmids = entrez_fetch(db="pubmed", web_history=pubmed_search$web_history, rettype="uilist")
strsplit(pmids, "\n")
[[1]]
 [1] "21227397" "21102258" "20802351" "20526349" "20043261" "19956384"
 [7] "19850405" "19472407" "19353596" "19238210" "19010912" "21479466"
[13] "18594528" "17891251" "17510423" "17075123" "17001163" "16376942"
[19] "16166444" "14601056" "12460918"

But since they are already in your XML file you might as well save yourself another call to the NCBI and parse them out.

The web history feature does remember your past searches -- you just get a new web history object every time you run a search. The advantages are described in the vignette.

gadepallivs commented 8 years ago

Hi David, thank you for the reply. As you indicated, I went with parsing the XML for the IDs.

dwinter commented 8 years ago

Hi @Monty9,

Can you open a new issue for this? It's a persistent error, but because I can never catch it in the act I don't know what to check for to give users a more useful error message.

gadepallivs commented 8 years ago

@dwinter @sckott, hi David. It caught me by surprise today that view counts (article views) are no longer included for each article. Previously I was able to extract the esummaries for each article, and they included viewcount as well, but I do not see that now. Can you help me understand this change? Thank you.

PM.ID <- c("25834895","25506969"," 25032371"," 24983039","24983034","24983032","24983031") 
pub.summary <- entrez_summary(db = "pubmed", id = PM.ID ,
                              always_return_list = TRUE)
pubrecord.extract <- extract_from_esummary(esummaries = pub.summary ,
                                           elements = c("uid","title",
                                                        "fulljournalname",
                                                        "pubtype", "volume",
                                                        "issue", "pages",
                                                        "lastauthor",
                                                        "pmcrefcount","viewcount",
                                                        "issn", "pubdate" ),
                                           simplify = T)

When I print pub.summary, I don't see the option for view count. Previously I did successfully extract the view count. Is there any other way I can get the view counts for each article?

Update: I contacted NCBI (pubmedcentral@ncbi.nlm.nih.gov), and they replied saying: "The field viewcount was an outdated internal field. The data within this field was incorrect and had not been updated in several years. Because we had no mechanism to replace or fix this we removed the field so that erroneous data would not be propagated."