Unable to parse multiple affiliations by author and pmid using entrez_fetch

santoshmungle commented 7 years ago

library(rentrez)
library(XML)

pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope Simulation Model", 
                              retmax = 10)
SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", 
                              parsed=TRUE)

xmlGetValue <- function(x, node){
  a <- xpathSApply(x, node, xmlValue)
  if(length(a) == 0) {a <- NA} else {a}
}

parse_paper <- function(paper){
  pmid <- xmlGetValue(paper, ".//ArticleId[@IdType='pubmed']")
  first_names <- xmlGetValue(paper, ".//Author/ForeName")
  last_names <- xmlGetValue(paper, ".//Author/LastName")
  affiliation <- xmlGetValue(paper, ".//AffiliationInfo/Affiliation")
  data.frame(pmid=pmid, first_names=first_names, last_names=last_names,
             affiliation=affiliation)
}  

parse_multiple_papers <- function(papers){
  res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
  do.call(rbind.data.frame, res)
}

test_df <- parse_multiple_papers(SearchResults)

Above is the code I am using to parse affiliations by author and PMID (article ID). It works except when an author has multiple affiliations. In that case the columns have different lengths and data.frame() returns an error. I am expecting a result like the one below:

pmid         first_names  last_names              affiliation
27869504     Luca           Villa         Division of Experimental Oncology/Unit of Urology, URI , IRCCS Ospedale San Raffaele, Milan, Italy 
27869504     Luca           Villa         Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France
27869504     Tarik Emre     Şener         Department of Urology, Tenon Hospital, Pierre and Marie Curie University , Paris, France
27869504     Tarik Emre     Şener         Department of Urology, Marmara University School of Medicine , Istanbul, Turkey

dwinter commented 7 years ago

Hi, this is really the same problem you were having earlier: because the XPath queries extract information from all nodes with a given name, the mapping from author to affiliation is broken when there is more than one affiliation per author (more than one AffiliationInfo entry under an Author node in the XML).

The way to fix this is to parse each <Author> tag by itself and, if you want one row per author, paste multiple affiliations into one string (my particular solution might not be the most powerful or efficient, but it works):

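# Parse a single <Author> node so names and affiliations stay paired.
parse_author <- function(author){
  fn  <- xmlValue(author[["ForeName"]])
  ln  <- xmlValue(author[["LastName"]])
  # Join every affiliation of this author into one "; "-separated string.
  aff <- paste(xpathApply(author, "AffiliationInfo/Affiliation", xmlValue), collapse="; ")
  list(forname=fn, lastname=ln, affiliation=aff)
}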

parse_paper <- function(paper){
  author_info <- xpathApply(paper, ".//AuthorList/Author", parse_author)
  res <- do.call(rbind.data.frame, author_info)
  res$pmid <-xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  res
}

parse_multiple_papers <- function(papers){
 res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
 do.call(rbind.data.frame, res)
}

head(parse_multiple_papers(SearchResults))
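
The code above returns one row per author, with affiliations joined by "; ". If one row per author-affiliation pair is preferred, as in the expected output in the original question, a minimal sketch might look like the following. Note that parse_author_long and parse_paper_long are hypothetical names (not part of rentrez or XML), and the inline XML is a made-up record that only mimics the PubMed layout discussed in this thread, so the example runs without a network call.

```r
library(XML)

# Hypothetical helper: one row per (author, affiliation) pair.
parse_author_long <- function(author){
  fn   <- xmlValue(author[["ForeName"]])
  ln   <- xmlValue(author[["LastName"]])
  affs <- xpathSApply(author, "AffiliationInfo/Affiliation", xmlValue)
  # Keep authors without any affiliation as a single row with NA.
  if (length(affs) == 0) affs <- NA_character_
  # fn and ln have length 1 and are recycled across all affiliations.
  data.frame(forname=fn, lastname=ln, affiliation=affs,
             stringsAsFactors=FALSE)
}

# Hypothetical helper: stack the per-author rows and attach the PMID,
# using the same XPath expressions as the code earlier in this thread.
parse_paper_long <- function(paper){
  rows <- xpathApply(paper, ".//AuthorList/Author", parse_author_long)
  res  <- do.call(rbind, rows)
  res$pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  res
}

# Made-up record for illustration only (not real PubMed data).
xml_txt <- '<PubmedArticleSet><PubmedArticle>
  <MedlineCitation><Article><AuthorList>
    <Author><LastName>Doe</LastName><ForeName>Jane</ForeName>
      <AffiliationInfo><Affiliation>Dept A</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Dept B</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Roe</LastName><ForeName>Sam</ForeName></Author>
  </AuthorList></Article></MedlineCitation>
  <PubmedData><ArticleIdList>
    <ArticleId IdType="pubmed">1</ArticleId>
  </ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>'

doc <- xmlParse(xml_txt, asText=TRUE)
long_df <- do.call(rbind, xpathApply(doc, "/PubmedArticleSet/*", parse_paper_long))
print(long_df)
```

This sketch assumes every paper has at least one Author node with ForeName and LastName children; a paper without authors would still need a guard before the pmid column is assigned.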

santoshmungle commented 7 years ago

Thanks, this is great. I am having an issue with this code when the Author and Affiliation nodes are absent. For example, for term = "23395881" in entrez_search, the record has no Author or Affiliation nodes. Earlier I was handling this by using xmlGetValue instead of xmlValue:

xmlGetValue <- function(x, node){
  a <- xpathSApply(x, node, xmlValue)
  if(length(a) == 0) {a <- NA} else {a}
}

santoshmungle commented 7 years ago

I figured out the solution to my problem with the absent nodes. Thank you for your help. It really means a lot to me.

dwinter commented 7 years ago

Hi @santosh26a ,

Glad you worked this out before I could get to it. Could you share your answer? Because these issues are google-able, it can be helpful to others (and me!) to see what you worked out.

santoshmungle commented 7 years ago

Sorry for the late reply.

Here is what I did that worked for me:

parse_paper <- function(paper){
  author_info <- xpathApply(paper, ".//AuthorList/Author", parse_author)
  res <- do.call(rbind.data.frame, author_info)
  if (length(res)==0){
    res <- data.frame(forname=NA,lastname=NA, affiliation=NA)
  }
  res$pmid <-xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  res
}

Sometimes the nodes for author names and affiliations were both absent, which left res as an empty data frame, so the res$pmid assignment raised an error. To handle that, I added an if condition that substitutes a single row of NA values, which worked very well for me.
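
For anyone who wants to see the failure in isolation, here is a small sketch that reproduces it without calling PubMed. It assumes, as described above, that a paper with no Author nodes leaves parse_paper with an empty list of author rows. The guard shown tests nrow(res) rather than length(res); that is only an alternative spelling of the same check, not a correction.

```r
# With no <Author> nodes there are no author rows to bind, so this is the
# data frame parse_paper() ends up with: zero rows and zero columns.
res <- do.call(rbind.data.frame, list())

# Assigning a length-1 value to a column of a zero-row data frame fails;
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch({res$pmid <- "23395881"; "no error"},
                error = function(e) conditionMessage(e))
print(msg)

# Guard: fall back to a single row of NA values before adding the PMID.
if (nrow(res) == 0) {
  res <- data.frame(forname=NA, lastname=NA, affiliation=NA)
}
res$pmid <- "23395881"
print(res)
```

With the guard in place, a paper without authors still contributes one row, so its PMID is kept in the combined result.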

Thanks again for helping me with the issue.

ropensci / rentrez

Unable to parse multiple affiliations by author and pmid using entrez_fetch #98