ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

How to output metadata of interest? #183

Closed michael-mazzucco closed 1 year ago

michael-mazzucco commented 1 year ago

Hello! Thank you for maintaining this amazing package. I was following the code in this post: https://quantixed.org/2021/04/04/ten-years-vs-the-spread-ii-calculating-publication-lag-times-in-r/ and was amazed at the ability to output received, accepted and published dates and the gaps between them. Would there be a way to get any of the following:

- number of authors (to be fair, I could write a counter for separators for this one)
- first author affiliation
- last author affiliation
- number of citations per article
- degree of the first author

Or is there a way to see the full set of fields that can be pulled? Any advice is appreciated, thanks again!

michael-mazzucco commented 1 year ago

Reprex for clarity, as suggested on Stack Overflow:

#load in packages
library(reprex)
library(devtools)
#> Loading required package: usethis
install_github("ropensci/rentrez")
#> Skipping install of 'rentrez' from a github remote, the SHA1 (a225f213) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(rentrez)
require(XML)
#> Loading required package: XML
require(ggplot2)
#> Loading required package: ggplot2
require(ggridges)
#> Loading required package: ggridges
require(gridExtra)
#> Loading required package: gridExtra
# search pubmed using a search term (use_history allows retrieval of all records)
pp <- entrez_search(db="pubmed", term="cell[ta] AND 2010 : 2021[pdat] AND (journal article[pt] NOT review[pt] NOT comment[pt]
                    NOT autobiography[pt] NOT biography[pt] NOT case reports[pt] NOT clinical trial[pt]
                    NOT historical article[pt] NOT comparative study[pt] NOT evaluation study[pt]
                    NOT evaluation study[pt] NOT introductory journal article[pt])", use_history = TRUE)
pp_rec <- entrez_fetch(db="pubmed", web_history=pp$web_history, rettype="xml", parsed=TRUE)
# save records as XML file
# (create the target directory first, otherwise saveXML() fails with "cannot create file")
filename <- "Data/records.xml"
dir.create(dirname(filename), showWarnings = FALSE, recursive = TRUE)
saveXML(pp_rec, file = filename)
## extract a data frame from XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  library(XML)
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  authAffil <- lapply(records, xpathSApply, ".//Author/AffiliationInfo", xmlValue)
  authAffil[sapply(authAffil, is.list)] <- NA
  authAffil <- sapply(authAffil, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors,authAffil, stringsAsFactors = FALSE)

  return(theDF)
}
#extract into a dataframe
theData <- extract_xml(filename)
#show the author affiliations as bunched
print(theData$authAffil[1])
#> [1] "Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA. Electronic address: kjsiddle@broadinstitute.org.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA 02114, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA; Applied Epidemiology Fellowship, Council of State and Territorial Epidemiologists, Atlanta, GA 30345, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute 
of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Health and the Environment, Barnstable, MA 02630, USA.|Barnstable County Department of Human Services, Barnstable, MA 02630, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Community Tracing Collaborative, Commonwealth of Massachusetts, Boston, MA 02199, USA.|Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. 
Chan School of Public Health, Harvard University, Boston, MA 02115, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Massachusetts Department of Public Health, Boston, MA 02199, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA. Electronic address: bronwyn@broadinstitute.org.|Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA; Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA; Massachusetts Consortium for Pathogen Readiness, Boston, MA 02115, USA."

Created on 2022-11-05 with reprex v2.0.2

quantixed commented 1 year ago

Hello @mickmars51, I would say that this is more of a question about data wrangling with the output from rentrez than an issue with rentrez itself, so it should be closed. I have answered your question on SO: https://stackoverflow.com/a/74339329/12286645

You can inspect the XML file that entrez_fetch() pulls in your code example (saved as Data/records.xml) to see what can be parsed from it. AFAIK, the call you are making in your example pulls all available data for the records.
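One way to do that inspection without reading the file by eye is to list the element names it contains. This is only a minimal sketch, assuming the XML package already used in the reprex; the inline record is a hand-written, hypothetical stand-in for Data/records.xml, not real PubMed output.

```r
library(XML)

# Hand-written stand-in for a fetched record (hypothetical content).
# In practice, parse the saved file instead: xmlParse("Data/records.xml")
xml_txt <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <AuthorList>
          <Author>
            <LastName>Smith</LastName>
            <Initials>J</Initials>
            <AffiliationInfo><Affiliation>Example University, USA.</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Name of every element in the document, then a count per distinct name
node_names <- xpathSApply(doc, "//*", xmlName)
node_counts <- table(node_names)
print(node_counts)
```

Any element name that shows up in the table can then be targeted with an XPath expression in the same way extract_xml() targets Author/LastName.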

michael-mazzucco commented 1 year ago

Hello @quantixed, thank you for giving me an answer! Also, thank you for writing the original article; I'm a huge fan of your work. Unfortunately, the data wrangling approach won't solve my issue, because the author order and the affiliation order don't line up (shown in the example in my SO comment). However, what I am really trying to get from the affiliations is which country the first and senior authors are publishing from, so maybe that information can be pulled in another way and stitched together?
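The mismatch seems to come from ".//Author/AffiliationInfo" flattening every affiliation in a record into one vector, so an author with two affiliations, or none, shifts everything after them. A possible way around it is to walk each Author node separately and read the affiliations inside it. The sketch below assumes the XML package from the reprex; the helper name author_summary and the three-author record are hypothetical, made up for illustration.

```r
library(XML)

# Hypothetical record: first author has two affiliations,
# middle author has none, last author has one
xml_txt <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author>
            <LastName>First</LastName><Initials>A</Initials>
            <AffiliationInfo><Affiliation>Inst A, Boston, MA, USA.</Affiliation></AffiliationInfo>
            <AffiliationInfo><Affiliation>Inst B, Cambridge, MA, USA.</Affiliation></AffiliationInfo>
          </Author>
          <Author>
            <LastName>Middle</LastName><Initials>B</Initials>
          </Author>
          <Author>
            <LastName>Last</LastName><Initials>C</Initials>
            <AffiliationInfo><Affiliation>Inst C, London, UK.</Affiliation></AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(xml_txt, asText = TRUE)
records <- getNodeSet(doc, "//PubmedArticle")

# Hypothetical helper: one row per record with the author count and the
# first/last author affiliations, each read from inside its own <Author> node
author_summary <- function(rec) {
  authors <- getNodeSet(rec, ".//AuthorList/Author")
  affil_of <- function(a) {
    aff <- xpathSApply(a, "./AffiliationInfo/Affiliation", xmlValue)
    # xpathSApply() returns an empty list when nothing matches
    if (length(aff) == 0) NA_character_ else paste(aff, collapse = "; ")
  }
  n <- length(authors)
  data.frame(
    pmid = xpathSApply(rec, "./MedlineCitation/PMID", xmlValue),
    n_authors = n,
    first_affil = if (n > 0) affil_of(authors[[1]]) else NA_character_,
    last_affil = if (n > 0) affil_of(authors[[n]]) else NA_character_,
    stringsAsFactors = FALSE
  )
}

theAuthors <- do.call(rbind, lapply(records, author_summary))
print(theAuthors)
```

With the affiliations kept per author, a country could then be guessed from the tail of each string, though text appended to the affiliation (such as the "Electronic address:" entries in the output above) would need stripping first.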

I'd like to leave this open for about a week just to see suggestions. I agree this isn't a rentrez issue, but more of a question about how I can best use its capabilities. The pmcrefcount in particular would be fascinating: I'd like to see whether lag times correlate with citation volume, or whether that has changed over time. Hopefully this gets solved and helps others with similar questions in the future. If I do find a fix, I will be sure to comment here and close the issue immediately. Thank you again!
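For the pmcrefcount part, a rough sketch of what I have in mind is below. It assumes rentrez's entrez_summary() and extract_from_esummary(), and that the PubMed summary records carry a pmcrefcount field; the helper name get_pmc_refcount is hypothetical. As far as I know, pmcrefcount only counts citing articles that are in PubMed Central, so it would be a proxy rather than a full citation count.

```r
# Hypothetical helper: look up pmcrefcount for a vector of PMIDs.
# Needs the rentrez package installed and network access to NCBI when called.
get_pmc_refcount <- function(pmids) {
  # Fetch the summary records for the given PubMed IDs
  summ <- rentrez::entrez_summary(db = "pubmed", id = pmids)
  # Pull out one field; for several IDs this gives one value per record
  rentrez::extract_from_esummary(summ, "pmcrefcount")
}

# Example call (not run here because it queries NCBI):
# refs <- get_pmc_refcount(theData$pmid[1:10])
```

For the full result set, the IDs would probably need to be sent in batches rather than all at once, and the counts joined back to the lag-time data frame by pmid.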