ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

PUBMED scraping MeSh associated terms #148

Closed SalvatoreRa closed 3 years ago

SalvatoreRa commented 4 years ago

Hi, everyone

I want to collect the MeSH terms for a series of PubMed articles. Taking this example: https://www.nlm.nih.gov/bsd/disted/meshtutorial/principlesofmedlinesubjectindexing/exampleofmeshindexing/index.html, I want to obtain something similar for each article in a data frame (df).

I tried to follow the vignette (https://docs.ropensci.org/rentrez/)

library(rentrez)
hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
hox_paper$ids
hox_data <- entrez_link(db="all", id=hox_paper$ids, dbfrom="pubmed")
hox_data
hox_data$links
hox_proteins <- entrez_fetch(db="protein", id=hox_data$links$pubmed_protein, rettype="fasta")
cat(substr(hox_proteins, 1, 237))
hox_mesh<- entrez_fetch(db="protein", id=hox_data$links$pubmed_mesh_major, rettype = "full")

Running this code, I get a large (1.2 MB) character string that is basically unusable.

Alternatively, I found a code snippet for retrieving the abstract from the XML associated with an article:

library(rentrez)
library(XML)

your.ids <- c("26386083","26273372","26066373","25837167","25466451","25013473")
your.ids <- "26386083"
# rentrez function to get the data from pubmed db
fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                             rettype = "xml", parsed = TRUE)
# Extract the Abstracts for the respective IDS.  
abstracts = xpathApply(fetch.pubmed, '//PubmedArticle//Article', function(x)
  xmlValue(xmlChildren(x)$Abstract))
# Change the abstract names with the IDS.
names(abstracts) <- your.ids
abstracts

This code works great for retrieving abstracts, but I did not manage to adapt it to retrieve MeSH terms. I wonder if it would be possible to use something similar to retrieve MeSH terms from the XML (unfortunately, I do not routinely use XML).

Thank you for the help.

dwinter commented 3 years ago

Hi @SalvatoreRa, sorry about taking so long to get to this. Sadly, it seems the MeSH database doesn't allow XML records to be downloaded, so I'm not sure this will be possible: https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

You could get the 'full' (plain text) records for each MeSH hit like this:

hox_mesh <- entrez_fetch(db="mesh", id=hox_data$links$pubmed_mesh_major, rettype="full")

And it might be possible to get parsable data with entrez_summary, depending on exactly what you are hoping to do with it.
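
Coming back to the original question, the abstract snippet above might be adaptable without touching the MeSH database at all. The PubMed XML records store an article's indexing terms as DescriptorName elements inside MeshHeadingList, so the same XPath approach may work there. The following is only a sketch under that assumption. The helper name mesh_terms and the inline sample record are made up for illustration (the sample is heavily trimmed, with a placeholder PMID), and articles that have not been indexed yet will simply contribute no rows.

```r
library(XML)

# Hypothetical helper (not part of rentrez): returns one row per
# (article, MeSH descriptor) pair found in a parsed PubMed XML document.
mesh_terms <- function(doc) {
  articles <- getNodeSet(doc, "//PubmedArticle")
  rows <- lapply(articles, function(article) {
    pmid <- xpathSApply(article, "./MedlineCitation/PMID", xmlValue)
    descriptors <- getNodeSet(article, ".//MeshHeading/DescriptorName")
    # Articles without MeSH indexing contribute no rows.
    if (length(descriptors) == 0) return(NULL)
    data.frame(
      pmid = pmid[1],
      descriptor = vapply(descriptors, xmlValue, character(1)),
      major_topic = vapply(
        descriptors,
        function(d) xmlGetAttr(d, "MajorTopicYN", default = NA_character_),
        character(1)
      ),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Made-up, trimmed sample so the sketch runs offline. With live data, the
# parsed document would instead come from the call used earlier in the thread:
#   doc <- entrez_fetch(db = "pubmed", id = your.ids,
#                       rettype = "xml", parsed = TRUE)
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName MajorTopicYN="N">Animals</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName MajorTopicYN="Y">Homeodomain Proteins</DescriptorName>
          <QualifierName MajorTopicYN="N">genetics</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- xmlParse(sample_xml, asText = TRUE)
mesh_df <- mesh_terms(doc)
print(mesh_df)
```

Qualifiers (QualifierName) are ignored here to keep the data frame flat. If they matter, they could be collected per MeshHeading in the same way.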