functions to clearly parse Nucleotide database efetch?

arw36 commented 6 years ago

Hi,

I am using rentrez to get metadata for a large batch of sequences in the nucleotide database. I would also like to easily download the publication details attached to each sequence, shown online as Author, Title, and Journal. I know you can search by these terms (AUTH, JOUR, etc.), but is there an easy way to retrieve this info given a sequence id? Currently I have:

library(rentrez)
library(XML)
fetch1 <- entrez_fetch(db = "nucleotide", id = "AB000386.1", rettype = "native", parsed = TRUE)
fetch_list <- xmlToList(fetch1)

but I find the fetch list to be really hard to navigate. Maybe I just need assistance with the parsing of this, perhaps an extract_from_efetch function?

Interestingly, entrez_link only gives papers that have cited the sequence, not the original journal article attached to the sequence.

I couldn't find the author, journal, title information in the esummary output.

Any advice is much appreciated.

dwinter commented 6 years ago

Hi @arw36,

I agree that the XML records can be hard work to parse!

I made a design choice early on that rentrez wouldn't try to make user-friendly parsing functions for all of the databases it covers. I'd rather write a really solid low-level package that others can use to develop more "high-level" packages for particular uses.

In this case, I think you can save some of the pain of handling those very dense XML files by fetching one of the simpler record types:

fetch2 <- entrez_fetch(db = "nucleotide", id = "AB000386.1", 
                       rettype = "gbc", retmode="xml", parsed = TRUE)

That makes it easier to find the references:

 refs <- xpathSApply(fetch2, "//INSDSeq_references/INSDReference")
 refs

[[1]]
<INSDReference>
  <INSDReference_reference>1</INSDReference_reference>
  <INSDReference_authors>
    <INSDAuthor>Mori,C.</INSDAuthor>
    <INSDAuthor>Fujita,J.</INSDAuthor>
    <INSDAuthor>Tooriyama,T.</INSDAuthor>
    <INSDAuthor>Takahara,R.</INSDAuthor>
    <INSDAuthor>Takamizawa,A.</INSDAuthor>
  </INSDReference_authors>
  <INSDReference_title>Complete Nucleotide Sequence of the Mumps Virus Urabe Vaccine Strain Genomic cDNA</INSDReference_title>
  <INSDReference_journal>Rinsho To Uirusu 23, 341-352 (1995)</INSDReference_journal>
</INSDReference> 

[[2]]
<INSDReference>
  <INSDReference_reference>2</INSDReference_reference>
  <INSDReference_position>1..15385</INSDReference_position>
  <INSDReference_authors>
    <INSDAuthor>Mori,C.</INSDAuthor>
  </INSDReference_authors>
  <INSDReference_title>Direct Submission</INSDReference_title>
  <INSDReference_journal>Submitted (10-JAN-1997) Chisato Mori, The Research Foundation for Microbial Diseases of Osaka Univ ., Kanonji Institute, Research and Development Division; 2-9-41, Yahata-cho, Kanonji city, Kagawa 768, Japan (E-mail:biken-rd@niji.or.jp, Tel:0875-25-4171, Fax:0875-23-1660)</INSDReference_journal>
</INSDReference>

Or as a list:

 xml_list2 <- xmlToList(fetch2)
 xml_list2$INSDSeq$INSDSeq_references

$INSDReference
$INSDReference$INSDReference_reference
[1] "1"

$INSDReference$INSDReference_authors
$INSDReference$INSDReference_authors$INSDAuthor
[1] "Mori,C."

$INSDReference$INSDReference_authors$INSDAuthor
[1] "Fujita,J."

$INSDReference$INSDReference_authors$INSDAuthor
[1] "Tooriyama,T."

$INSDReference$INSDReference_authors$INSDAuthor
[1] "Takahara,R."

$INSDReference$INSDReference_authors$INSDAuthor
[1] "Takamizawa,A."

$INSDReference$INSDReference_title
[1] "Complete Nucleotide Sequence of the Mumps Virus Urabe Vaccine Strain Genomic cDNA"

$INSDReference$INSDReference_journal
[1] "Rinsho To Uirusu 23, 341-352 (1995)"

$INSDReference
$INSDReference$INSDReference_reference
[1] "2"

$INSDReference$INSDReference_position
[1] "1..15385"

$INSDReference$INSDReference_authors
$INSDReference$INSDReference_authors$INSDAuthor
[1] "Mori,C."

$INSDReference$INSDReference_title
[1] "Direct Submission"

$INSDReference$INSDReference_journal
[1] "Submitted (10-JAN-1997) Chisato Mori, The Research Foundation for Microbial Diseases of Osaka Univ ., Kanonji Institute, Research and Development Division; 2-9-41, Yahata-cho, Kanonji city, Kagawa 768, Japan (E-mail:biken-rd@niji.or.jp, Tel:0875-25-4171, Fax:0875-23-1660)"

Hopefully those records are a bit easier to handle? If you come up with a solution and you are happy to share it, I'd love to add it to the wiki, so let me know.
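One detail of the list form is worth noting: every reference is stored under the same name, so `$INSDReference` returns only the first one. A minimal base-R sketch of visiting all of them follows. The `ref_list` object is a hand-built stand-in for `xml_list2$INSDSeq$INSDSeq_references` (values abbreviated from the output above) so that it runs offline, and `field_or_na` and `refs_to_df` are hypothetical helper names, not part of rentrez or XML.

```r
# Hand-built stand-in for xml_list2$INSDSeq$INSDSeq_references
# (values abbreviated from the output above), so the example runs offline.
ref_list <- list(
  INSDReference = list(
    INSDReference_reference = "1",
    INSDReference_authors = list(INSDAuthor = "Mori,C.", INSDAuthor = "Fujita,J."),
    INSDReference_title = "Complete Nucleotide Sequence of the Mumps Virus Urabe Vaccine Strain Genomic cDNA",
    INSDReference_journal = "Rinsho To Uirusu 23, 341-352 (1995)"
  ),
  INSDReference = list(
    INSDReference_reference = "2",
    INSDReference_position = "1..15385",
    INSDReference_authors = list(INSDAuthor = "Mori,C."),
    INSDReference_title = "Direct Submission",
    INSDReference_journal = "Submitted (10-JAN-1997) Chisato Mori"
  )
)

# Return one field as a string, or NA when the reference lacks it
# (hypothetical helper).
field_or_na <- function(ref, name) {
  val <- ref[[name]]
  if (is.null(val)) NA_character_ else as.character(val)
}

# Build one data frame row per reference (hypothetical helper).
refs_to_df <- function(ref_list) {
  rows <- lapply(unname(ref_list), function(ref) {
    data.frame(
      authors = paste(unlist(ref$INSDReference_authors), collapse = "; "),
      title   = field_or_na(ref, "INSDReference_title"),
      journal = field_or_na(ref, "INSDReference_journal"),
      pubmed  = field_or_na(ref, "INSDReference_pubmed"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

ref_table <- refs_to_df(ref_list)
```

Missing fields such as the PubMed id come back as NA, so records with and without a linked paper end up with the same columns.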

dwinter commented 6 years ago

Also, I just remembered that @bomeara and a few others tried to build a tool for exactly this task at an iEvoBio meeting. Here is the repo, though I don't know whether it will be any help in this case: https://github.com/bomeara/genbankcredit

arw36 commented 6 years ago

Thanks @dwinter, I will try your subset method; it definitely makes the parsing job clearer.

Also, in general this package is so useful and I really appreciate it :)

arw36 commented 6 years ago

Ok, here was my solution:

library(rentrez)  # entrez_fetch
library(XML)      # xmlToList
library(dplyr)    # %>% and bind_rows

# add any columns named in cname that data lacks, filled with NA
fncols <- function(data, cname) {
  add <- cname[!cname %in% names(data)]
  if (length(add) != 0) data[add] <- NA
  data
}
# collect the publication info for one sequence id
# (only the first INSDReference of each record is used)
get_pub_info <- function(i) {
  fetch2 <- entrez_fetch(db = "nucleotide", id = i,
                         rettype = "gbc", retmode = "xml", parsed = TRUE)
  xml_list2 <- xmlToList(fetch2)
  ref_list <- xml_list2$INSDSeq$INSDSeq_references
  # extract publication fields
  authors <- unlist(ref_list$INSDReference$INSDReference_authors) %>% paste(collapse = "; ")
  title <- ref_list$INSDReference$INSDReference_title
  journal <- ref_list$INSDReference$INSDReference_journal
  # text inside the last pair of parentheses in the journal string
  year <- gsub(".*\\((.*)\\).*", "\\1", journal)
  pm_id <- ref_list$INSDReference$INSDReference_pubmed
  remark <- ref_list$INSDReference$INSDReference_remark
  # create data frame of information; optional fields are added only if present
  pub.data <- data.frame(i, authors, journal, year, stringsAsFactors = FALSE)
  if (!is.null(title)) pub.data$title <- title
  if (!is.null(pm_id)) pub.data$pubmed_id <- pm_id
  if (!is.null(remark)) pub.data$remark <- remark
  # return the row with all optional columns present
  fncols(pub.data, c("title", "pubmed_id", "remark"))
}
sequence_list <- c("AB687721.2", "AB600942.1", "AJ880277.1")
list_of_dfs <- lapply(sequence_list, get_pub_info) # run function on list of sequences
df_combine <- bind_rows(list_of_dfs)
colnames(df_combine)[1] <- "NCBI_idv"
# split the remark into the text before "DOI:", the doi, and the text after ";"
df_combine <- tidyr::separate(df_combine, remark, c("text", "doi"), sep = "DOI:") # extract doi
df_combine <- tidyr::separate(df_combine, doi, c("doi", "text2"), sep = ";")
# rebuild the remark without the doi and drop the helper columns
df_combine$remark <- paste(df_combine$text, df_combine$text2)
df_combine$text <- NULL
df_combine$text2 <- NULL

This outputs a nice data frame of publication information (I had a list of a couple of thousand sequences I needed info on), or shows that a sequence is unpublished. I then plan on running this through the metagear package to extract the full texts.
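One thing to watch in the year column: the pattern `.*\\((.*)\\).*` keeps whatever sits inside the last pair of parentheses, and returns the string unchanged when there are none. For a "Submitted (10-JAN-1997) ... (E-mail: ...)" journal string that would be the contact details rather than a year. A possible alternative, as a base-R sketch only: `extract_year` is a hypothetical helper that assumes the year is the first four-digit run inside the first parenthesised chunk.

```r
# Pull a four-digit year out of a GenBank journal string (hypothetical helper).
# Assumes the year sits in the first parenthesised chunk, e.g. "(1995)" or
# "(10-JAN-1997)"; returns NA when no such chunk or no four-digit run exists.
extract_year <- function(journal) {
  # first "(...)" chunk in the string, or character(0) if there is none
  paren <- regmatches(journal, regexpr("\\([^)]*\\)", journal))
  if (length(paren) == 0) return(NA_character_)
  # first run of four digits inside that chunk
  year <- regmatches(paren, regexpr("[0-9]{4}", paren))
  if (length(year) == 0) NA_character_ else year
}

published <- extract_year("Rinsho To Uirusu 23, 341-352 (1995)")
submitted <- extract_year("Submitted (10-JAN-1997) Chisato Mori (E-mail:biken-rd@niji.or.jp)")
unpublished <- extract_year("Unpublished")
```

The function takes one string at a time, so inside a data frame it could be applied with something like `vapply(df_combine$journal, extract_year, character(1))`.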

ropensci / rentrez

functions to clearly parse Nucleotide database efetch? #113