ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_summary does not retrieve all the record #177

Closed tdyoshida closed 2 years ago

tdyoshida commented 2 years ago

Thanks for developing an excellent tool!

I'm having trouble retrieving the Summary field of Entrez Gene records using entrez_summary. For some records the field comes back empty even though the gene has a Summary. Below is an example.

# Retrieve records of id = 1146 and 11449; they are orthologous genes in human and mouse, respectively.
sum_1146 <- entrez_summary(db = "gene", id = 1146)
sum_1146$name
[1] "CHRNG" # Successfully retrieved the correct gene.

sum_1146$summary
[1] "The mammalian muscle-type acetylcholine receptor is a transmembrane pentameric glycoprotein with two alpha subunits, one beta, one delta, and one epsilon (in adult skeletal muscle) or gamma (in fetal and denervated muscle) subunit. This gene, which encodes the gamma subunit, is expressed prior to the thirty-third week of gestation in humans. The gamma subunit of the acetylcholine receptor plays a role in neuromuscular organogenesis and ligand binding and disruption of gamma subunit expression prevents the correct localization of the receptor in cell membranes. Mutations in this gene cause Escobar syndrome and a lethal form of multiple pterygium syndrome. Muscle-type acetylcholine receptor is the major antigen in the autoimmune disease myasthenia gravis.[provided by RefSeq, Sep 2009]"

sum_11449 <- entrez_summary(db = "gene", id = 11449)
sum_11449$name
[1] "Chrng"  # Successfully retrieved the correct gene.

sum_11449$summary
[1] "" # No information.

Below are the Entrez links and Summary for these genes.
The Summary for ID 1146 matches the text retrieved above. However, although ID 11449 has a Summary, as shown below, it was missing from the entrez_summary result.

I would appreciate any help understanding why this happens and how to fix it.

id: 1146 Summary: The mammalian muscle-type acetylcholine receptor is a transmembrane pentameric glycoprotein with two alpha subunits, one beta, one delta, and one epsilon (in adult skeletal muscle) or gamma (in fetal and denervated muscle) subunit. This gene, which encodes the gamma subunit, is expressed prior to the thirty-third week of gestation in humans. The gamma subunit of the acetylcholine receptor plays a role in neuromuscular organogenesis and ligand binding and disruption of gamma subunit expression prevents the correct localization of the receptor in cell membranes. Mutations in this gene cause Escobar syndrome and a lethal form of multiple pterygium syndrome. Muscle-type acetylcholine receptor is the major antigen in the autoimmune disease myasthenia gravis.[provided by RefSeq, Sep 2009]

id: 11449 Summary: Enables acetylcholine-gated cation-selective channel activity. Acts upstream of or within regulation of membrane potential. Located in postsynaptic membrane. Part of acetylcholine-gated channel complex. Is expressed in several structures, including diaphragm; embryo mesenchyme; limb bud; skeletal musculature; and tongue. Human ortholog(s) of this gene implicated in multiple pterygium syndrome. Orthologous to human CHRNG (cholinergic receptor nicotinic gamma subunit). [provided by Alliance of Genome Resources, Nov 2021]

packageVersion('rentrez')
[1] ‘1.2.3’

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rentrez_1.2.3

loaded via a namespace (and not attached):
[1] httr_1.4.2     compiler_4.1.1 R6_2.5.1       tools_4.1.1    curl_4.3.2     jsonlite_1.8.0 XML_3.99-0.9  

str(sum_1146)
List of 21
 $ uid               : chr "1146"
 $ name              : chr "CHRNG"
 $ description       : chr "cholinergic receptor nicotinic gamma subunit"
 $ status            : chr ""
 $ currentid         : chr ""
 $ chromosome        : chr "2"
 $ geneticsource     : chr "genomic"
 $ maplocation       : chr "2q37.1"
 $ otheraliases      : chr "ACHRG"
 $ otherdesignations : chr "acetylcholine receptor subunit gamma|acetylcholine receptor, muscle, gamma subunit|acetylcholine receptor, nico"| __truncated__
 $ nomenclaturesymbol: chr "CHRNG"
 $ nomenclaturename  : chr "cholinergic receptor nicotinic gamma subunit"
 $ nomenclaturestatus: chr "Official"
 $ mim               : chr "100730"
 $ genomicinfo       :'data.frame': 1 obs. of  5 variables:
  ..$ chrloc   : chr "2"
  ..$ chraccver: chr "NC_000002.12"
  ..$ chrstart : int 232539691
  ..$ chrstop  : int 232548114
  ..$ exoncount: int 12
 $ geneweight        : int 2824
 $ summary           : chr "The mammalian muscle-type acetylcholine receptor is a transmembrane pentameric glycoprotein with two alpha subu"| __truncated__
 $ chrsort           : chr "02"
 $ chrstart          : int 232539691
 $ organism          :List of 3
  ..$ scientificname: chr "Homo sapiens"
  ..$ commonname    : chr "human"
  ..$ taxid         : int 9606
 $ locationhist      :'data.frame': 15 obs. of  5 variables:
  ..$ annotationrelease: chr [1:15] "109.20211119" "109.20210514" "109.20210226" "109.20201120" ...
  ..$ assemblyaccver   : chr [1:15] "GCF_000001405.39" "GCF_000001405.39" "GCF_000001405.39" "GCF_000001405.39" ...
  ..$ chraccver        : chr [1:15] "NC_000002.12" "NC_000002.12" "NC_000002.12" "NC_000002.12" ...
  ..$ chrstart         : int [1:15] 232539691 232539691 232539691 232539691 232539691 232539691 232539691 232539691 232539691 232539691 ...
  ..$ chrstop          : int [1:15] 232548114 232548114 232548114 232548114 232548114 232548114 232548114 232548114 232548114 232548114 ...
 - attr(*, "class")= chr [1:2] "esummary" "list"

str(sum_11449)
List of 21
 $ uid               : chr "11449"
 $ name              : chr "Chrng"
 $ description       : chr "cholinergic receptor, nicotinic, gamma polypeptide"
 $ status            : chr ""
 $ currentid         : chr ""
 $ chromosome        : chr "1"
 $ geneticsource     : chr "genomic"
 $ maplocation       : chr "1 44.07 cM"
 $ otheraliases      : chr "Achr-3, Acrg"
 $ otherdesignations : chr "acetylcholine receptor subunit gamma|nicotinic acetylcholine receptor gamma subunit"
 $ nomenclaturesymbol: chr "Chrng"
 $ nomenclaturename  : chr "cholinergic receptor, nicotinic, gamma polypeptide"
 $ nomenclaturestatus: chr "Official"
 $ mim               : list()
 $ genomicinfo       :'data.frame': 1 obs. of  5 variables:
  ..$ chrloc   : chr "1"
  ..$ chraccver: chr "NC_000067.7"
  ..$ chrstart : int 87133532
  ..$ chrstop  : int 87139556
  ..$ exoncount: int 12
 $ geneweight        : int 5805
 $ summary           : chr ""
 $ chrsort           : chr "01"
 $ chrstart          : int 87133532
 $ organism          :List of 3
  ..$ scientificname: chr "Mus musculus"
  ..$ commonname    : chr "house mouse"
  ..$ taxid         : int 10090
 $ locationhist      :'data.frame': 5 obs. of  5 variables:
  ..$ annotationrelease: chr [1:5] "109" "108.20200622" "108" "37.2" ...
  ..$ assemblyaccver   : chr [1:5] "GCF_000001635.27" "GCF_000001635.26" "GCF_000001635.26" "GCF_000001635.18" ...
  ..$ chraccver        : chr [1:5] "NC_000067.7" "NC_000067.6" "NC_000067.6" "NC_000067.5" ...
  ..$ chrstart         : int [1:5] 87133532 87205810 87205810 89102385 90178868
  ..$ chrstop          : int [1:5] 87139556 87211834 87211834 89108409 90184964
 - attr(*, "class")= chr [1:2] "esummary" "list"

allenbaron commented 2 years ago

This does not appear to be an issue with rentrez. The same results are returned by esummary, the UNIX command line program that is part of Entrez Direct (esummary -db gene -id 11449 does not return the summary). You'll have to reach out to the developers of the E-utils API or maintainers of the gene database to get this fixed.
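
One way to check this from within R, without Entrez Direct, is to request the esummary endpoint directly and look at the raw `summary` field. The following is a minimal sketch, not part of rentrez: `esummary_url` and `raw_esummary_field` are hypothetical helper names, and it assumes the jsonlite package is installed and that the JSON response keys each record by its uid under `result`.

```r
# Hypothetical helper: build the E-utilities esummary URL for one record.
esummary_url <- function(db, id) {
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
    "?db=", db, "&id=", id, "&version=2.0&retmode=json"
  )
}

# Hypothetical helper: return the `summary` field exactly as NCBI sends it.
# Assumes the response has the shape result -> "<uid>" -> summary.
raw_esummary_field <- function(db, id) {
  res <- jsonlite::fromJSON(esummary_url(db, id))
  res$result[[as.character(id)]]$summary
}

# Example call (needs network access):
# raw_esummary_field("gene", 11449)
```

If the field is empty in the raw response as well, rentrez is only passing through what the API returns.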

tdyoshida commented 2 years ago

Thank you so much for your quick response!

I see your point. I confirmed that E-utilities itself does not return the Summary field for some (but not all) genes. I have contacted the E-utilities development team. Hopefully they can solve the issue.

tdyoshida commented 2 years ago

I have received a response from the developer, and I'm updating the issue in case it is useful to somebody.

Comment from developer

esummary is not supposed to provide the "Entrezgene_summary" field for all records (some do, but it is not clear why). However, efetch will return the summary field: efetch -db gene -id 11449 -format xml

Based on the comment above, I was able to retrieve the Summary info using entrez_fetch as follows.

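library(rentrez)

# Fetch the full gene record as parsed XML, then convert it to a nested list.
# (No pipe here: %>% would need magrittr, which is not attached in this session.)
ef_11449 <- entrez_fetch(db = "gene", id = 11449, rettype = "xml", parsed = TRUE)
ef_11449 <- XML::xmlToList(ef_11449)

ef_11449$Entrezgene$Entrezgene_summary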
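
The workaround above can be wrapped so that entrez_fetch is only used when entrez_summary returns an empty Summary. This is a minimal sketch under stated assumptions, not part of rentrez: `extract_gene_summary` and `gene_summary` are hypothetical helper names, and it assumes the fetched XML contains at most one `Entrezgene_summary` node per gene, as in the record above.

```r
# Hypothetical helper: pull the Entrezgene_summary text out of a parsed
# Entrez Gene XML document. Returns NA if the node is absent.
extract_gene_summary <- function(doc) {
  hits <- XML::xpathSApply(doc, "//Entrezgene_summary", XML::xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Hypothetical helper: try entrez_summary first, fall back to entrez_fetch
# when the summary field is empty.
gene_summary <- function(id) {
  s <- rentrez::entrez_summary(db = "gene", id = id)$summary
  if (is.character(s) && length(s) == 1 && nzchar(s)) {
    return(s)
  }
  doc <- rentrez::entrez_fetch(db = "gene", id = id,
                               rettype = "xml", parsed = TRUE)
  extract_gene_summary(doc)
}

# Example calls (need network access):
# gene_summary(1146)
# gene_summary(11449)
```

The fallback costs a second request and downloads the full gene record, so it is only triggered for records where the lighter esummary call comes back empty.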