ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_fetch does not return XML when querying protein database #91

Closed grepinsight closed 7 years ago

grepinsight commented 7 years ago

Hi. Thanks for making a nice package to interact with NCBI. While playing with this package, I ran into the following problem: entrez_fetch does not seem to return XML when querying the protein database.

> protein_id <- entrez_search(db="protein", term="TLR3[All Fields] AND Human[Organism] AND refseq", retmax=150)$ids
> xmlrec <- entrez_fetch(db="protein", id=protein_id, rettype="xml", parsed=TRUE)
Error: XML content does not seem to be XML: 'Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    source {
      genome genomic ,
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag
              id 9606 } } ,
        syn {
          "humans" ,
          "man" } ,
        orgname {
          name
            binomial {
              genus "Homo" ,
              species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
 Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
 Catarrhini; Hominidae; Homo" ,
          gcode 1 ,
          mgcode 2 ,
          div "PRI" } } ,
      subtype {
        {
          subtype chromosome ,
          name "8" } ,
        {
          subtype map ,
          name "8q24.12" } } } ,
    pub {
      pub {
        pmid 26727116 ,
        article {
          title {
            name "Serum Autotaxin/ENPP2 co

This problem does not occur with the gene database:

gene_id <- entrez_search(db="gene", term="TLR3[All Fields] AND Human[Organism] AND refseq", retmax=150)$ids
xmlrec <- entrez_fetch(db="gene", id=gene_id, rettype="xml", parsed=TRUE)

dwinter commented 7 years ago

Hi @grepinsight,

I think this is a recent change from the NCBI, partly covered by this commit https://github.com/ropensci/rentrez/commit/5b2785a10aeb254495f3737bf9dcf195e446ff9d (not yet part of a CRAN release).

For now, you can install the GitHub version (with devtools::install_github) and use the "native" format for rettype:

xmlrec <- entrez_fetch(db="protein", id=protein_id, rettype="native", parsed=TRUE)

The full list of rettypes available for each database is here: https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
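
If it is unclear which format a given rettype/retmode combination returns, one option is to inspect the raw text before trying to parse it. The following is only a rough sketch in base R: looks_like_xml is a hypothetical helper (not part of rentrez), and the two sample strings are made-up stand-ins for an ASN.1 response like the one in the error above and for an XML response.

```r
# Hypothetical helper (not a rentrez function): guess whether a fetched
# record is XML by checking that its first non-whitespace character is "<".
looks_like_xml <- function(txt) {
  grepl("^[[:space:]]*<", txt[1])
}

# Made-up sample strings standing in for real NCBI responses.
asn1_text <- "Seq-entry ::= set {\n  level 1 ,\n  class nuc-prot }"
xml_text  <- "<?xml version=\"1.0\"?>\n<Bioseq-set></Bioseq-set>"

looks_like_xml(asn1_text)  # FALSE: ASN.1 text, as in the error message
looks_like_xml(xml_text)   # TRUE
```

With rentrez, this kind of check could be applied to the character string that entrez_fetch returns when parsed=FALSE, before deciding whether to hand it to an XML parser.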

(Leaving this open until I update documentation to note this)