gschofl / reutils

Talk to the NCBI EUtils

efetch: XML parse error with db = "taxonomy" #13

Open sn0001 opened 4 years ago

sn0001 commented 4 years ago

On Ubuntu 20.04, efetch() fails with the error "XML parse error: StartTag: invalid element name" for certain taxon IDs, e.g. efetch(174789, db = "taxonomy", retmode = "xml").

Session Info:

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.3/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] stats4    parallel  tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Hmisc_4.2-0                 ggplot2_3.1.1               Formula_1.2-3               survival_2.43-3            
 [5] lattice_0.20-38             doParallel_1.0.14           iterators_1.0.11            foreach_1.5.1              
 [9] batchtools_0.9.11           zoo_1.8-5                   XML_3.98-1.19               DECIPHER_2.14.0            
[13] RSQLite_2.1.1               ShortRead_1.40.0            GenomicAlignments_1.18.1    SummarizedExperiment_1.12.0
[17] DelayedArray_0.8.0          matrixStats_0.54.0          Biobase_2.42.0              Rsamtools_1.34.1           
[21] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2         Biostrings_2.50.2           XVector_0.22.0             
[25] IRanges_2.16.0              S4Vectors_0.20.1            BiocParallel_1.16.6         BiocGenerics_0.28.0        
[29] fs_1.2.7                    data.table_1.12.2           assertthat_0.2.1            rjson_0.2.20               
[33] magrittr_1.5                reutils_0.2.3               optparse_1.6.2              RevoUtils_11.0.3           
[37] RevoUtilsMath_11.0.0       

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2     progress_1.2.0         backports_1.1.4       
 [6] R6_2.3.0               rpart_4.1-13           DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
[11] nnet_7.3-12            withr_2.1.2            gridExtra_2.3          tidyselect_0.2.5       prettyunits_1.0.2     
[16] bit_1.1-14             compiler_3.5.3         htmlTable_1.13.1       pacman_0.5.1           scales_1.0.0          
[21] checkmate_1.9.1        rappdirs_0.3.1         stringr_1.4.0          digest_0.6.18          foreign_0.8-71        
[26] htmltools_0.3.6        base64enc_0.1-3        pkgconfig_2.0.2        htmlwidgets_1.3        rlang_0.3.4           
[31] rstudioapi_0.10        shiny_1.3.1            hwriter_1.3.2          acepack_1.4.1          dplyr_0.8.0.1         
[36] RCurl_1.95-4.12        GenomeInfoDbData_1.2.0 Matrix_1.2-15          Rcpp_1.0.1             munsell_0.5.0         
[41] stringi_1.4.3          debugme_1.1.0          zlibbioc_1.28.0        plyr_1.8.4             grid_3.5.3            
[46] blob_1.1.1             promises_1.0.1         crayon_1.3.4           splines_3.5.3          hms_0.4.2             
[51] knitr_1.22             pillar_1.3.1           base64url_1.4          codetools_0.2-16       glue_1.3.1            
[56] latticeExtra_0.6-28    httpuv_1.5.1           gtable_0.3.0           getopt_1.20.3          purrr_0.3.2           
[61] xfun_0.6               mime_0.6               xtable_1.8-3           later_0.8.0            tibble_2.1.1          
[66] memoise_1.1.0          cluster_2.0.7-1        brew_1.0-6            

gschofl commented 4 years ago

Can't reproduce it on my machine. Does the problem persist for you?

> x <- efetch(174789, db = "taxonomy", retmode = "xml")
> x
Object of class ‘efetch’ 
<?xml version="1.0"?>
<!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN" "https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
<TaxaSet>
  <Taxon>
    <TaxId>174789</TaxId>
    <ScientificName>Ecpleopus gaudichaudii</ScientificName>
    <OtherNames>
      <Name>
        <ClassCDE>authority</ClassCDE>
        <DispName>Ecpleopus gaudichaudii Dume&#x301;ril &amp; Bibron, 1839</DispName>
      </Name>
      <Name>
        <ClassCDE>type material</ClassCDE>
        <DispName>BMNH 1946.8.2.4</DispName>
      </Name>
      <Name>
        <ClassCDE>type material</ClassCDE>
        <DispName>BMNH:1946.8.2.4</DispName>
      </Name>
    </OtherNames>
    <ParentTaxId>174747</ParentTaxId>
    <Rank>species</Rank>
    <Division>Vertebrates</Division>
    <GeneticCode>
      <GCId>1</GCId>
      <GCName>Standard</GCName>
    </GeneticCode>
    <MitoGeneticCode>
      <MGCId>2</MGCId>
      <MGCName>Vertebrate Mitochondrial</MGCName>
    </MitoGeneticCode>
    <Lineage>cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Sauropsida; Sauria; Lepidosauria; Squamata; Bifurcata; Unidentata; Episquamata; Laterata; Teiioidea; Gymnophthalmidae; Cercosaurinae; Ecpleopus</Lineage>
    <LineageEx>
      <Taxon>
        <TaxId>131567</TaxId>
        <ScientificName>cellular organisms</ScientificName>
        <Rank>no rank</Rank>
      </Taxon>
      <Taxon>
        <TaxId>2759</TaxId>
        <ScientificName>Eukaryota</ScientificName>
        <Rank>superkingdom</Rank>
      </Taxon>
      <Taxon>
        <TaxId>33154</TaxId>
        <ScientificName>Opisthokonta</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>33208</TaxId>
        <ScientificName>Metazoa</ScientificName>
        <Rank>kingdom</Rank>
      </Taxon>
      <Taxon>
        <TaxId>6072</TaxId>
        <ScientificName>Eumetazoa</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>33213</TaxId>
        <ScientificName>Bilateria</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>33511</TaxId>
        <ScientificName>Deuterostomia</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>7711</TaxId>
        <ScientificName>Chordata</ScientificName>
        <Rank>phylum</Rank>
      </Taxon>
      <Taxon>
        <TaxId>89593</TaxId>
        <ScientificName>Craniata</ScientificName>
        <Rank>subphylum</Rank>
      </Taxon>
      <Taxon>
        <TaxId>7742</TaxId>
        <ScientificName>Vertebrata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>7776</TaxId>
        <ScientificName>Gnathostomata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>117570</TaxId>
        <ScientificName>Teleostomi</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>117571</TaxId>
        <ScientificName>Euteleostomi</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>8287</TaxId>
        <ScientificName>Sarcopterygii</ScientificName>
        <Rank>superclass</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1338369</TaxId>
        <ScientificName>Dipnotetrapodomorpha</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>32523</TaxId>
        <ScientificName>Tetrapoda</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>32524</TaxId>
        <ScientificName>Amniota</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>8457</TaxId>
        <ScientificName>Sauropsida</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>32561</TaxId>
        <ScientificName>Sauria</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>8504</TaxId>
        <ScientificName>Lepidosauria</ScientificName>
        <Rank>class</Rank>
      </Taxon>
      <Taxon>
        <TaxId>8509</TaxId>
        <ScientificName>Squamata</ScientificName>
        <Rank>order</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1329961</TaxId>
        <ScientificName>Bifurcata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1329950</TaxId>
        <ScientificName>Unidentata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1329912</TaxId>
        <ScientificName>Episquamata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>1329976</TaxId>
        <ScientificName>Laterata</ScientificName>
        <Rank>clade</Rank>
      </Taxon>
      <Taxon>
        <TaxId>35036</TaxId>
        <ScientificName>Teiioidea</ScientificName>
        <Rank>superfamily</Rank>
      </Taxon>
      <Taxon>
        <TaxId>88861</TaxId>
        <ScientificName>Gymnophthalmidae</ScientificName>
        <Rank>family</Rank>
      </Taxon>
      <Taxon>
        <TaxId>2293894</TaxId>
        <ScientificName>Cercosaurinae</ScientificName>
        <Rank>subfamily</Rank>
      </Taxon>
      <Taxon>
        <TaxId>174747</TaxId>
        <ScientificName>Ecpleopus</ScientificName>
        <Rank>genus</Rank>
      </Taxon>
    </LineageEx>
    <CreateDate>2001/10/22 13:38:00</CreateDate>
    <UpdateDate>2020/06/02 23:02:56</UpdateDate>
    <PubDate>2002/01/10 18:04:00</PubDate>
  </Taxon>
</TaxaSet>

EFetch query using the ‘taxonomy’ database.
Query url: ‘https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efetch&db=taxonomy&id=174789&retmode=xm...’
Retrieval type: ‘’, retrieval mode: ‘xml’

sn0001 commented 4 years ago

Thank you for your reply. Yes, the problem persists. Maybe you are using different package versions?

In addition, on Windows, I can no longer execute any efetch() query:

> reutils::efetch(174789, db = "taxonomy", retmode = "xml")
Warning: CurlError: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
Object of class ‘efetch’ 
[1] "CurlError: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version"
EFetch query using the ‘taxonomy’ database.
Query url: ‘https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efetch&db=taxonomy&id=174789&retmode=xml&rettype=&ret...’
Retrieval type: ‘’, retrieval mode: ‘xml’

GillesSanMartin commented 1 year ago

Same problem here with another taxon.

All three of the following approaches fail:

reutils::efetch(1965393, db = "taxonomy", retmode = "xml")

efetch_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&rettype=xml&id=1965393"
my_xml <- efetch_url |>  RCurl::getURL() |>  XML::xmlParse(asText=TRUE, encoding = "UTF-8")
my_xml <- efetch_url |>  RCurl::getURL() |>  xml2::read_xml()

Here are the error messages:

> reutils::efetch(1965393, db = "taxonomy", retmode = "xml")
Error:
    XML parse error: StartTag: invalid element name

Object of class ‘efetch’ 
[1] "XML parse error: StartTag: invalid element name\n"
EFetch query using the ‘taxonomy’ database.
Query url: ‘https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?=efetch&db=taxonomy&id=1965393&retmode=xml&ret...’
Retrieval type: ‘’, retrieval mode: ‘xml’

> efetch_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&rettype=xml&id=1965393"
> my_xml <- efetch_url |>  RCurl::getURL() |>  XML::xmlParse(asText=TRUE, encoding = "UTF-8")
StartTag: invalid element name
Error: 1: StartTag: invalid element name
> my_xml <- efetch_url |>  RCurl::getURL() |>  xml2::read_xml()
Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  StartTag: invalid element name [68]

Using httr::GET() instead, as proposed here, works fine:

# This works !
my_xml <- efetch_url |> httr::GET() |> httr::content("text") |> XML::xmlParse()
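
For anyone who wants to reuse the httr workaround, it could be wrapped roughly as below. This is only a sketch: fetch_taxonomy_xml() and scientific_name() are hypothetical helper names, not part of reutils, and it assumes the httr and xml2 packages are installed. It also does not explain why the RCurl route fails; it only takes the route that worked above, adding an HTTP status check and an explicit UTF-8 decode before parsing.

```r
# Sketch only: both helpers are hypothetical and not part of reutils.
# Requires the httr and xml2 packages (called via their namespaces).

efetch_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

fetch_taxonomy_xml <- function(taxid) {
  # Let httr build and escape the query string.
  resp <- httr::GET(
    efetch_base,
    query = list(db = "taxonomy", id = taxid, retmode = "xml")
  )
  # Turn HTTP errors into R errors instead of parsing an error page.
  httr::stop_for_status(resp)
  # Decode the body explicitly as UTF-8 before handing it to the parser.
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  xml2::read_xml(txt)
}

scientific_name <- function(doc) {
  # Scientific name of the top-level taxon (not the LineageEx entries).
  node <- xml2::xml_find_first(doc, "/TaxaSet/Taxon/ScientificName")
  xml2::xml_text(node)
}
```

With network access, scientific_name(fetch_taxonomy_xml(1965393)) should then return the name of the taxon that failed above.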