r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

print(xml_document) does not scale with large documents #441

Open zeehio opened 6 months ago

zeehio commented 6 months ago

When I load a fairly large XML file (~500 MB) and print() the document, printing takes a long time and cannot be interrupted.

However, printing the child nodes individually is fast.

I believe print() eventually calls show_nodes(), which calls as.character() at the line linked below. That call takes a long time and blocks the interpreter.

https://github.com/r-lib/xml2/blob/ab73051bca962d3cbcf8f4046eb2758be74a489c/R/xml_nodeset.R#L73
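
To illustrate the suspected cost on a small scale, here is a sketch with a synthetic document (not the Cellosaurus file; the element names and the count of 10000 are made up for illustration). It assumes the hypothesis above is right: as.character() on a node serializes its whole subtree, even though a printed line keeps only about one console width of the result.

```r
library(xml2)

# Synthetic document: one child element holding many grandchildren.
# The names "root", "list" and "item" are illustrative only.
items <- paste0("<item id=\"", seq_len(10000), "\"/>", collapse = "")
doc <- read_xml(paste0("<root><list>", items, "</list></root>"))

child <- xml_child(doc, 1)

# as.character() serializes the complete subtree of the node ...
serialized <- as.character(child)
nchar(serialized)

# ... while a truncated display line needs only its first few characters.
substr(serialized, 1, getOption("width"))
```

With a child that holds most of a 490 MB file, the serialized string would be of roughly that size before any truncation happens.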

library(xml2)
# Download 490 MB:
if (!file.exists("cellosaurus.xml")) {
  download.file(
    "https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml",
    "cellosaurus.xml"
  )
}
# Read XML:
cellosaurus_xml <- xml2::read_xml("cellosaurus.xml")

# My print (a fast version, closer to what I would expect)

cat(format(cellosaurus_xml))
#> <Cellosaurus>
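# xml_children() is exported, so `::` is enough; show_nodes() is internal
children <- xml2::xml_children(cellosaurus_xml)
for (child in children) {
  # Print the opening tag of each top-level child ...
  cat(format(child), "\n")
  # ... followed by a truncated listing of its own children
  xml2:::show_nodes(xml2::xml_children(child))
}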
#> <header> 
#> [1] <terminology-name>Cellosaurus</terminology-name>
#> [2] <description>Cellosaurus: a controlled vocabulary of cell lines</descript ...
#> [3] <release version="48.0" updated="2024-01-30" nb-cell-lines="152231" nb-pu ...
#> [4] <terminology-list>\n  <terminology name="NCBI-Taxonomy" source="National  ...
#> <cell-line-list> 
#>  [1] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [2] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
#>  [3] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#>  [4] <cell-line category="Hybridoma" created="2017-08-22" last-updated="2023- ...
#>  [5] <cell-line category="Cancer cell line" created="2017-05-15" last-updated ...
#>  [6] <cell-line category="Hybridoma" created="2012-06-06" last-updated="2023- ...
#>  [7] <cell-line category="Hybridoma" created="2014-07-17" last-updated="2023- ...
#>  [8] <cell-line category="Hybridoma" created="2022-12-15" last-updated="2023- ...
#>  [9] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
#> [10] <cell-line category="Hybridoma" created="2013-02-11" last-updated="2023- ...
#> [11] <cell-line category="Cancer cell line" created="2018-05-14" last-updated ...
#> [12] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [13] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [14] <cell-line category="Finite cell line" created="2013-11-05" last-updated ...
#> [15] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
#> [16] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [17] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
#> [18] <cell-line category="Spontaneously immortalized cell line" created="2019 ...
#> [19] <cell-line category="Transformed cell line" created="2021-12-16" last-up ...
#> [20] <cell-line category="Cancer cell line" created="2024-01-30" last-updated ...
#> ...
#> <publication-list> 
#>  [1] <publication date="2005" type="article" journal-name="AAPS J." volume="7 ...
#>  [2] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [3] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
#>  [4] <publication date="2016" type="article" journal-name="AAPS J." volume="1 ...
#>  [5] <publication date="2000" type="article" journal-name="AAPS PharmSci" vol ...
#>  [6] <publication date="2004" type="article" journal-name="AAPS PharmSci" vol ...
#>  [7] <publication date="2008" type="article" journal-name="ACS Chem. Biol." v ...
#>  [8] <publication date="2014" type="article" journal-name="ACS Chem. Biol." v ...
#>  [9] <publication date="2018" type="article" journal-name="ACS Infect. Dis."  ...
#> [10] <publication date="2023" type="article" journal-name="ACS Materials Au"  ...
#> [11] <publication date="2022" type="article" journal-name="ACS Omega" volume= ...
#> [12] <publication date="2017" type="article" journal-name="ACS Synth. Biol."  ...
#> [13] <publication date="2001" type="article" journal-name="Acta Astronaut." v ...
#> [14] <publication date="2013" type="article" journal-name="Acta Astronaut." v ...
#> [15] <publication date="2005" type="article" journal-name="Acta Biochim. Biop ...
#> [16] <publication date="2004" type="article" journal-name="Acta Biochim. Pol. ...
#> [17] <publication date="1988" type="article" journal-name="Acta Biol. Hung."  ...
#> [18] <publication date="2015" type="article" journal-name="Acta Biol. Hung."  ...
#> [19] <publication date="2016" type="article" journal-name="Acta Crystallogr.  ...
#> [20] <publication date="2001" type="article" journal-name="Acta Cytol." volum ...
#> ...
#> <copyright>

# This is extremely slow, and non-interruptible:
# print(cellosaurus_xml)

Created on 2024-03-12 with reprex v2.1.0

Is this expected? Or should the print() function scale better with larger XML files?
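
In the meantime, a possible workaround is to inspect the document without serializing any subtree. The sketch below is only an assumption about what a cheap overview could look like: xml_overview() is a hypothetical helper, not part of xml2, built from the exported xml_children(), xml_name() and xml_length().

```r
library(xml2)

# Hypothetical helper (not part of xml2): list the top-level children of a
# document with their element names and child counts. It never calls
# as.character(), so no subtree is serialized.
xml_overview <- function(doc) {
  top <- xml_children(doc)
  data.frame(
    element = xml_name(top),       # element name of each top-level child
    n_children = xml_length(top),  # number of child elements of each
    stringsAsFactors = FALSE
  )
}

# Usage on a tiny in-memory document
small <- read_xml("<a><b><c/><c/></b><d/></a>")
overview <- xml_overview(small)
overview
```

On the Cellosaurus file this should return one row per top-level element (header, cell-line-list, publication-list, copyright), although I have not timed it on the full document.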