Open zeehio opened 8 months ago
When I load a fairly large XML file (~500 MB) and `print()` the document, printing takes a long time and cannot be interrupted. Printing the child nodes individually, however, is fast.

I believe the `print()` call in the reprex below eventually reaches `show_nodes`, which calls `as.character` here; that call takes a long time and blocks the interpreter:

https://github.com/r-lib/xml2/blob/ab73051bca962d3cbcf8f4046eb2758be74a489c/R/xml_nodeset.R#L73
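To illustrate the suspected cause without downloading 490 MB, here is a minimal sketch on a synthetic document. The element names (`root`, `big`, `item`) and `n_items` are made up for illustration and are not from Cellosaurus. My assumption, from reading the xml2 source, is that `format()` on a node builds only its opening tag, while `as.character()` serialises the node together with all of its descendants, so its cost grows with the size of the subtree.

```r
library(xml2)

# Synthetic stand-in for a large file: a root with one child that holds
# many small elements (names and size are arbitrary).
n_items <- 50000
doc <- read_xml(paste0(
  "<root><big>",
  paste(rep("<item a='1'>text</item>", n_items), collapse = ""),
  "</big></root>"
))
big <- xml_child(doc, 1)

# Opening tag only: cost should not depend on the number of descendants.
tag_only <- format(big)

# Full serialisation of the subtree: cost grows with n_items.
full_text <- as.character(big)

# Compare the amount of text each call produced and how long each took.
cat("format():       ", nchar(tag_only), "characters\n")
cat("as.character(): ", nchar(full_text), "characters\n")
print(system.time(format(big)))
print(system.time(as.character(big)))
```

If this reading is right, a top-level child such as `<cell-line-list>` is serialised in full just to display the first few dozen characters of it.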
    library(xml2)
    # Download 490 MB:
    if (!file.exists("cellosaurus.xml"))
      download.file("https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml", "cellosaurus.xml")
    # Read XML:
    cellosaurus_xml <- xml2::read_xml("cellosaurus.xml")
    # My print (a fast version, closer to what I would expect)
    cat(format(cellosaurus_xml))
    #> <Cellosaurus>
    children <- xml2:::xml_children(cellosaurus_xml)
    for (child in children) {
      cat(format(child), "\n")
      xml2:::show_nodes(xml2:::xml_children(child))
    }
    #> <header>
    #> [1] <terminology-name>Cellosaurus</terminology-name>
    #> [2] <description>Cellosaurus: a controlled vocabulary of cell lines</descript ...
    #> [3] <release version="48.0" updated="2024-01-30" nb-cell-lines="152231" nb-pu ...
    #> [4] <terminology-list>\n <terminology name="NCBI-Taxonomy" source="National ...
    #> <cell-line-list>
    #> [1] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
    #> [2] <cell-line category="Hybridoma" created="2021-09-23" last-updated="2024- ...
    #> [3] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
    #> [4] <cell-line category="Hybridoma" created="2017-08-22" last-updated="2023- ...
    #> [5] <cell-line category="Cancer cell line" created="2017-05-15" last-updated ...
    #> [6] <cell-line category="Hybridoma" created="2012-06-06" last-updated="2023- ...
    #> [7] <cell-line category="Hybridoma" created="2014-07-17" last-updated="2023- ...
    #> [8] <cell-line category="Hybridoma" created="2022-12-15" last-updated="2023- ...
    #> [9] <cell-line category="Transformed cell line" created="2012-10-22" last-up ...
    #> [10] <cell-line category="Hybridoma" created="2013-02-11" last-updated="2023- ...
    #> [11] <cell-line category="Cancer cell line" created="2018-05-14" last-updated ...
    #> [12] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
    #> [13] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
    #> [14] <cell-line category="Finite cell line" created="2013-11-05" last-updated ...
    #> [15] <cell-line category="Finite cell line" created="2012-04-04" last-updated ...
    #> [16] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
    #> [17] <cell-line category="Cancer cell line" created="2012-04-04" last-updated ...
    #> [18] <cell-line category="Spontaneously immortalized cell line" created="2019 ...
    #> [19] <cell-line category="Transformed cell line" created="2021-12-16" last-up ...
    #> [20] <cell-line category="Cancer cell line" created="2024-01-30" last-updated ...
    #> ...
    #> <publication-list>
    #> [1] <publication date="2005" type="article" journal-name="AAPS J." volume="7 ...
    #> [2] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
    #> [3] <publication date="2011" type="article" journal-name="AAPS J." volume="1 ...
    #> [4] <publication date="2016" type="article" journal-name="AAPS J." volume="1 ...
    #> [5] <publication date="2000" type="article" journal-name="AAPS PharmSci" vol ...
    #> [6] <publication date="2004" type="article" journal-name="AAPS PharmSci" vol ...
    #> [7] <publication date="2008" type="article" journal-name="ACS Chem. Biol." v ...
    #> [8] <publication date="2014" type="article" journal-name="ACS Chem. Biol." v ...
    #> [9] <publication date="2018" type="article" journal-name="ACS Infect. Dis." ...
    #> [10] <publication date="2023" type="article" journal-name="ACS Materials Au" ...
    #> [11] <publication date="2022" type="article" journal-name="ACS Omega" volume= ...
    #> [12] <publication date="2017" type="article" journal-name="ACS Synth. Biol." ...
    #> [13] <publication date="2001" type="article" journal-name="Acta Astronaut." v ...
    #> [14] <publication date="2013" type="article" journal-name="Acta Astronaut." v ...
    #> [15] <publication date="2005" type="article" journal-name="Acta Biochim. Biop ...
    #> [16] <publication date="2004" type="article" journal-name="Acta Biochim. Pol. ...
    #> [17] <publication date="1988" type="article" journal-name="Acta Biol. Hung." ...
    #> [18] <publication date="2015" type="article" journal-name="Acta Biol. Hung." ...
    #> [19] <publication date="2016" type="article" journal-name="Acta Crystallogr. ...
    #> [20] <publication date="2001" type="article" journal-name="Acta Cytol." volum ...
    #> ...
    #> <copyright>

    # This is extremely slow, and non-interruptible:
    # print(cellosaurus_xml)
Created on 2024-03-12 with reprex v2.1.0
Is this expected, or should `print()` scale better with larger XML files?