ocramz / xeno

Fast Haskell XML parser

Xeno.DOM: Heap exhausted on a 5.6M file #65

Open unhammer opened 1 year ago

unhammer commented 1 year ago

Running the attached longlines.xml.zip through xeno-dom exhausts heap memory. I just added the file to the list in SpeedBigFiles.hs as benchFile ["xeno-dom"] "6MB" "longlines.xml.bz2" and got:

benchmarking 6M/xeno-dom
xeno-speed-big-files-bench: Heap exhausted;
xeno-speed-big-files-bench: Current maximum heap size is 26843545600 bytes (25600 MB).

Strangely, even minor changes to the file (e.g. sed 's/x/xx/g', which increases the file size) let it through with about 800M maxresident (as reported by /usr/bin/time). Inserting a newline after each > also gives 800M maxresident, but the problem doesn't seem to be related to the long lines, as almost any change to the file helps.

(Yes, I should be using Xeno.SAX, but why does e.g. https://dumps.wikimedia.org/nowiki/20230520/nowiki-20230520-pages-articles-multistream-index.txt.bz2 at 11M go through fine with <400M maxresident while this one does not? Even with newlines removed, the wiki file works fine. This feels like a leak.)
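For anyone who wants to reproduce this outside the benchmark suite, here is a minimal sketch. It assumes the attachment has been decompressed to a plain XML file and that the xeno and bytestring packages are available; the file name Repro.hs is only a placeholder, and this is not how SpeedBigFiles.hs itself loads its inputs.

```haskell
module Main (main) where

import qualified Data.ByteString as BS
import System.Environment (getArgs)
import System.Exit (die)
import qualified Xeno.DOM as DOM

-- Read the whole file strictly and run it through the Xeno.DOM parser.
-- The input path is taken from the command line (an assumption of this
-- sketch; the benchmark suite reads .bz2 files instead).
main :: IO ()
main = do
  args <- getArgs
  path <- case args of
    [p] -> pure p
    _   -> die "usage: Repro FILE.xml"
  bytes <- BS.readFile path
  case DOM.parse bytes of
    Left err   -> die ("parse error: " ++ show err)
    -- Counting the root's children forces the parse result to be used.
    Right root -> print (length (DOM.children root))
```

Built with something like ghc -O2 -rtsopts Repro.hs, it can be run as ./Repro longlines.xml +RTS -s to get the GHC runtime's own memory summary, or under /usr/bin/time to compare maxresident with the numbers above.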

ocramz commented 1 year ago

@unhammer perhaps you could try this test with the latest master? See #63.

unhammer commented 1 year ago

The issue remains :(

ocramz commented 1 year ago

"Damn" ("fy fan"). OK, this requires some deeper thinking.

ocramz commented 1 year ago

@unhammer anyway, it's at least reassuring that the latest patch doesn't change the memory behavior of the library (kudos @mitchellwrosen)