Open alecgibson opened 4 years ago
@jkr does this suggest anything?
@alecgibson, would you upload a test file which we can use to analyze the problem?
@tarleb there's one in the original post; does that not work?
Argh, I'm just blind. Thanks!
I had a look into this. The heap profile for a 250kb file is dominated by `(,)` and `:` constructors. As far as I could tell, the whole input stream was being retained because the attributes field was never forced. I made some strictness changes to the `xml` library, which flatten the profile after the initial parsing (since the whole input stream is no longer retained) and halve the maximum residency.
https://github.com/mpickering/xml/commit/ba346a81def0b57a1551ac19420fa8f10b2421a1
These are all the changes I made; I wasn't scientific about which ones are actually necessary.
I'm not sure what's best to do here; I wouldn't personally want to rely on the `xml` library.
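The kind of retention problem described above can be sketched with toy types (hypothetical; not the actual `xml` library code): a lazy record field holds an unevaluated thunk that closes over the entire input string, while a bang-annotated field starts the parse eagerly at construction time.

```haskell
-- Toy sketch (not the actual xml library code) of the retention bug:
-- an unevaluated attribute thunk closes over the entire input string,
-- so the input can't be garbage-collected until something demands the
-- attributes.

-- Lazy field: `lazyAttrs` may be a thunk retaining the whole input.
newtype LazyElem = LazyElem { lazyAttrs :: [(String, String)] }

-- Strict field: the attribute list is forced (to WHNF) when the
-- element is constructed; the real fix adds strictness in several
-- places so the parsed values stop hanging on to the input.
data StrictElem = StrictElem { strictAttrs :: ![(String, String)] }

-- A toy "attribute parser" that only needs a prefix of the input but
-- whose unevaluated thunk would retain all of it.
parseAttrs :: String -> [(String, String)]
parseAttrs input = [("name", takeWhile (/= ' ') input)]

main :: IO ()
main = do
  let bigInput = "w:p " ++ replicate 100000 'x'  -- stand-in for a large file
  -- Both produce the same attributes; the difference is *when* the
  -- parse happens and what stays reachable in the meantime.
  print (lazyAttrs (LazyElem (parseAttrs bigInput)))
  print (strictAttrs (StrictElem (parseAttrs bigInput)))
```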
Nice work @mpickering! Well, pandoc uses that lib in quite a few places... so I guess it's either:

1. submit the strictness patches upstream,
2. replace `String` with `Text` in the `xml` library, or
3. write an API-compatible wrapper around another XML library.

The nice thing about `xml` is that it's quite minimal, so doing 1. or 2. potentially sounds like less hassle?
Thanks @mpickering, this is great! On writing an API-compatible wrapper around another xml lib... do you know of a good and small one?
The only real contenders for parsing seem to be `xml-conduit` and `tagsoup`, but we're also using `xml` for writing XML, so replacing it is difficult.
Anyhow, could tackling this make a good Summer of Code student project for next year?
Perhaps `xeno` is another option?
`citeproc` uses `xml-conduit`, and `pandoc` depends on `citeproc`, so we could use `xml-conduit` instead of `xml` without incurring any more dependencies. And it has a renderer.
But improving the `xml` library by submitting patches upstream seems a good idea to me in any case. It is also used by texmath and 110 other packages:
https://packdeps.haskellers.com/reverse/xml
So improving it could really help the whole ecosystem (assuming this is not one of those cases where the extra strictness sometimes helps and sometimes hurts...)
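A toy illustration of how strictness can cut both ways (hypothetical examples, not from the `xml` library): a strict fold avoids building a chain of thunks, but a strict field pays for a computation even when nobody ever demands it.

```haskell
import Data.List (foldl')

-- When strictness helps: a strict left fold keeps the accumulator
-- evaluated, so summing runs in constant space instead of building
-- a 100000-thunk chain.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0

-- When strictness hurts: a strict field forces its value at
-- construction time even if no consumer ever looks at it.
data Cached = Cached { cheap :: Int, expensive :: !Int }

mkCached :: Int -> Cached
mkCached n = Cached n (sum [1 .. n])  -- `expensive` is computed eagerly

main :: IO ()
main = do
  print (sumStrict [1 .. 100000])  -- 5000050000
  print (cheap (mkCached 10))      -- 10, but `expensive` was forced anyway
```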
In my case, this blocks text extraction from docx. Related:

- https://github.com/Microsoft/Simplify-Docx
- https://github.com/mwilliamson/mammoth.js (does merge adjacent text nodes)
- https://stackoverflow.com/questions/7752932/simplify-clean-up-xml-of-a-docx-word-document
Note that recent versions of pandoc use a different xml parsing library than the one that was used in 2.7 (the version originally tested in the above report). I would expect performance would be much better.
OK, just tested with pandoc 2.14.0.1.
```
  34,687,033,432 bytes allocated in the heap
   5,041,403,792 bytes copied during GC
     889,977,368 bytes maximum residency (13 sample(s))
       5,154,280 bytes maximum slop
            1999 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4119 colls,     0 par    2.698s   2.731s     0.0007s    0.0048s
  Gen  1        13 colls,     0 par    1.784s   2.281s     0.1755s    0.8247s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    8.698s  (  8.759s elapsed)
  GC      time    4.483s  (  5.012s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   13.182s  ( 13.786s elapsed)

  Alloc rate    3,987,722,584 bytes per MUT second

  Productivity  66.0% of total user, 63.5% of total elapsed
```
Some improvement here but not enough.
I tried adding `StrictData` to `T.P.XML.Light.Types`. This did not affect things much; it actually made them a bit worse.
Here's a heap profile with 2.14.0.1.
Actually this does look like quite an improvement over the original heap profile.
A sample of the intermediate representation created by the docx reader before the AST is constructed:
```haskell
PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun "foo"]),
PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun " "])
```
and so on. One thing we could try would be doing a fusion operation on this representation (the `Document` structure produced by `archiveToDocument`), before it is converted to a `Pandoc`. I don't know if this would help.

Probably a better approach would be to do the fusion in the process of parsing a `Document`.
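Such a fusion pass might look like this minimal sketch (hypothetical simplified `Run` type; pandoc's actual docx reader types carry much more style information):

```haskell
-- Hypothetical, simplified run type; pandoc's real docx reader types
-- (Run, RunStyle, PlainRun) are considerably richer.
data Run = Run { runStyle :: String, runText :: String }
  deriving (Eq, Show)

-- Merge adjacent runs that share a style into a single run, so the
-- intermediate representation stops repeating an identical style
-- record for every word.
fuseRuns :: [Run] -> [Run]
fuseRuns (a : b : rest)
  | runStyle a == runStyle b =
      fuseRuns (Run (runStyle a) (runText a ++ runText b) : rest)
fuseRuns (a : rest) = a : fuseRuns rest
fuseRuns [] = []
```

For example, `fuseRuns [Run "plain" "foo", Run "plain" " ", Run "plain" "bar"]` collapses to the single run `Run "plain" "foo bar"`, mirroring what merging the repeated `PlainRun`s in the sample above would do.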
I tried fusing the `PlainRun`s at the paragraph parsing phase; no help. I think that, as before, the problem is occurring in the XML parser.
It's possible to have some docx files with repeated, redundant styling applied on every word, like so:

When running these files through Pandoc, it consumes a vast amount of memory (>2GB when processing an 80k-word document).
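As a rough illustration (hypothetical, abbreviated WordprocessingML; not the actual attached file), this kind of document repeats an identical run-properties block for every single word:

```xml
<w:p>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Calibri"/><w:sz w:val="22"/></w:rPr>
    <w:t>foo</w:t>
  </w:r>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Calibri"/><w:sz w:val="22"/></w:rPr>
    <w:t xml:space="preserve"> </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Calibri"/><w:sz w:val="22"/></w:rPr>
    <w:t>bar</w:t>
  </w:r>
</w:p>
```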
In contrast, if we copy-paste the contents of this file into a "fresh" document in MS Word and save, running the new document through Pandoc only consumes ~100MB memory.
Is there any way for Pandoc to be a bit "smarter" when building its AST to find these repeated nodes, and merge them in order to reduce the memory footprint?
I realise that the workaround is trivial, but we're trying to deal with arbitrary user input (always exciting), and technically this is a valid way of representing a document (if also a bit stupid), and it would be great if Pandoc could cope with this in a sensible way.
Pandoc version
Console output
Heap Profile
Test document
example.docx