jgm / pandoc

Universal markup converter
https://pandoc.org

Merge redundant docx nodes to reduce memory footprint #5854

Open alecgibson opened 4 years ago

alecgibson commented 4 years ago

It's possible to have some docx files with repeated, redundant styling applied on every word, like so:

<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>without</w:t>
</w:r>
<w:r w:rsidR="00C734B2" w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t xml:space="preserve"></w:t>
</w:r>
<w:r w:rsidRPr="00074A08">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/>
    <w:color w:val="000000"/>
  </w:rPr>
  <w:t>backup.</w:t>
</w:r>

When these files are run through Pandoc, it consumes a vast amount of memory (>2GB when processing an 80k-word document).

In contrast, if we copy-paste the contents of this file into a "fresh" document in MS Word and save, running the new document through Pandoc only consumes ~100MB memory.

Is there any way for Pandoc to be a bit "smarter" when building its AST to find these repeated nodes, and merge them in order to reduce the memory footprint?

I realise that the workaround is trivial, but we're trying to deal with arbitrary user input (always exciting). Technically this is a valid way of representing a document (if also a bit stupid), and it would be great if Pandoc could cope with it in a sensible way.
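For what it's worth, the merge being asked for is conceptually simple at the level of a run list. Here is a minimal sketch with made-up types (`Run`, `runStyle`, `fuseRuns` are illustrative names, not pandoc's actual types or API):

```haskell
-- Hypothetical sketch: collapse adjacent runs whose formatting compares
-- equal, concatenating their text. 'Run style text' stands in for a docx
-- run together with its <w:rPr> properties.
data Run = Run { runStyle :: String, runText :: String }
  deriving (Eq, Show)

fuseRuns :: [Run] -> [Run]
fuseRuns = foldr step []
  where
    -- If the next run has the same style as the head of the accumulator,
    -- merge them into one run; otherwise keep it as a separate run.
    step r (r' : rs)
      | runStyle r == runStyle r' =
          Run (runStyle r) (runText r ++ runText r') : rs
    step r rs = r : rs

main :: IO ()
main = print (fuseRuns [Run "a" "with", Run "a" "out", Run "b" "!"])
```

The three redundant `<w:r>` elements in the snippet above would collapse to one run under such a pass, since their `<w:rPr>` contents are identical.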

Pandoc version

pandoc 2.7.2
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Console output

pandoc +RTS -s -h -RTS --from=docx --to=json --out=test.json example.docx
  23,178,301,504 bytes allocated in the heap
  11,219,862,864 bytes copied during GC
   2,350,551,128 bytes maximum residency (16 sample(s))
       5,922,728 bytes maximum slop
            2241 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22022 colls,     0 par    4.509s   4.563s     0.0002s    0.0013s
  Gen  1        16 colls,     0 par    6.978s   9.773s     0.6108s    3.2935s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0(0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    5.223s  (  5.938s elapsed)
  GC      time   11.487s  ( 14.336s elapsed)
  EXIT    time    0.000s  (  0.006s elapsed)
  Total   time   16.711s  ( 20.286s elapsed)

  Alloc rate    4,437,656,505 bytes per MUT second

  Productivity  31.3% of total user, 29.3% of total elapsed

Heap Profile

[screenshot: heap profile, 2019-10-25]

Test document

example.docx

jgm commented 4 years ago

@jkr does this suggest anything?

tarleb commented 3 years ago

@alecgibson, would you upload a test file which we can use to analyze the problem?

alecgibson commented 3 years ago

@tarleb there's one in the original post; does that not work?

tarleb commented 3 years ago

Argh, I'm just blind. Thanks!

mpickering commented 3 years ago

I had a look into this.

  1. The xml parsing library does not seem well optimised: a lot of thunks are left in the AST once the XML parsing has finished. I'm not sure whether there's a better (more optimised) alternative.
  2. Almost all of the allocation comes from XML parsing; I think it simply uses more memory because all the redundant styling makes the file much bigger. It's not clear, though, why it needs to allocate gigabytes of (,) and : constructors for a 250kb file.
  3. I added some more strictness to the xml library, which flattened the profile after the initial parsing because the whole input stream is no longer retained. As far as I could tell, the input stream was being retained because the attributes field was never forced.
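The retention mechanism described in point 3 can be illustrated with a toy example (these are made-up types, not the xml package's real ones): a lazy constructor field can remain an unevaluated thunk that closes over the whole input string, keeping it alive, whereas a strict (bang-annotated) field is forced as soon as the constructor itself is evaluated.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Either (isLeft, isRight)

-- Toy stand-ins: same shape, but one field is lazy and one is strict.
data LazyAttr   = LazyAttr   String
data StrictAttr = StrictAttr !String

main :: IO ()
main = do
  -- Forcing the lazy constructor to WHNF does not touch its field,
  -- so the bottom value (standing in for an unforced thunk) survives:
  lazyRes <- try (evaluate (LazyAttr undefined))
               :: IO (Either SomeException LazyAttr)
  -- Forcing the strict constructor forces the field too, hitting 'undefined':
  strictRes <- try (evaluate (StrictAttr undefined))
                 :: IO (Either SomeException StrictAttr)
  putStrLn ("lazy field left unevaluated: " ++ show (isRight lazyRes))
  putStrLn ("strict field forced: " ++ show (isLeft strictRes))
```

In the parser, the analogous unforced field would be a closure over the input stream, which is why forcing the attributes flattened the heap profile.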

profiles.zip

mpickering commented 3 years ago

I made some strictness changes to xml, which halve the maximum residency.

https://github.com/mpickering/xml/commit/ba346a81def0b57a1551ac19420fa8f10b2421a1

profiles2.zip

These are all the changes I made; I haven't been scientific about which of them are actually necessary.

I'm not sure what's best to do here, I wouldn't personally want to rely on the xml library.

mb21 commented 3 years ago

Nice work @mpickering! Well, pandoc uses that lib in quite a few places... so I guess it's either:

  1. improve performance in upstream (the lib) by introducing more strictness, and potentially also replacing String with Text
  2. do the above but fork it
  3. write an API-compatible wrapper around another xml lib... do you know of a good and small one?

The nice thing about xml is that it's quite minimal, so doing 1. or 2. potentially sounds like less hassle?

tarleb commented 3 years ago

Thanks @mpickering, this is great!

write an API-compatible wrapper around another xml lib... do you know of a good and small one?

The only real contenders for parsing seem to be xml-conduit and tagsoup, but we're also using xml for writing XML. So replacing it is difficult.

Anyhow, could tackling this make a good Summer of Code student project for next year?

mpickering commented 3 years ago

Perhaps xeno is another option?

jgm commented 3 years ago

citeproc uses xml-conduit. pandoc depends on citeproc. So we could use xml-conduit instead of xml without incurring any more dependencies. And it has a renderer.

jgm commented 3 years ago

But improving the xml library by submitting patches upstream seems a good idea to me in any case. It is also used by texmath and 110 other packages: https://packdeps.haskellers.com/reverse/xml So improving it could really help the whole ecosystem (assuming this is not one of those cases where the extra strictness sometimes helps and sometimes hurts...)

milahu commented 3 years ago

In my case, this blocks text extraction from docx.

Related:

  https://github.com/Microsoft/Simplify-Docx
  https://github.com/mwilliamson/mammoth.js (merges adjacent text nodes)
  https://stackoverflow.com/questions/7752932/simplify-clean-up-xml-of-a-docx-word-document

jgm commented 3 years ago

Note that recent versions of pandoc use a different xml parsing library than the one that was used in 2.7 (the version originally tested in the above report). I would expect performance to be much better.

jgm commented 3 years ago

OK, just tested with pandoc 2.14.0.1.

  34,687,033,432 bytes allocated in the heap
   5,041,403,792 bytes copied during GC
     889,977,368 bytes maximum residency (13 sample(s))
       5,154,280 bytes maximum slop
            1999 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4119 colls,     0 par    2.698s   2.731s     0.0007s    0.0048s
  Gen  1        13 colls,     0 par    1.784s   2.281s     0.1755s    0.8247s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.005s elapsed)
  MUT     time    8.698s  (  8.759s elapsed)
  GC      time    4.483s  (  5.012s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   13.182s  ( 13.786s elapsed)

  Alloc rate    3,987,722,584 bytes per MUT second

  Productivity  66.0% of total user, 63.5% of total elapsed

Some improvement here but not enough.

jgm commented 3 years ago

I tried adding StrictData to T.P.XML.Light.Types. This did not affect things much; it actually made them a bit worse.

jgm commented 3 years ago

Here's a heap profile with 2.14.0.1.

[screenshot: heap profile, 2021-06-04]

jgm commented 3 years ago

Actually this does look like quite an improvement over the original heap profile.

jgm commented 3 years ago

A sample of the intermediate representation created by the docx reader before the AST is constructed:

PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun "foo"]),PlainRun (Run (RunStyle {isBold = Nothing, isBoldCTL = Nothing, isItalic = Nothing, isItalicCTL = Nothing, isSmallCaps = Nothing, isStrike = Nothing, isRTL = Nothing, isForceCTL = Nothing, rVertAlign = Nothing, rUnderline = Nothing, rParentStyle = Nothing}) [TextRun " "])

and so on. One thing we could try would be a fusion operation on this representation (the Document structure produced by archiveToDocument) before it is converted to a Pandoc. I don't know if this would help. Probably a better approach would be to do the fusion in the process of parsing a Document.
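A sketch of what that fusion could look like, using simplified stand-ins for these intermediate types (the real RunStyle carries many more Maybe fields, as the sample above shows, and this is a guess at the approach rather than pandoc's actual code):

```haskell
-- Simplified stand-ins for the docx reader's intermediate representation.
newtype RunStyle = RunStyle { isBold :: Maybe Bool } deriving (Eq, Show)
data RunElem = TextRun String deriving (Eq, Show)
data ParPart = PlainRun RunStyle [RunElem] deriving (Eq, Show)

-- Fuse neighbouring PlainRuns whose styles compare equal, concatenating
-- their run elements. Doing this while parsing each paragraph, rather
-- than over the finished list, would avoid ever materialising the long
-- unfused list.
fusePlainRuns :: [ParPart] -> [ParPart]
fusePlainRuns (PlainRun s1 es1 : PlainRun s2 es2 : rest)
  | s1 == s2 = fusePlainRuns (PlainRun s1 (es1 ++ es2) : rest)
fusePlainRuns (p : rest) = p : fusePlainRuns rest
fusePlainRuns [] = []

main :: IO ()
main = print (fusePlainRuns
  [ PlainRun (RunStyle Nothing) [TextRun "foo"]
  , PlainRun (RunStyle Nothing) [TextRun " "] ])
```

Note the repeated (++) is quadratic in the worst case; a real implementation would accumulate with a difference list or Seq.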

jgm commented 3 years ago

I tried fusing the PlainRuns at the paragraph parsing phase; no help. I think that, as before, the problem is occurring in the XML parser.