jgm / pandoc

Universal markup converter
https://pandoc.org

Large inline images use a lot of memory #10075

Open ebeigarts opened 3 months ago

ebeigarts commented 3 months ago

Here is an example HTML file (10 MB) with one embedded JPEG image (7.6 MB).

Memory usage:

Test examples:

pandoc --version
pandoc 3.3

pandoc +RTS -t -RTS -o test.md test.html
# <<ghc: 30556315048 bytes, 3666 GCs, 248986086/1293885312 avg/max bytes residency (13 samples), 2985M in use, 0.001 INIT (0.001 elapsed), 2.773 MUT (2.511 elapsed), 2.892 GC (3.461 elapsed) :ghc>>

pandoc +RTS -t -RTS -o test.docx test.md
# <<ghc: 105686485256 bytes, 12695 GCs, 434087025/1466902032 avg/max bytes residency (19 samples), 3435M in use, 0.002 INIT (0.002 elapsed), 10.349 MUT (10.101 elapsed), 7.732 GC (8.308 elapsed) :ghc>>

pandoc +RTS -t -RTS -o test.docx test.html
# <<ghc: 76105089872 bytes, 9099 GCs, 489199853/1886265928 avg/max bytes residency (20 samples), 4350M in use, 0.002 INIT (0.002 elapsed), 8.025 MUT (7.772 elapsed), 9.163 GC (10.023 elapsed) :ghc>>

OS: macOS 14.14.1, m3/arm

jgm commented 3 months ago

This does seem to have to do with inline (base64-encoded) images specifically. I tried the same file but with a linked image, and it only used 30 MB.
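For anyone who wants to repeat the inline-versus-linked comparison without the original attachment, here is a minimal sketch that generates both variants. The function names (inlineImageHtml, linkedImageHtml, page) are hypothetical, and the payload is filler made of 'A' characters rather than a real base64-encoded JPEG, so it may not exercise exactly the same code paths as the reporter's file.

```haskell
-- Sketch only: the payload is filler text, not a real JPEG.

-- | HTML page with one inline (data URI) image carrying n payload characters.
inlineImageHtml :: Int -> String
inlineImageHtml n =
  page ("data:image/jpeg;base64," ++ replicate n 'A')

-- | HTML page that links to an external image instead of embedding it.
linkedImageHtml :: FilePath -> String
linkedImageHtml path = page path

-- | Wrap an image source in a minimal HTML document.
page :: String -> String
page src = unlines
  [ "<!DOCTYPE html>"
  , "<html><body>"
  , "<p><img src=\"" ++ src ++ "\" alt=\"test\"></p>"
  , "</body></html>"
  ]
```

In GHCi, something like writeFile "inline.html" (inlineImageHtml 10000000) and writeFile "linked.html" (linkedImageHtml "image.jpg") would produce two inputs that can each be passed to pandoc with +RTS -t -RTS to compare memory use.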

jgm commented 3 months ago

Odd that even html -> md takes a lot of memory, even though the URL should just be passed through unchanged.

jgm commented 3 months ago

If we do html -> json and then json -> md, that takes 956M for the first step and 810M for the second. So, it's neither a reader nor a writer issue exclusively. json -> html takes 1768M. But json -> json is fast.

I'd need to do some profiling to track this down further.

jgm commented 3 months ago

Profiling, first three entries for html -> md:

COST CENTRE             MODULE                           SRC                                                       %time %alloc

parseURI                Network.URI                      Network/URI.hs:301:1-26                                    17.4   21.5
escapeURI               Text.Pandoc.URI                  src/Text/Pandoc/URI.hs:(31,1)-(32,65)                      17.2   21.0
parseURIReference       Network.URI                      Network/URI.hs:308:1-44                                    16.1   21.5

jgm commented 3 months ago

parseURI gets called in both the reader and the writer, it seems: in the reader as part of canonicalizeUrl, and in the writer as part of isURI (which is used to check whether an "auto-link" should be used).

It seems that parseURI may be slow and could perhaps be optimized (it's in Network.URI so not part of pandoc).

We could also think about whether we could avoid calling it.
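One way to avoid the call, sketched under assumptions: check for the data: scheme with a cheap prefix test and only fall back to the full parser for everything else. The names hasDataScheme and isURIGuarded are hypothetical and are not pandoc functions, and treating every data: string as a URI without validating it is an assumption that may not match what isURI is supposed to guarantee.

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf)

-- | True when the string starts with the "data:" scheme (case-insensitive).
-- Only the first five characters are inspected, so the cost does not
-- depend on the size of the embedded payload.
hasDataScheme :: String -> Bool
hasDataScheme s = "data:" `isPrefixOf` map toLower (take 5 s)

-- | Hypothetical wrapper around an expensive URI check (passed in as
-- fullCheck). Assumption: any string with a data: scheme counts as a URI,
-- so the full parser is skipped for it.
isURIGuarded :: (String -> Bool) -> String -> Bool
isURIGuarded fullCheck s
  | hasDataScheme s = True
  | otherwise       = fullCheck s
```

The expensive check is a parameter here only to keep the sketch free of dependencies; in practice it would be whatever parseURI-based test is currently in use.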

jgm commented 3 months ago

Yeah, here's the problem (network-uri): Network/URI.hs

The URI parser parses multiple segments and concatenates them. (Each segment is basically a path component starting with /). But look at the segment parser:

segment :: URIParser String
segment =
    do  { ps <- many pchar
        ; return $ concat ps
        }

This parses many small strings, one for each pchar (usually just one character!), and then concatenates them. I think that allocating thousands or millions of small strings and concatenating them is causing the memory blowup.

This should be pretty easy to optimize. I'm surprised nobody has run into this before, as this is a widely used package!
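To illustrate the idea, here is a minimal, dependency-free sketch that contrasts the two strategies on plain strings. It is not the network-uri code and not the actual patch: segmentNaive, segmentRuns, pcharNaive and isPcharUnescaped are hypothetical names, and the character classes are a simplified reading of the RFC 3986 pchar rule. Each function returns the parsed segment and the remaining input.

```haskell
import Data.Char (isAlphaNum, isAscii, isHexDigit)

-- | Characters a path segment may contain without percent-escaping
-- (unreserved, sub-delims, ':' and '@').
isPcharUnescaped :: Char -> Bool
isPcharUnescaped c =
  (isAscii c && isAlphaNum c) || c `elem` "-._~!$&'()*+,;=:@"

-- | One pchar as its own small String: a single character or a
-- three-character percent-escape.
pcharNaive :: String -> Maybe (String, String)
pcharNaive ('%':a:b:rest)
  | isHexDigit a && isHexDigit b = Just (['%', a, b], rest)
pcharNaive (c:rest)
  | isPcharUnescaped c = Just ([c], rest)
pcharNaive _ = Nothing

-- | Naive strategy: collect one small String per pchar, then concat.
segmentNaive :: String -> (String, String)
segmentNaive = go []
  where
    go acc s = case pcharNaive s of
      Just (piece, rest) -> go (piece : acc) rest
      Nothing            -> (concat (reverse acc), s)

-- | Run-based strategy: take maximal runs of plain characters with span
-- and only treat percent-escapes specially, so no list of
-- one-character strings is built.
segmentRuns :: String -> (String, String)
segmentRuns s =
  case span isPcharUnescaped s of
    (run, '%':a:b:rest)
      | isHexDigit a && isHexDigit b ->
          let (more, rest') = segmentRuns rest
          in  (run ++ '%' : a : b : more, rest')
    (run, rest) -> (run, rest)
```

Both functions accept the same input and stop at the first character that cannot be part of a segment, such as '/'. The difference is only in how many intermediate strings are created along the way.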

For reference, here is pchar, along with the related segment parsers and the uchar helper it is built on:

segmentNz :: URIParser String
segmentNz =
    do  { ps <- many1 pchar
        ; return $ concat ps
        }

segmentNzc :: URIParser String
segmentNzc =
    do  { ps <- many1 (uchar "@")
        ; return $ concat ps
        }

pchar :: URIParser String
pchar = uchar ":@"

-- helper function for pchar and friends
uchar :: String -> URIParser String
uchar extras =
        unreservedChar
    <|> escaped
    <|> subDelims
    <|> do { c <- oneOf extras ; return [c] }

jgm commented 2 months ago

I've made the patch to parseURI, so you'll notice a difference once a new version of network-uri is released; but it's not going to make a HUGE difference, because that function is still fairly inefficient.

We could think about trying a different URI library, maybe uri-bytestring.

ebeigarts commented 2 months ago

Thanks @jgm, really nice explanation