ebeigarts opened 3 months ago
This does seem to have to do with inline (base64-encoded) images specifically. I tried the same file but with a linked image, and it only used 30 MB.
Odd that even `html -> md` takes a lot of memory, even though the URL should just be passed through unchanged.
If we do `html -> json` and then `json -> md`, that takes 956M for the first step and 810M for the second. So, it's neither a reader nor a writer issue exclusively. `json -> html` takes 1768M. But `json -> json` is fast.
I'd need to do some profiling to track this down further.
Profiling, first three entries for `html -> md`:

```
COST CENTRE        MODULE           SRC                                    %time %alloc
parseURI           Network.URI      Network/URI.hs:301:1-26                 17.4   21.5
escapeURI          Text.Pandoc.URI  src/Text/Pandoc/URI.hs:(31,1)-(32,65)   17.2   21.0
parseURIReference  Network.URI      Network/URI.hs:308:1-44                 16.1   21.5
```
`parseURI` gets called in both the reader and the writer, it seems. In the reader as part of `canonicalizeUrl`. In the writer as part of `isURI` (which is used to check whether an "auto-link" should be used).
It seems that `parseURI` may be slow and could perhaps be optimized (it's in `Network.URI`, so not part of pandoc). We could also think about whether we could avoid calling it.
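One way to avoid the call would be a cheap syntactic pre-check that only hands the string to the full parser when it could be a URI at all. A minimal sketch of the idea (the helper name `looksLikeURI` is hypothetical, not pandoc's code; it only inspects the scheme, per RFC 3986, rather than parsing the whole string):

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical cheap pre-check: a string can only be an absolute URI if it
-- starts with "scheme:", where scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ).
-- This looks at a handful of characters instead of parsing megabytes of path.
looksLikeURI :: String -> Bool
looksLikeURI s = case break (== ':') s of
  (scheme@(c : _), ':' : _) -> isAlpha' c && all isSchemeChar scheme
  _                         -> False
  where
    isAlpha' ch     = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
    isSchemeChar ch = isAlphaNum ch || ch `elem` "+-."
```

A writer-side guard along these lines could also special-case the `data:` prefix and skip parsing the base64 payload entirely, which is where the time goes for inline images.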
Yeah, here's the problem (`network-uri`): Network/URI.hs

The URI parser parses multiple `segment`s and concatenates them. (Each segment is basically a path component starting with `/`.) But look at the `segment` parser:
```haskell
segment :: URIParser String
segment =
    do  { ps <- many pchar
        ; return $ concat ps
        }
```
This parses many small strings, one for each `pchar` (usually just one character!) and then concatenates them. I think that allocating thousands or millions of small strings and concatenating them is causing the memory blowup.

This should be pretty easy to optimize. I'm surprised nobody has run into this before, as this is a widely used package!
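In miniature, the pattern is equivalent to the following (my own illustration, not code from `network-uri`): every character becomes its own one-element list before `concat` copies them all back together.

```haskell
-- For an n-character segment this allocates ~n one-element lists, plus the
-- list spine that `many` builds, all of it garbage as soon as `concat` has
-- copied the characters into the result string.
slowSegment :: String -> String
slowSegment = concat . map (: [])
```

For a `data:` URI, the multi-megabyte base64 payload sits in the path, so those transient per-character allocations dominate the heap.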
For reference, here are `pchar` and the related segment parsers:
```haskell
segmentNz :: URIParser String
segmentNz =
    do  { ps <- many1 pchar
        ; return $ concat ps
        }

segmentNzc :: URIParser String
segmentNzc =
    do  { ps <- many1 (uchar "@")
        ; return $ concat ps
        }

pchar :: URIParser String
pchar = uchar ":@"

-- helper function for pchar and friends
uchar :: String -> URIParser String
uchar extras =
        unreservedChar
    <|> escaped
    <|> subDelims
    <|> do { c <- oneOf extras ; return [c] }
```
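The shape of the fix is to consume the run of segment characters in one pass instead of building one string per character. A sketch of the idea using base's `Text.ParserCombinators.ReadP` (the real parser is written against a parsec-style `URIParser` and also handles percent-escapes, so this is illustrative only):

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP (ReadP, munch, readP_to_S)

-- Simplified pchar predicate: unreserved characters plus ':' and '@'.
-- (The real pchar also accepts percent-escapes and sub-delims.)
isPChar :: Char -> Bool
isPChar c = isAlphaNum c || c `elem` "-._~:@"

-- One pass over the segment: `munch` consumes the longest run of matching
-- characters and returns it as a single string, with no intermediate
-- one-character strings to concatenate.
segmentFast :: ReadP String
segmentFast = munch isPChar
```

For example, `readP_to_S segmentFast "abc:def/rest"` yields the single parse `[("abc:def","/rest")]`.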
I've made the patch to `parseURI`, so you'll notice a difference once a new version of network-uri is released; but it's not going to make a HUGE difference, because that function is still fairly inefficient.
We could think about trying a different URI library, maybe `uri-bytestring`.
Thanks @jgm, really nice explanation
Here is an example HTML file (10Mb) with one embedded JPEG image (7.6Mb).
Memory usage:
- `html` to `md` uses 2985M
- `md` to `docx` uses 3435M
- `html` to `docx` uses 4350M

Test examples:

OS: macOS 14.14.1, m3/arm