jgm / skylighting

A Haskell syntax highlighting library with tokenizers derived from KDE syntax highlighting descriptions

Store compiled regexes in RE #166

Closed by SquidDev 1 year ago

SquidDev commented 1 year ago

I'm currently working on a project which contains a large number (4k+) of Agda code blocks in markdown files. When debugging some performance issues we were having with our build process, I noticed that we were spending a large amount of time in Pandoc's HTML writer.

Using GHC's profiler, we can see that most of the time is spent compiling and parsing Skylighting's regexes.

A screenshot of a Speedscope profile. The majority of the time is spent in an `option` function (from Attoparsec), which is called from `compileRegex` and `tokenize` (from Skylighting) and in turn from Pandoc's `writeHTML5String`. The whole process takes 22s.

In order to avoid repeated re-compiles of regexes for each call to tokenize, this patch (lazily) compiles the regex when constructing a RE value. This ensures the compiled regex is shared across all usages of the syntax.
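The sharing idea can be sketched roughly as follows. This is a minimal illustration using only `base`, not the patch itself: `Compiled`, `compile`, `mkRE`, `reCompiled` and `matchRE` are hypothetical stand-ins, and a substring matcher takes the place of real regex compilation. The point is that the compiled form sits in a lazy field, so it is built at most once per `RE` value and then reused by every caller.

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)

-- Hypothetical stand-in for an expensive compiled regex.
newtype Compiled = Compiled (String -> Bool)

-- Pretend "compilation": builds a substring matcher (assumption, not a real regex engine).
compile :: Bool -> String -> Compiled
compile caseSensitive pat
  | caseSensitive = Compiled (pat `isInfixOf`)
  | otherwise     = Compiled (\s -> map toLower pat `isInfixOf` map toLower s)

data RE = RE
  { reString        :: String
  , reCaseSensitive :: Bool
  , reCompiled      :: Compiled  -- lazy field: stays a thunk until first use
  }

-- Hypothetical smart constructor: the compiled form is created alongside
-- the value, but only evaluated when something first matches against it.
mkRE :: String -> Bool -> RE
mkRE s cs = RE { reString = s, reCaseSensitive = cs, reCompiled = compile cs s }

-- Every call on the same RE value reuses the same compiled matcher.
matchRE :: RE -> String -> Bool
matchRE re = let Compiled f = reCompiled re in f

main :: IO ()
main = do
  let re = mkRE "foo" False
  print (map (matchRE re) ["a FOO b", "bar"])  -- prints [True,False]
```

Because the field is non-strict, syntaxes whose regexes are never used pay no compilation cost.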

This has a significant effect on performance, reducing the total build time from ~38s to ~24s. Another profile for comparison (though I'm a little dubious of the relative speedup in profiling builds):

Another screenshot of a Speedscope profile. Tokenize now takes up much less time (~1.5s), and the whole process only takes 6.5s.

There is some awkwardness here, as we now need to derive all the type classes manually. This is especially irritating for the Show/Read instances. I'm not sure there's a good alternative here.

I've tried to hide the internals of this - we're using pattern synonyms to keep the interface the same as before (RE { reString, reCaseSensitive }).
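A rough sketch of how a record pattern synonym can keep the old `RE { reString, reCaseSensitive }` interface while hiding a cached field, together with hand-written instances. The names `RE'`, `Compiled` and `compile` are assumptions made for illustration (the real patch may be structured differently), and `String` is used in place of the library's own string type.

```haskell
{-# LANGUAGE PatternSynonyms #-}

import Data.List (isInfixOf)

-- Hypothetical stand-in for a compiled regex; functions have no Show/Eq,
-- which is why the instances below cannot simply be derived.
newtype Compiled = Compiled (String -> Bool)

compile :: Bool -> String -> Compiled
compile _ pat = Compiled (pat `isInfixOf`)  -- placeholder "compilation"

-- Internal constructor carrying the cached (lazy) compiled form.
data RE = RE' String Compiled Bool

-- Public view: looks like the old two-field record.
pattern RE :: String -> Bool -> RE
pattern RE { reString, reCaseSensitive } <- RE' reString _ reCaseSensitive
  where
    RE s cs = RE' s (compile cs s) cs
{-# COMPLETE RE #-}

-- Manual instances that ignore the cached field.
instance Eq RE where
  RE s1 c1 == RE s2 c2 = s1 == s2 && c1 == c2

instance Show RE where
  showsPrec d (RE s cs) = showParen (d > 10) $
      showString "RE {reString = " . showsPrec 0 s
    . showString ", reCaseSensitive = " . showsPrec 0 cs
    . showString "}"

main :: IO ()
main = do
  let re = RE { reString = "foo", reCaseSensitive = True }
  print re                                  -- RE {reString = "foo", reCaseSensitive = True}
  print (reString re, re == RE "foo" True)  -- ("foo",True)
```

Callers construct and match on `RE` exactly as they would with a plain record, and the `Show` output mimics what a derived instance would print for the two visible fields.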

jgm commented 1 year ago

Excellent. I contemplated doing this originally, but I was too lazy to write all the instances! I'll take a closer look when I have a moment.