jgm / typst-hs

Haskell library for parsing and evaluating typst

Regex inconsistencies #28

Open PgBiel opened 10 months ago

PgBiel commented 10 months ago

Hello, I've observed several inconsistencies between the regex pandoc uses when reading Typst documents and the regex Typst uses.

Here are a few of them:

  1. Not all flags are supported. Typst regex supports the flags i, m, s, u, x. Of those, only i appears to be supported by Pandoc. For example, #(regex("(?m)a") in "A") compiles in Typst, but doesn't in Pandoc (3.1.11.1 via try.pandoc.org), with the error (line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}m)a" (line 1, column 4): unexpected '0' expecting an atom.
    • I especially miss the m (multiline) flag in order to be able to match the start of a line with ^ and the end of a line with $.
  2. Non-capturing groups are not supported: #(regex("(?:x)") in "x") compiles in Typst, but not in Pandoc ((line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}:x)" (line 1, column 4): unexpected '0' expecting an atom).
    • This is needed to avoid unnecessary capture groups in the output, and is frequently used across my packages.
  3. Explicitly named capture groups are not supported: #(regex("(?P<a>x)") in "x") compiles in Typst, but not in Pandoc ((line 1, column 2): parseRegex for Text.Regex.TDFA.Text failed:"({0,1}P<a>x)" (line 1, column 4): unexpected '0' expecting an atom).

Besides non-compilation, there are inconsistencies in the results of regex matching as well.

  1. #(regex("[\s\S]+") in "x") returns true in Typst, but false in Pandoc.
  2. #("a \n b" == "a \n b".match(regex("[^.]+")).text) returns true in Typst, but false in Pandoc. In general, [ ] seems to be unable to accept newlines, when it should.
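
One possible explanation for the newline behaviour in item 2, offered as a guess rather than a diagnosis: regex-tdfa's default compile options enable newline-sensitive matching (the multiline field of CompOption), under which . and negated bracket expressions are documented not to match a newline. The sketch below assumes the regex-tdfa package is installed and that the pattern is compiled with defaultCompOpt; firstMatch is a hypothetical helper, not something taken from typst-hs.

```haskell
module Main where

import Text.Regex.TDFA

-- Hypothetical helper: compile a pattern with the multiline option set as
-- requested and return the text of the first match ("" if there is none).
firstMatch :: Bool -> String -> String -> String
firstMatch newlineSensitive pat input = match re input
  where
    re :: Regex
    re = makeRegexOpts opts defaultExecOpt pat
    -- defaultCompOpt has multiline = True; we only override that one field.
    opts = defaultCompOpt { multiline = newlineSensitive }

main :: IO ()
main = do
  let input = "a \n b"
  -- Newline-sensitive mode (the default in defaultCompOpt).
  print (firstMatch True "[^.]+" input)
  -- Newline treated as an ordinary character.
  print (firstMatch False "[^.]+" input)
```

If regex-tdfa behaves as its documentation describes, the first call stops at the newline and the second returns the whole string, which would correspond to the Typst result. Whether typst-hs actually compiles its patterns this way is an assumption here.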

There are probably inconsistencies I haven't found yet as well, but they could be added to this issue as they are found.

jgm commented 10 months ago

Yes. Problem is that we can't just use the regex engine typst uses. We are limited to the Haskell ecosystem. So what I do is use the regex-tdfa package for the basics, and try to supplement it when possible for things it is missing. E.g. it is missing \d \w \s, ?, and +, so I just replace these with equivalents. Of course, this isn't 100% reliable, and we can already see a place where it produces bad results in your #1 -- (?m) is a special construction; ? here doesn't mean "0 or 1", but my hack just replaces the ? with {0,1} with terrible results.
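
To make the failure mode concrete, here is a minimal sketch of that kind of textual substitution. It is not the actual typst-hs code: naiveRewrite is a hypothetical name, and every replacement string except {0,1} for ? (which shows up in the error messages quoted above) is an assumption chosen for illustration.

```haskell
module Main where

-- Hypothetical stand-in for the substitution hack: walk the pattern and
-- replace constructs with POSIX-style equivalents, ignoring context.
naiveRewrite :: String -> String
naiveRewrite [] = []
naiveRewrite ('?' : cs) = "{0,1}" ++ naiveRewrite cs            -- seen in the reported errors
naiveRewrite ('+' : cs) = "{1,}" ++ naiveRewrite cs             -- assumed replacement
naiveRewrite ('\\' : 'd' : cs) = "[[:digit:]]" ++ naiveRewrite cs   -- assumed
naiveRewrite ('\\' : 'w' : cs) = "[[:alnum:]_]" ++ naiveRewrite cs  -- assumed
naiveRewrite ('\\' : 's' : cs) = "[[:space:]]" ++ naiveRewrite cs   -- assumed
naiveRewrite ('\\' : 'S' : cs) = "[^[:space:]]" ++ naiveRewrite cs  -- assumed
naiveRewrite ('\\' : c : cs) = '\\' : c : naiveRewrite cs       -- keep other escapes as-is
naiveRewrite (c : cs) = c : naiveRewrite cs

main :: IO ()
main = mapM_ report ["(?m)a", "(?:x)", "(?P<a>x)", "[\\s\\S]+"]
  where
    report pat = putStrLn (pat ++ "  =>  " ++ naiveRewrite pat)
-- Prints:
-- (?m)a  =>  ({0,1}m)a
-- (?:x)  =>  ({0,1}:x)
-- (?P<a>x)  =>  ({0,1}P<a>x)
-- [\s\S]+  =>  [[[:space:]][^[:space:]]]{1,}
```

The first three outputs match the strings in the parse errors reported above. The last one suggests why [\s\S]+ might misbehave: if \s and \S are expanded even inside a bracket expression, the result is nested brackets, and under POSIX bracket-expression rules that would likely no longer be read as a single character class. A context-aware rewrite would need to track at least whether it is inside [...] or directly after an opening parenthesis.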

I could switch to using another regex engine. Hackage has regex-pcre-builtin, which comes with the C sources so that an external dependency isn't introduced. I've tried to avoid using wrapped C libraries in pandoc, but maybe I could reconsider in this case. I imagine pcre would be pretty close.

I also reimplemented as much as I needed of the regex engine used by KDE for my skylighting library. This isn't currently published as a separate package, though.

jgm commented 10 months ago

Oh, I see there is now https://hackage.haskell.org/package/regex-rure. But this would make pandoc depend on an installation of librure.so/dylib somewhere; I want to avoid that and have a perfectly self-contained static binary.