goessner / markdown-it-texmath

Support TeX math equations with your Markdown documents.
MIT License
161 stars 29 forks source link

Cannot include escaped dollar when using dollars delimiters #32

Closed rogino closed 2 years ago

rogino commented 3 years ago

The regex for dollar delimiters breaks if the content includes an escaped dollar (\$), making it impossible to type dollar literals inside math blocks:

$$
\$10.00
$$

$\$10.00$

will all fail to be parsed as a latex block.

Workarounds include using a different delimiter style or use \text{\textdollar}, but getting it to work without these would be nice.

I've been to update the math_block regex from:

/\${2}([^$]+?)\${2}/gmy to

/\${2}((?:\\\$|[^$])+?)\${2}/gmy

but it doesn't cover all cases, and I haven't been able to figure out where to start with the math_inline regex.

edemaine commented 3 years ago

Unfortunately, regexes won't cut it to fix this problem in full generality, e.g. to support

$$ x = \text{what is $x$?} $$

Would you be willing to review a PR that rewrites the parsing engine to count {/}s and check whether $s are escaped with \, when processing in dollars mode?

goessner commented 3 years ago

hmm ... interesting.

Usually I am very reluctant with extending the regexes used in texmath. They are performance critical, as they are evaluated once with each user key stroke.

In fact escaped dollar sign \$ is a valid element inside latex math and interestingly it is working flawlessly inside of other non-dollar delimiters.

Dollar delimiters are special beasts. While having a short look into the math-inline regex

/\$((?:\S)|(?:\S.*?\S))\$/

I reidentified the final \S (last character before closing dollar) as a shortcut for [^\r\n\t\f\v ] (character exept), which might be easily extended to [^\r\n\t\f\v \\] presumably without too much performance cost.

After testing it in https://regex101.com/ successfully for several relevant cases, I considered it worth for also use it as a guard in the math-block regexes.

Surprisingly a drastic simplification is helping here ... from

/\${2}([^$]+?)\${2}/

to

/\${2}(.+?)\${2}/

Expect it available in the next version.

thanks ...

goessner commented 3 years ago

Erik, thanks for your potential help.

If there are some problems with my bug fix attempt, I would like to come back to your PR offer.

edemaine commented 3 years ago
/\${2}(.+?)\${2}/

Changing from [^$] to . excludes newlines. I think you want to allow single newlines (though you could forbid double newlines, as TeX does) within math. Instead of . you probably want [^], as in:

/\${2}([^]+?)\${2}/

But none of this fixes use of \$ (escaped $) or \text{$...$} (nested $) within a math expression.

One potential fix would be to use the regex as above, and then check whether the closing $ is actually escaped (had an odd number of \s before it), or has more unescaped {s than }s in it, and in that case, doing more work (probably another regex match to get contents until the next \${2}, concatenating, and trying again). This would mostly just take extra time in the weird edge cases of escaped and nested $s, which currently don't work, so seems like a win? But it would involve counting the number of {s and }s (without preceding \s) to make sure it's matched. I imagine this is all way faster than the cost of calling KaTeX, though.

Alternatively (and what I originally had in mind), the regex could match the opening \${2}, and then do a secondary search for { or } or `\${2}$, checking for escaping in each case, and repeat until finding the unnested closing notation. I could test to see which is faster in which cases.

goessner commented 3 years ago

In fact I had temporarily forgotten that '.' excludes newlines. So taking your proposed '[^]' works fine ...

inline: [
   {   name: 'math_inline_double',
       rex: /\${2}(.+?)\${2}/gy
   },
   {   name: 'math_inline',
       rex: /\$((?:[^\r\n\t\f\v \\])|(?:\S.*?[^\r\n\t\f\v \\]))\$/gy
  }
],
block: [
  {   name: 'math_block_eqno',
      rex: /\${2}([^]+?[^\\])\${2}\s*?\(([^)\s]+?)\)/gmy
  },
  {   name: 'math_block',
     rex: /\${2}([^]+?[^\\])\${2}/gmy
  }
]

... as you can see with this example code (https://github.com/goessner/markdown-it-texmath/blob/master/test/bug-dollars.html)

    const str = `
  # Simple Dollar tests
  ## Inline

  here "$a+\\$ = b$" we "$\\$$" go "$\\text{\\$some...\\$}$"

  ## Inline block (single line only)

  $$a+\\$ = \\text{\\$more...\\$} \\$$$  or ...

  ## Block (multiline)

  $$
  a+\\$ = \\text{\\$text...\\$} \\$
  $$
`

resulting in ... grafik

To also handle unescaped dollars inside of \text{$...$} I see effort to use relation as disproportionate. It seems to be reasonable to also escape dollars \text{\$...\$} in that edge case. So I would prefer to live with that small insufficiency of markdown-it-texmath for performance and simplicity reasons.

I hope I have not overlooked anything ... thanks.

edemaine commented 3 years ago

Minor points:

These new rules seem to deal with \$ properly. Nice!!

Nested $s are for re-entering math mode. \text{$x+y$} is different from \text{\$x+y\$}:

image

So it's not possible to escape these instances of $, as \$ means something in LaTeX.

Unmatched braces generate errors in KaTeX, though. (I see either Uncaught ParseError: KaTeX parse error: Expected '}', got 'EOF' at end of input or KaTeX parse error: Unexpected end of input in a macro argument, expected '}' at end of input: \text{.) So perhaps that could be detected, which triggers an "extension" regex? I believe the extension regex is exactly math_inline without the leading $, or applying math_inline but starting from the final $. On input $\text{$x+y$}$, after matching $\text{$, you'd next grab x+y$, fail, and then grab }$, and then succeed. This is quadratic time in the number of $s, but that's probably small... Alternatively, when in extension mode, we could count {/}s, so call KaTeX at most twice.

goessner commented 2 years ago

Sorry for the delay. Thanks for your valuable input.

* `math_inline_double` still uses `.`; should probably switch that to `[^]` too.

math_inline should be written on a single line, but I added it to be more forgiving here.

* I think you can use `[^\s\\]` instead of `[^\r\n\t\f\v \\]`. (The behavior is slightly different: the former treats all Unicode space identically. Probably better?)

This is definitely better ... taken.

* `[^]+?[^\\]` (which occurs in both block rules) seems to require at least two characters.  Should probably be `[^]*?[^\\]`.

Yes ... thanks for catching.

* Shouldn't `math_inline_double` have the same addition to exclude a trailing `\`?

sure ... done.

* `\$((?:[^\r\n\t\f\v \\])|(?:\S.*?[^\r\n\t\f\v \\]))\$` can be simplified to `\$([^\r\n\t\f\v \\]|\S.*?[^\r\n\t\f\v \\])\$` or `\$([^\s\\]|\S.*?[^\s\\])\$`. (I'm not quite sure why you're forbidding spaces next to the `$`s but I assume that's intentional, to avoid some stray matching.)

yes ... again a significant improvement.

Nested $s are for re-entering math mode. \text{$x+y$} is different from \text{\$x+y\$}: So it's not possible to escape these instances of $, as \$ means something in LaTeX.

Now I understand that reentering math mode effect. Escaping makes no sense indeed. I still consider it a not so relevant edge case in practise. I don't want to invest that significant implementation effort at current. A PR is always welcome though.

thanks again ...