leanprover / doc-gen4

Document Generator for Lean 4
Apache License 2.0
62 stars 41 forks source link

Escape sequences are handled incorrectly in LaTeX #178

Closed eric-wieser closed 2 months ago

eric-wieser commented 5 months ago

The docstring here, which is

/-- Given modules `M`, `M₂` over a commutative ring, together with submodules `p ⊆ M`, `q ⊆ M₂`,
the set of maps $\{f ∈ Hom(M, M₂) | f(p) ⊆ q \}$ is a submodule of `Hom(M, M₂)`. -/

renders as

Given modules M, M₂ over a commutative ring, together with submodules p ⊆ M, q ⊆ M₂, the set of maps $f ∈ Hom(M, M₂) | f(p) ⊆ q$ is a submodule of Hom(M, M₂).

instead of the expected (note the {}s):

Given modules M, M₂ over a commutative ring, together with submodules p ⊆ M, q ⊆ M₂, the set of maps $\{f ∈ Hom(M, M₂) | f(p) ⊆ q \}$ is a submodule of Hom(M, M₂).

Interestingly github seems to agree with this interpretation:

Given modules M, M₂ over a commutative ring, together with submodules p ⊆ M, q ⊆ M₂, the set of maps ${f ∈ Hom(M, M₂) | f(p) ⊆ q }$ is a submodule of Hom(M, M₂).

I think this is caused by the markdown parser not really supporting $$ notation; it should not be processing markdown escape sequences inside embedded latex code, just as it already does not process markdown escape sequences in code blocks.

eric-wieser commented 5 months ago

(I think this has been reported on Zulip before, but I couldn't find a tracking issue)

acmepjz commented 3 months ago

just as it already does not process markdown escape sequences in code blocks.

This is a natural feature of markdown processor I think. But generally markdown processor (in our case it is a C library cmark) does not understand LaTeX. So I think before feeding the docstring to markdown processor, we need to filter all LaTeX formulas.

EDIT: I have tested that cmark indeed will convert $\{ \}$ to ${ }$.

PS: M₂ is not a valid LaTeX formula; it should be M_2.

acmepjz commented 3 months ago

I think I can try to look at this issue.

eric-wieser commented 3 months ago

PS: M₂ is not a valid LaTeX formula; it should be M_2.

Mathjax is quite happy to render $M₂$, though it admittedly looks worse than $M_2$

eric-wieser commented 3 months ago

So I think before feeding the docstring to markdown processor, we need to filter all LaTeX formulas

This is not a sensible approach; the markdown parser needs to natively know where a formula starts and end; if you do what you suggest, then things like This is a `$` with not latex `$` will render incorrectly. If you teach your filtering about code syntax, then now you have a hacky parser wrapped around a real one, and will almost certainly create more bugs.

acmepjz commented 3 months ago

If you teach your filtering about code syntax, then now you have a hacky parser wrapped around a real one, and will almost certainly create more bugs.

Unfortunately this is inevitable if we want a quick fix. The proper fix perhaps is waiting for https://github.com/leanprover/verso

bryangingechen commented 2 months ago

So I think before feeding the docstring to markdown processor, we need to filter all LaTeX formulas

This is not a sensible approach; the markdown parser needs to natively know where a formula starts and end; if you do what you suggest, then things like This is a `$` with not latex `$` will render incorrectly. If you teach your filtering about code syntax, then now you have a hacky parser wrapped around a real one, and will almost certainly create more bugs.

If I understand the suggestion correctly, this is actually what we did in doc-gen3 to get LaTeX working there, but it was indeed a bit of a hack: https://github.com/leanprover-community/doc-gen/pull/110

(Basically, we ported the code that math.stackexchange and mathoverflow use for filtering LaTeX formulas out of markdown, hoping that (1) if there were any bugs in that parser it would have been found by someone using those sites already (2) we wouldn't introduce any new bugs porting this incredibly delicate JS code to Python.)

edit: I think everything I wrote in this Zulip post still stands (including my offer to help if someone wants to take this approach)

acmepjz commented 2 months ago

I am investigating with other solutions. So far I found another markdown parser written in C https://github.com/mity/md4c which claims supporting LaTeX formulas, tables, etc. (Currently we are using https://github.com/commonmark/cmark.) Hopefully this will be fixed after switching markdown processor.

bryangingechen commented 2 months ago

Yes, if switching the processor works that would be easiest! By the way, if you need some examples to test LaTeX embedded in markdown, I put together this markdown file (see also the result rendered by doc-gen3).

edit: Interestingly, our test file reveals various issues(?) in GitHub's handling of LaTeX in markdown.