Backslash error when `xgettext` handles $$ formula

Demian101 commented 10 months ago

I think this is an xgettext bug.

When I run

```
MDBOOK_OUTPUT='{"xgettext": {"pot-file": "messages.pot"}}' \
  mdbook build -d po
```

to generate the messages.pot file, the backslashes in my formula are multiplied.

My source .md file:

```
// my_md_file.md:

$$
\begin{array}{|c|c|c|c|c|}
\hline
1 & x_1 & x_2 & x_3 & out \\
\hline
0 & 1 & 1 & 0 & 0 \\
\hline
\end{array}
$$
```

After conversion, the messages.pot file contains:

```
// messages.pot
#: src/plonk-arithmetization.md:35
msgid ""
"$$ \\\\begin{array}{|c|c|c|c|c|} \\\\hline 1 & x_1 & x_2 & x_3 & out \\\\ \\"
"\\hline 0 & 1 & 1 & 0 & 0 \\\\ \\\\hline \\\\end{array} $$"
msgstr ""
```

As a result, the message is full of extra backslashes, and mdbook-katex cannot render the formula!

mgeisler commented 10 months ago

Hi @Demian101, thanks for reporting this! I love LaTeX, so I would like to see this working :smile:

First, let me add a test which demonstrates how the backslashes are handled: #108.

This shows that the round-tripping might be surprising: when you enter \ in your document, the translation sees \\ since this is an equivalent way of entering a backslash. So if the mdbook-katex preprocessor doesn't understand this, you end up with a problem.

I haven't looked at how mdbook-katex works yet, but perhaps you could start by looking in their issues and documentation to see if they talk about how they handle backslashes?

Demian101 commented 10 months ago

Thanks, I appreciate your reply!

So, if I understand you correctly:

When text containing backslashes is converted, turning a single \ into \\ is standard behavior, so
```
$$ \\\\begin{array}{|c|c|c|c|c|} \\\\hline 1 & x_1 & x_2 & x_3 & out \\\\ \\"
"\\hline 0 & 1 & 1 & 0 & 0 \\\\ \\\\hline \\\\end{array} $$
```
is a correct way of processing it (or at least one that complies with the Markdown specification).

The core question is: mdbook-katex needs to adapt, handle the form above, and render it correctly.

I understand this is what you meant in your reply. Am I on point? 🤣

mgeisler commented 10 months ago

> The core question is: mdbook-katex needs to adapt, handle the form above, and render it correctly.
>
> I understand this is what you meant in your reply. Am I on point? 🤣

Yes, I believe you got it correctly.

The problem is that multiple different Markdown documents can give the same result. The two paragraphs here are identical after parsing the Markdown:

```
\x

\\x
```

Check it in the commonmark.js playground by clicking the AST tab, which shows

```
<paragraph>
  <text>\</text>
  <text>x</text>
</paragraph>
<paragraph>
  <text>\</text>
  <text>x</text>
</paragraph>
```

for this example.

When we parse either paragraph for translation, we get a Rust string with r"\x" (backslash-x, 2 bytes), but when we turn it back into Markdown, we end up with r"\\x" (backslash-backslash-x, 3 bytes). We then save that to the PO file, which triggers another round of escaping so that you end up with \\\\x (5 bytes) in the PO file.
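The two rounds of escaping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `escape_backslashes` is a hypothetical helper that doubles every backslash, standing in for both the Markdown serializer and the PO writer, which handle many more cases in reality.

```python
def escape_backslashes(text):
    """Double every backslash (hypothetical stand-in for one escaping round)."""
    return text.replace("\\", "\\\\")


parsed = r"\x"                                 # text node from the Markdown parser
as_markdown = escape_backslashes(parsed)       # re-serialized Markdown text
in_po_file = escape_backslashes(as_markdown)   # PO string escaping on top of that

# Print each stage together with its length in bytes.
for label, value in [("parsed", parsed), ("markdown", as_markdown), ("po file", in_po_file)]:
    print(label, value, len(value.encode("utf-8")))
```

Each round doubles the number of backslashes, which is why one backslash in the parsed text shows up as four in the PO file.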

Now, you should not edit the PO file directly: use a PO editor instead. There are several online ones or you can install Poedit locally. When it displays the PO file, it will unescape it and show you \\x. But it's still annoying and confusing that there are "extra" backslashes like this.

I think fixing this would actually require a change in pulldown-cmark-to-cmark, which is the crate we use to turn the Markdown AST into Markdown text. The fix would be to use \x instead of \\x when this is the same according to the CommonMark spec. It's not always possible, though! Your input above shows such an example: when you have \\ in a Markdown file, you're actually entering a single logical \ because you're escaping the backslash.

It's all a bit ambiguous and I would be curious to hear how mdbook-katex deals with this. Thanks for creating the issue there, I'll go subscribe to it now.

mgeisler commented 10 months ago

> I think fixing this would actually require a change in pulldown-cmark-to-cmark, which is the crate we use to turn the Markdown AST into Markdown text. The fix would be to use \x instead of \\x when this is the same according to the CommonMark spec. It's not always possible, though! Your input above shows such an example: when you have \\ in a Markdown file, you're actually entering a single logical \ because you're escaping the backslash.

I think I was wrong here. It should be fine that pulldown-cmark-to-cmark turns \x into \\x in the Markdown text, because the next step in the process won't be able to tell the difference. In particular, the final HTML output will only contain \x (2 bytes), since that is what \\x means in Markdown.
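To make the equivalence concrete, here is a rough Python sketch of the CommonMark backslash-escape rule: a backslash followed by an ASCII punctuation character yields that character, and any other backslash is kept literally. The function name `resolve_backslash_escapes` is made up for this illustration, and the sketch ignores code spans, hard line breaks, and every other Markdown construct.

```python
import string


def resolve_backslash_escapes(markdown):
    """Apply only the CommonMark backslash-escape rule (simplified sketch)."""
    out = []
    i = 0
    while i < len(markdown):
        ch = markdown[i]
        next_ch = markdown[i + 1] if i + 1 < len(markdown) else ""
        if ch == "\\" and next_ch and next_ch in string.punctuation:
            out.append(next_ch)   # escaped punctuation: drop the backslash
            i += 2
        else:
            out.append(ch)        # ordinary character or literal backslash
            i += 1
    return "".join(out)


# Both spellings resolve to the same two-character text.
print(resolve_backslash_escapes(r"\x") == resolve_backslash_escapes(r"\\x"))
```

The same rule explains the LaTeX line breaks in this issue: a `\\` in the Markdown source resolves to a single backslash, so a tool that reads the raw Markdown and a tool that reads the parsed text see different strings.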

mgeisler commented 10 months ago

I created https://github.com/Byron/pulldown-cmark-to-cmark/issues/60 to describe the idea of emitting a simpler escaped form. Both would be correct, but translators will have an easier time working with $\sqrt{\frac{1}{x}}$ instead of $\\sqrt{\\frac{1}{x}}$ .

mgeisler commented 10 months ago

I've now learnt that mdbook-katex uses the raw Markdown input: https://github.com/lzanini/mdbook-katex/issues/100#issuecomment-1780611579.

This suggests a different approach: @Demian101 can you try configuring mdbook-katex to run before both mdbook-xgettext and mdbook-gettext? I believe you can do this with this configuration in your book.toml:

```
[preprocessor.katex]
after = ["links"]
before = ["gettext"]
```

The goal is to:

- Always run mdbook-katex, both when you output HTML and when you extract messages with mdbook-xgettext. You should no longer see equations in your PO files: instead you might get the HTML that I believe mdbook-katex inserts.
- Run mdbook-katex before you do the translation with mdbook-gettext.

I think this should work, but you will lose the ability to translate the math. Let me know what you find out.

Demian101 commented 10 months ago

Thanks a lot for your great support! I tried it, and here are my findings:

my book.toml:

```
[preprocessor.katex]
after = ["links"]
```

Just like before, the formula does not render.

Tip: the $\color{brown}brown$ block already looks like a LaTeX formula that is ready to render. The only problem is that the \\ used for line breaks has been changed to \ (the line I marked in red).

What I mean is: if we get the form shown below, can it be rendered successfully? I suspect some subtle problems are hidden here 🤣

$\begin{array}{|c|c|c|c|c|} \hline 1 & x_1 & x_2 & x_3 & out \\ \hline 0 & 1 & 1 & 0 & 0 \\ \hline \end{array}$

my book.toml (with before = ["gettext"] added):

```
[preprocessor.katex]
before = ["gettext"]  # <-- note this line
after = ["links"]
```

Amazing! The annoying LaTeX now works.

But 🤣 there is a new problem, as shown above: sentences containing inline KaTeX are not rendered.

My guess is that if a sentence contains inline LaTeX like $xx$, the entire sentence is not processed by gettext (maybe?).

Demian101 commented 10 months ago

I built a minimal demo for you to try.

Just run:

```
git clone https://github.com/Demian101/Demian101.github.io
cd Demian101.github.io
MDBOOK_BOOK__LANGUAGE=en mdbook serve -d book/en
```

In this demo, the en.po and .pot files are almost empty, but rendering the formula still fails when the ./en folder is generated.

So I think something goes wrong when gettext processes the raw .md file.

Or you could point me to the gettext source code, and I will try to fix it. What exactly happens when MDBOOK_BOOK__LANGUAGE=en mdbook serve -d book/en runs?

How to debug:

Try commenting out the after = ["gettext"] line in book.toml to see what happens.

mgeisler commented 10 months ago

Hey @Demian101, thanks for documenting this! I don't have much time to look at this myself, but I've asked around internally and perhaps someone else will find the time to work on it.

One idea: mdbook has a Markdown output format which you should try enabling. See Configuring Renderers. That ought to show you in more detail how things are transformed.

kdarkhan commented 8 months ago

Hi @Demian101, I took a look at your repro repo. Not sure if I understood it correctly since you seem to have pushed some more commits after the last time you left a comment.

I tried to create a smaller POC for testing how stuff works and got it working at https://github.com/kdarkhan/mdbook-i18n-and-katex

The GitHub Pages version is available here.

I believe you might have had your stuff broken because your PO files here were not re-generated after you updated mdbook-gettext / mdbook-katex execution order.

For instance, I found this msgid which should not be there. I think you mentioned that for inline LaTeX, your translations stopped working. The reason could be that, with katex executed earlier, the msgids were updated and no longer matched the older version you had.
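The effect of a stale msgid can be illustrated with a small Python sketch. It assumes the usual gettext behavior of looking up the exact source string and falling back to it when no entry is found. The `translate` helper, the sample sentence, and the HTML placeholder are all made up for illustration and are not the real mdbook-katex output.

```python
def translate(catalog, msgid):
    """Return the translation for msgid, or msgid itself if none exists."""
    return catalog.get(msgid, msgid)


# Catalog extracted BEFORE the preprocessor order changed:
# the msgid still contains the raw inline formula.
catalog = {"The value $x$ is small.": "Der Wert $x$ ist klein."}

old_text = "The value $x$ is small."
# Hypothetical text after katex has run first; the span is a placeholder.
new_text = 'The value <span class="katex">...</span> is small.'

print(translate(catalog, old_text))  # matches the catalog entry
print(translate(catalog, new_text))  # no match, so the text stays untranslated
```

Re-extracting the messages after changing the preprocessor order regenerates the msgids so that they match what gettext sees.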

Based on my testing, when I run mdbook-gettext after mdbook-katex, gettext only sees the generated MathJax nodes, not the original LaTeX.

The same LaTeX table you had becomes MathJax HTML in the PO file.

If desired, LaTeX blocks could be skipped as I did here.

Let me know if I missed something.

mgeisler commented 8 months ago

Thanks @kdarkhan for looking into this!

kdarkhan commented 7 months ago

@Demian101 I will resolve this bug. Feel free to reopen if you are still facing this issue.

google / mdbook-i18n-helpers

Backslash error when `xgettext` handles $$ formula #105
