scruel / tampermonkey-scripts

Naïve scripts by naïve ideas.

Some Markdown formatting symbols * or _ in inline LaTeX formulas may cause rendering errors #12

Closed lethefrost closed 1 year ago

lethefrost commented 1 year ago

On ChatGPT's web interface, inline formulas of the form {x}_a and y_{b} render incorrectly. Here x, y, a, and b can be any elements, the same or different. The key point is that there are two _ characters in the same paragraph, and that the character immediately before the first _ and the character immediately after the second _ are symbols rather than letters or digits. The same goes for *. In that situation the _ characters are consumed first as Markdown italic markers, causing unwanted rendering results.

I made the following table to demonstrate some test cases. (I suddenly found that GitHub has this problem too. I am now confused about whether this is a bug or a feature. Are they using the same rendering module? I just tried some other Markdown note-taking apps such as Obsidian and Logseq, as well as pandoc for exporting Markdown to PDF, and they all render correctly. So far I have only found this issue on GitHub and ChatGPT.)

| Source Text | Rendered Result | Explanation |
| --- | --- | --- |
| `$x_a$ and $y_b$` | `$x_a$ and $y_b$` | If rendered correctly |
| `${x}_a$ and $y_{b}$` | `${x}a$ and $y{b}$` | Being treated as italic formatter |
| `${x}_{a}$ and ${y}_{b}$` | `${x}{a}$ and ${y}{b}$` | As long as the characters before the first _ and after the second _ are symbols, it will cause this bug, and it doesn't matter whether it is a symbol elsewhere |
| `${x}_a = y_{b}$` | `${x}a = y{b}$` | Even if it is not two separated inline formulas in the same line but in the same formula, it will be treated as italic |
| `$x^@_i = y_@$` | `$x^@i = y@$` | It doesn't have to be characters like {}, as long as the adjacent characters are not a letter/number, it will cause this bug |
| `$x^2_i = y_2$` | `$x^2_i = y_2$` | If the adjacencies are numbers, it will be rendered correctly |
| `${x}^{*} = {y}^{*}$` | `${x}^{} = {y}^{}$` | Consuming * and becoming italic |
| `$\mathbf{w}_t$ and $\mathbf{w}_{t+1}$` | `$\mathbf{w}t$ and $\mathbf{w}{t+1}$` | Such formulas are quite common in gradient descent and other fields, but cannot be rendered correctly… |
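One possible explanation for the pattern in these test cases is the emphasis rule in the CommonMark specification, on which GitHub Flavored Markdown is based. Whether ChatGPT's renderer follows the same rule is an assumption; nothing in this thread confirms which parser it uses. The sketch below is a simplified Python approximation of the "left-flanking" and "right-flanking" delimiter tests for a single `_` or `*`. The function names are made up for illustration, and delimiter runs longer than one character are not handled.

```python
import string
import unicodedata


def _is_space(ch: str) -> bool:
    # The start or end of the text is passed in as "" and counts as whitespace.
    return ch == "" or ch.isspace()


def _is_punct(ch: str) -> bool:
    # Approximation: ASCII punctuation plus Unicode punctuation/symbol categories.
    if ch == "":
        return False
    return ch in string.punctuation or unicodedata.category(ch)[0] in "PS"


def _flanking(text: str, i: int):
    """Return (left_flanking, right_flanking, before, after) for text[i]."""
    before = text[i - 1] if i > 0 else ""
    after = text[i + 1] if i + 1 < len(text) else ""
    left = not _is_space(after) and (
        not _is_punct(after) or _is_space(before) or _is_punct(before)
    )
    right = not _is_space(before) and (
        not _is_punct(before) or _is_space(after) or _is_punct(after)
    )
    return left, right, before, after


def can_open(text: str, i: int) -> bool:
    left, right, before, _ = _flanking(text, i)
    if text[i] == "*":
        return left
    # "_" is stricter: it may not open emphasis in the middle of a word.
    return left and (not right or _is_punct(before))


def can_close(text: str, i: int) -> bool:
    left, right, _, after = _flanking(text, i)
    if text[i] == "*":
        return right
    # "_" is stricter: it may not close emphasis in the middle of a word.
    return right and (not left or _is_punct(after))


def forms_emphasis(text: str, delim: str = "_") -> bool:
    """True if some delimiter can open and a later one can close."""
    positions = [i for i, ch in enumerate(text) if ch == delim]
    return any(
        can_open(text, i) and any(can_close(text, j) for j in positions[n + 1:])
        for n, i in enumerate(positions)
    )
```

Under these rules, `forms_emphasis("$x_a$ and $y_b$")` is false because both underscores sit between letters, while `forms_emphasis("${x}_a$ and $y_{b}$")` is true because the first `_` follows `}` and the second precedes `{`. That matches the rows above, including the `*` case when called with `delim="*"`.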


Do you have any idea what causes this bug (or feature)? Do GitHub and ChatGPT share the same Markdown rendering module? But then why wasn't ChatGPT originally able to render inline formulas? I am getting really confused here.

I am wondering whether it would be possible for you to catch the _ characters in the source text and render the formulas with MathJax before they are consumed by the Markdown renderer. It looks like the source text of a GPT-generated response can be obtained by some means; for example, the chatgpt-exporter repo does so.
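To make the idea concrete, here is a minimal Python sketch of "protect the math first, render Markdown second". It assumes the raw source text is available and that inline formulas are delimited by single `$` characters on one line. The names `protect_math`, `restore_math`, and `toy_italic`, and the `@@MATH0@@` placeholder format, are invented for illustration; `toy_italic` is only a naive stand-in for a real Markdown renderer.

```python
import re

# Inline math: "$...$" on one line, with no space just inside the delimiters.
MATH = re.compile(r"\$(?!\s)([^$\n]+?)(?<!\s)\$")
PLACEHOLDER = re.compile(r"@@MATH(\d+)@@")


def protect_math(source: str):
    """Swap each inline formula for a placeholder the renderer will not touch."""
    formulas = []

    def stash(match):
        formulas.append(match.group(0))
        return f"@@MATH{len(formulas) - 1}@@"

    return MATH.sub(stash, source), formulas


def restore_math(rendered: str, formulas):
    """Put the untouched formulas back after Markdown rendering."""
    return PLACEHOLDER.sub(lambda m: formulas[int(m.group(1))], rendered)


def toy_italic(text: str) -> str:
    # Naive stand-in for a Markdown renderer: pairs up "_" as <em>...</em>.
    return re.sub(r"_(.+?)_", r"<em>\1</em>", text)


source = "${x}_a$ and $y_{b}$"
protected, formulas = protect_math(source)
safe = restore_math(toy_italic(protected), formulas)  # formulas survive intact
broken = toy_italic(source)  # underscores are eaten as italics
```

With this ordering the formulas reach MathJax unchanged, whereas rendering the raw source directly turns the two underscores into an `<em>` pair.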

Hope we can figure this out. Thank you so much! I greatly appreciate your work, and it indeed helped me a lot. ❤️

scruel commented 1 year ago

Hi, thanks for posting this issue with such a clear explanation. After checking and reproducing the cases you provided, I already know why this happens, and I will try to fix it. The reason chatgpt-exporter does not have this issue is that it fetches the raw content from the API directly, rather than parsing HTML content that has been polluted by other renderers.

scruel commented 1 year ago

Temporarily fixed in 0.5.8.

scruel commented 1 year ago

We reverse the HTML em tag back to _ only, which leads to another issue: the Markdown renderer gives us no way to know whether the original symbol behind an em tag was _ or *, so for the formula ${x}^{*} = {y}^{*}$ the script won't be able to typeset it correctly. I will try to fix this as well.
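The limitation can be illustrated with a small Python sketch. This is not the script's actual code; `em_to_underscore` is a hypothetical helper that mimics the "always reverse to _" strategy on a plain string.

```python
import re


def em_to_underscore(html: str) -> str:
    """Replace every <em> and </em> with "_", whatever the original symbol was."""
    return re.sub(r"</?em>", "_", html)


# Originally written with "_": the formula is recovered.
recovered = em_to_underscore("${x}<em>a$ and $y</em>{b}$")

# Originally written with "*": the result differs from the author's source.
wrong = em_to_underscore("${x}^{<em>} = {y}^{</em>}$")
```

Here `recovered` equals `${x}_a$ and $y_{b}$`, but `wrong` becomes `${x}^{_} = {y}^{_}$` instead of `${x}^{*} = {y}^{*}$`, because the rendered HTML no longer records which delimiter produced the tag.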

lethefrost commented 1 year ago

Hi scruel, thank you!! God, you are really amazing! You identified the issue so fast. I don't understand the cause at all, and I wonder if you would mind taking some time to explain why ChatGPT and GitHub both have this issue? I really appreciate your quick response and fix. It's so impressive!

Also, thank you for recognizing the * problem with the temporary solution! I think some syntax that is unique to _ might help distinguish the * and _ cases. For example, if the <em> tags match the following cases, they cannot have been _ originally (i.e. they must have been *), because of the syntactic rules of $\LaTeX$:

Thank you again for your work! Really appreciate it. Hope I am helping 😊. I just found a new bug with the 0.5.8 release and I will raise a new issue for this.

scruel commented 1 year ago

@lethefrost For GitHub, I can't be sure what causes this. On the ChatGPT page, the problem is caused by the wrong rendering order. As you said before, we should first typeset the LaTeX formulas and then render the Markdown formatting, but a script (without injecting) can only do its work at the end. The cases you provided are helpful, and I will consider them while fixing this. Currently, I can fetch the raw content and match against it to confirm the original symbol. However, I think I will have to reverse some parts of the HTML back to Markdown to do the match, and this "reverse" processing will also cause some problems, so I will need some time to fix this.
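A rough Python sketch of the matching idea described above, under strong simplifying assumptions: the rendered fragment differs from the raw source only by its `<em>` tags, and the raw source is available as a plain string. The function name `recover_source` is hypothetical, and this is not how the 0.6.0 fix is known to work. It tries each delimiter for each `<em>` pair and keeps the reconstruction that actually occurs in the raw content.

```python
import itertools
import re

EM_PAIR = re.compile(r"<em>(.*?)</em>", re.S)


def recover_source(fragment: str, raw: str):
    """Guess the original delimiters by checking candidates against the raw text."""
    pair_count = len(EM_PAIR.findall(fragment))
    for choice in itertools.product("_*", repeat=pair_count):
        delimiters = iter(choice)

        def put_back(match):
            d = next(delimiters)
            return f"{d}{match.group(1)}{d}"

        candidate = EM_PAIR.sub(put_back, fragment)
        if candidate in raw:
            return candidate
    return None  # no candidate matches; the fragment differs in other ways


raw = "Let ${x}^{*} = {y}^{*}$ hold, where ${x}_a$ and $y_{b}$ are given."
star_case = recover_source("${x}^{<em>} = {y}^{</em>}$", raw)
underscore_case = recover_source("${x}<em>a$ and $y</em>{b}$", raw)
```

The search grows as 2^n in the number of `<em>` pairs, so it only suits short fragments. It also ignores the harder part mentioned above: other HTML in the fragment (entities, nested tags) would have to be reversed to Markdown before the substring check could succeed.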

scruel commented 1 year ago

Fixed in 0.6.0.