miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
811 stars 113 forks source link

Ensure LaTeX renderer uses valid \verb delimiter #150

Closed joel-coffman closed 2 years ago

joel-coffman commented 2 years ago

The LaTeX renderer uses \verb for inline code, but the delimiter is always a vertical bar, which produces incorrect output when the inline code also contains a vertical bar (e.g., example | pipe).

Rather than using a single static character (i.e., a vertical bar), this change modifies render_inline_code to search for a non-letter delimiter that does not appear in the inline code. If no such delimiter can be found, a RuntimeError is raised to avoid incorrect output.

Note that the list of possible delimiters is not exhaustive. For example, numbers (0, 1, 2, etc.) are all valid delimiters for \verb but are omitted from the search.

Fixes #149

pbodnar commented 2 years ago

@joel-coffman, thanks for this PR, I like it. :)

Maybe just 2 questions to the delimiters set - currently it is:

|+=!@#$^&`~-_:

I have read that the only character which cannot be used is *. So I wonder if you don't want to add some other characters like %, ;, / or even ?, "just for completeness". I also wonder about the order of the possible delimiters - do we set the order more or less randomly, or is it possible / does it pay off to do something like "(estimating) frequency analysis" for choosing "the best order"?

I guess you answer me that it probably wouldn't make any big difference, but I'm still asking for the case you would like to do some amendments after all.

joel-coffman commented 2 years ago

Good questions! Let me try my best to respond.

Maybe just 2 questions to the delimiters set - currently it is:

|+=!@#$^&`~-_:

I have read that the only character which cannot be used is *. So I wonder if you don't want to add some other characters like %, ;, / or even ?, "just for completeness".

I wasn't sure where to find an exhaustive list of the possible delimiters. We may overlook at least one no matter how hard we try. :-/

Maybe string.punctuation (excluding, of course, *) would be a better choice than trying to enumerate them all here? Of course, even that set omits digits (0, 1, ..., 9) so I think it really comes down to 1) how paranoid we should be trying to find a valid delimiter and 2) the potential for confusion when reading the output (e.g., \verb(a | b( is valid but just looks wrong). Given that we're trying to handle arbitrary code that can contain any character, I don't think a general solution is possible. With that said, I think string.punctuation is better than the current implementation, and I'm happy to update this merge request if you concur.

I also wonder about the order of the possible delimiters - do we set the order more or less randomly, or is it possible / does it pay off to do something like "(estimating) frequency analysis" for choosing "the best order"?

I thought it would be good to have the vertical bar (|) first to preserve the current behavior as long as the inline code doesn't contain a vertical bar itself. That is, a document created before this change is merged would be the same as a document created after this change is merged unless the inline code happens to contain a vertical bar.

Considering the programming languages that I know, I cannot think of a single ordering of the possible delimiters that works well for all of them. If you have ideas, then I'm certainly open to them.

pbodnar commented 2 years ago

@joel-coffman, sorry for the delay, I'm back again...

Firstly, thank you for your analysis. :)

Regarding the characters set, string.punctuation does seem like a good source of inspiration, although I would also vote for omitting any kind of brackets from the list and to keep an own constant for this.

Regarding characters order, I agree that | needs to go first and I don't have any special preference over the rest, except maybe just that:

joel-coffman commented 2 years ago

@joel-coffman, sorry for the delay, I'm back again...

Firstly, thank you for your analysis. :)

Regarding the characters set, string.punctuation does seem like a good source of inspiration, although I would also vote for omitting any kind of brackets from the list and to keep an own constant for this.

Regarding characters order, I agree that | needs to go first and I don't have any special preference over the rest, except maybe just that:

  • ! could go as the 2nd one (as it is quite similar to |)
  • either order the rest of the characters like they do it in string.punctuation, or order them according to their location on the standard English keyboard. I would probably leave it up to you to decide. ;)

Addressed by 5d9b154. The renderer now has a constant for the delimiters, which can be overridden to omit braces, brackets, etc. In short, everything possible is tried, and if the user doesn't like the exhaustive set of possible delimiters, then they can restrict the set however they please.

joel-coffman commented 2 years ago

@joel-coffman, thanks a lot. I haven't expected you will choose the harder way, but I think it is basically just fine, so I would accept it. Although can you please have a look at my small improvements suggestions firstly? :)

Great suggestions, @pbodnar! All are addressed by b37787f.

pbodnar commented 2 years ago

Great suggestions, @pbodnar! All are addressed by b37787f.

Nearly there, thank you! :) I think I will merge this by rebasing (to keep the history simple), I would ask you just for 2 small cleanups before that:

  1. Remove the now-redundant import string in "test\test_latex_renderer.py".
  2. Squash the last "cleanup" commits into the 2nd commit Try all possible delimiters for \verb, while changing the summary line to for example Try all possible delimiters for LaTeX \verb.

OK for you?

pbodnar commented 2 years ago

Addendum: It also helps to have the issue number at the end of the 1st commit message line, like this: Ensure LaTeX renderer uses valid \verb delimiter (#149).

joel-coffman commented 2 years ago

Good catch on the unused import!

I think that these issues / requests have been addressed, but let me know if I've overlooked anything.

pbodnar commented 2 years ago

@joel-coffman, thank you. Strangely enough, the Compare button here doesn't show any differences, yet the import string seems to be correctly removed - so regarding the content, I think it is perfect now. :)

Only regarding the commit messages' summaries, I hope I don't bother you too much about this ;), I would prefer something like this (to sum up my previous suggestions):

joel-coffman commented 2 years ago

No problem, @pbodnar! I've updated the second commit message to include the issue number in the summary. Please let me know if you'd like anything else related to this change.

pbodnar commented 2 years ago

@joel-coffman, I was hoping for also including the word LaTeX. :)

joel-coffman commented 2 years ago

Ah, sorry! I missed that when I read your comment. Anyway, the latest rebase includes "LaTeX" in the second commit message summary.

pbodnar commented 2 years ago

All OK now, thank you! :)