matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License
1.17k stars 140 forks source link

be more selective about escaping special characters #122

Closed chrispy-snps closed 6 days ago

chrispy-snps commented 7 months ago

This is a refinement of #118 (thanks @jsm28!).

The current solution escapes every instance of every special character. Although conservative, this can lead to unnecessary escaping. For example,

>>> from markdownify import markdownify as md

>>> md('Pick a color (just 1) to use.')
'Pick a color (just 1\\) to use.'

>>> md('Pick a color, just 1. Write it down.')
'Pick a color, just 1\\. Write it down.'

>>> md('1 + 1 = 2')
'1 \\+ 1 \\= 2'

>>> md('start ----> end')
'start \\-\\-\\-\\-\\> end'

In our use case, our input content is technical documentation (many special characters) and the content is subsequently edited by humans, so it is desirable to minimize unnecessary escaping.

This pull request seeks to strike a balance between the following:

The tests cover a variety of required and unnecessary escaping cases, which can hopefully avoid any future regressions in escaping behavior.

This approach is not foolproof. Markdownify processes each text fragment in isolation, and thus the beginning of a particular string might not be the beginning of an output line. As a result, patterns are not applied across text fragment boundaries (such as adjacent <span> elements). Handling this probably requires a larger rework of the text processing code.

I also noticed that the original code had def text_misc() instead of def test_misc(), which caused the tests never to run.

chrispy-snps commented 7 months ago

@jsm28, @AlexVonB - what are your thoughts? I understand this is not as comprehensive as the "escape-everything" approach, but I'm hoping it strikes a balance between catching realistic scenarios while avoiding unnecessary scenarios. If issues are filed for false negatives or false positives, the approach can be refined (and the test set updated accordingly).

jsm28 commented 7 months ago

Being at start of line isn't a very safe test at this stage either, because wrapping takes place later and can move text to start of line that only has significance there, so anything after a space should effectively be considered to be at start of line when wrapping, and this affects most of the examples given in this issue. I was considering making the escaping smarter anyway because there are lots of examples in my use case (converting 547 issues filed against past versions of the C standard and related documents for a new issue tracker) where it can safely be determined that the escaping is unnecessary in any position, such as 1.2.3.4 or X-Y. (My preprocessing removes all <span> - or converts it into other markup after processing a subset of the CSS found in some of the input - before the input gets passed to markdownify and I don't think fragment boundaries can be an issue in my case.)

(Yes, wrapping is significantly buggy at present, in particular losing <br> in the input; that's on my list of issues to fix.)

chrispy-snps commented 7 months ago

@jsm28 - aww dangit, I forgot about wrapping. I wonder if any of the work in this pull request is salvageable?

What are your thoughts on how to make escaping smarter? Perhaps we could escape everything, then apply a set of regular expressions after wrapping that removes unnecessary escapes (although this doesn't give me the warmest fuzzy feeling).

Possibly related, I've wanted to figure out a fix for adjacent <p> elements not getting a space between them in table cells:

>>> from markdownify import markdownify as md
>>> html_doc = """
... <table>
...   <tr>
...     <td>
...       <p>abc</p><p>def</p>
...     </td>
...   </tr>
... </table>
... """
>>> md(html_doc)
'\n\n\n| abcdef |\n| --- |\n\n\n'
jsm28 commented 7 months ago

FIxing test function naming (plus any consequent fixes needed for tests to pass) is definitely salvageable!

I was thinking mainly about . and - as the main cases I saw where escaping obviously wasn't needed (my input data has a lot of subclause references in the form 6.1.2.3). For ., for example, restricting escaping to the case where . is preceded by digits that are preceded by either whitespace or start of string, and is followed by either whitespace or end of string, should suffice. (For the <span> case, add the case where . is at the start of the string and so we don't know if there might be digits preceding it.) For -, a sequence of one or more consecutive - should only need escaping if preceded by whitespace or start of string, and followed by whitespace or end of string (and this should work even for the <span> case).

AlexVonB commented 6 days ago

Hey! This was probably closed by #149 . Thanks for your time and patience!