thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License
1.77k stars 204 forks

Escape characters incorrectly added in front of valid markdown bullets. #221

Closed deetergp closed 2 years ago

deetergp commented 2 years ago

Version(s) affected

5.1

Description

Given a string that combines HTML line breaks with Markdown bullets, the bullets are escaped when the HTML is converted. For example:

String

"List of stuff:<br />- List item one<br />- List <a href="http://foo.com" target="_blank" rel="noreferrer noopener">item</a> two<br />* List item [three] with braces"

Expected Result

List of stuff:
- List item one
- List [item](http://foo.com) two
* List item [three] with braces

Actual Result

List of stuff:
\- List item one
\- List [item](http://foo.com) two
\* List item \[three\] with braces

How to reproduce

See description.

deetergp commented 2 years ago

Hmm, so the change is happening here but I don't really understand why. Happy to propose a fix once I understand why that method exists at all.

pandymic commented 2 years ago

I don't believe this is an issue. The purpose of this library is to function as a general purpose conversion from one specific data type (HTML) to markdown. In your example you have an input string that has a mixed set of HTML code as well as markdown code with the expectation that the converter will be aware of this and handle each data type accordingly. This is a misconception.

The purpose of the method you've identified is to avoid formatting problems when converting what it sees as basic paragraph text. The resulting strings may contain characters which could be erroneously parsed as Markdown by an interpreter further up the stack. Since its job is to return basic paragraph text, it intentionally escapes those characters.

Correct usage would have your list presented as well-formed HTML using <ul> and <li> tags for the converter to then turn into appropriate markdown.
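
As a rough sketch of that approach, assuming the input always uses `<br />` separators and `- ` or `* ` prefixes as in the report, a pre-processing step could rewrite the pseudo-bullets into a real `<ul>` before conversion. The helper name `bulletsToHtmlList` is hypothetical and is not part of the library:

```php
<?php

/**
 * Hypothetical helper (not part of html-to-markdown): rewrites
 * "<br />- item" / "<br />* item" segments into a real <ul> list.
 * Assumes bullets only ever appear directly after a <br> tag.
 */
function bulletsToHtmlList(string $html): string
{
    // Split on <br>, <br/> and <br /> (case-insensitive).
    $segments = preg_split('/<br\s*\/?>/i', $html);

    $out = '';
    $inList = false;

    foreach ($segments as $i => $segment) {
        // A segment counts as a bullet when it starts with "- " or "* ".
        if (preg_match('/^\s*[-*]\s+(.*)$/s', $segment, $m)) {
            if (!$inList) {
                $out .= '<ul>';
                $inList = true;
            }
            $out .= '<li>' . $m[1] . '</li>';
            continue;
        }

        if ($inList) {
            // A non-bullet segment ends the current list.
            $out .= '</ul>';
            $inList = false;
        } elseif ($i > 0) {
            // Restore the line break between two plain-text segments.
            $out .= '<br />';
        }
        $out .= $segment;
    }

    // Close a list that runs to the end of the input.
    if ($inList) {
        $out .= '</ul>';
    }

    return $out;
}

$input = 'List of stuff:<br />- List item one<br />* List item two';
echo bulletsToHtmlList($input), PHP_EOL;
// List of stuff:<ul><li>List item one</li><li>List item two</li></ul>
```

The returned string can then be passed to `$converter->convert()` as usual. Characters inside the list items, such as the square brackets in the report, would presumably still be escaped, since the converter still treats them as plain text.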

You can perform your conversion in multiple passes to get around this problem. First, run your code as-is to get the Actual Result. Then run that string through another method that locates the desired escaped characters and un-escapes them as needed until it returns your Expected Result.

For example (this is unexecuted pseudo-code):

// run converter
$markdown = $converter->convert( $html );

// regex to replace the first escaped asterisk or hyphen character on each line.
// alter or expand as needed for other characters (numbered lists, etc.)
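// the regex has to match a literal backslash, hence the four backslashes in the single-quoted PHP string.
$markdown = preg_replace( '/^\\\\([-*]\s)/m', '\1', $markdown );

colinodell commented 2 years ago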

Sorry for not seeing the original issue report! I agree with @pandymic, this is expected behavior.

Ideally, the resulting Markdown should return the same HTML when you run it through a Markdown parser. The escape characters are necessary to make this happen, and are valid Markdown, and thus this is correct. Plug your actual and expected results into https://spec.commonmark.org/dingus/ to see :)