thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License
1.77k stars, 205 forks

Added option to preserve comments #179

Closed straube closed 4 years ago

straube commented 4 years ago

This PR adds a new option to preserve comments. The current behavior, stripping all comments, remains the default.

As suggested by @Rarst on #177 the available options for preserve_comments are:

The option may be passed to the converter constructor like this:

$markdown = new HtmlConverter(array(
    'preserve_comments' => true,
));

Rarst commented 4 years ago

Cheers!

I didn't get to test this before the prompt merge, but I still get an errant backslash before the preserved comment, popping up from somewhere (same as with my custom implementation in the ticket).

$input = 'Right after this sentence should be a "continue reading" button of some sort on list pages of themes that show full content. It won\'t show on single pages or on themes showing excerpts.<br />
<br />
<!--more--><br />
<br />
And this content is after the more tag.';

var_dump( $input, (new HtmlConverter(['preserve_comments'=>['more']]))->convert($input) );

string(252) "Right after this sentence should be a "continue reading" button of some sort on list pages of themes that show full content. It won't show on single
pages or on themes showing excerpts.

 \<!--more-->

 And this content is after the more tag."

Could not figure out where it is coming from. :(

straube commented 4 years ago

@Rarst I found the issue. It's in ParagraphConverter, which escapes comments through its escapeOtherCharacters method. I'm not sure why this converter escapes both < and > with backslashes, since the Markdown specification doesn't list them among its backslash-escapable characters. Do you or @colinodell know why they are being escaped?
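
To illustrate the effect, here is a minimal standalone sketch. The functions `escapeLeadingComment` and `escapeParagraph` are hypothetical stand-ins, not the library's actual `escapeOtherCharacters` code. The sketch only assumes that a line starting with `<!--` gets a backslash prefix during paragraph conversion, which would produce the `\<!--more-->` seen in the output above.

```php
<?php
// Hypothetical stand-in for the escaping step discussed above; NOT the
// library's real escapeOtherCharacters(). Assumption: a line that starts
// with "<!--" is prefixed with a backslash so it is not read as raw HTML.
function escapeLeadingComment(string $line): string
{
    $trimmed = ltrim($line);
    if (strpos($trimmed, '<!--') === 0) {
        return '\\' . $trimmed;
    }
    return $line;
}

// Apply the escape line by line, as a paragraph converter might.
function escapeParagraph(string $text): string
{
    $lines = explode("\n", $text);
    return implode("\n", array_map('escapeLeadingComment', $lines));
}

$paragraph = "Before the tag.\n<!--more-->\nAfter the tag.";
echo escapeParagraph($paragraph), "\n";
```

Under that assumption, a comment that was deliberately preserved is escaped just like comment-looking text, because the escaping step sees only a string and cannot tell the two apart.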

Rarst commented 4 years ago

Afraid I am not familiar enough with the process, just a downstream user...

colinodell commented 4 years ago

I believe they're escaped (or at least attempted to be escaped) because of input like this:

<p>Foo &lt;-- test --&gt; bar</p>

I'm not sure there's a good solution to this, at least not with the current architecture of this library :-/ We'd probably need to implement an intermediate AST to keep track of things like this when converting the parent elements, so we can remember whether or not those particular comments need to be preserved.
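
The distinction that gets lost can be seen with PHP's built-in DOM extension. The sketch below is only an illustration, not part of this library: `describeNodes` is a hypothetical helper, and the `<!--more-->` comment is added to the example input for contrast. It shows that the parsed tree still knows which node is a real comment and which is text containing angle brackets, which is roughly the kind of information an intermediate AST could carry forward.

```php
<?php
// Illustration only: uses ext-dom directly, not html-to-markdown.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Foo &lt;-- test --&gt; bar</p><!--more--></div>');

// Hypothetical helper: collect [type, content] pairs for text and comment nodes.
function describeNodes(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMComment) {
            $out[] = ['comment', $child->data];
        } elseif ($child instanceof DOMText) {
            $out[] = ['text', $child->data];
        } else {
            describeNodes($child, $out);
        }
    }
}

$nodes = [];
describeNodes($doc->getElementsByTagName('div')->item(0), $nodes);

foreach ($nodes as [$type, $content]) {
    echo $type, ': ', $content, "\n";
}
```

Once both nodes have been flattened into a single Markdown string, a later pass that sees a `<` has no record of which kind of node it came from.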

I'm certainly open to any other approaches that don't require a rewrite :upside_down_face: