thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License

output code malformed #212

Closed 4meck closed 2 years ago

4meck commented 3 years ago
require 'vendor/autoload.php'; // Composer autoloader (assumed install path)

use League\HTMLToMarkdown\HtmlConverter;

$converter = new HtmlConverter(['header_style' => 'atx']);
$html = '
<!-- noindex -->
<p>bla-bla</p>
<h2><span style="font-family: helvetica; font-size: 12pt;">&nbsp;</span></h2>
<h2><br /><span style="font-family: helvetica; font-size: 12pt;">bla2</span></h2>
<!--/ noindex -->        
';
$md = $converter->convert($html);
die($md);

Output:

<!-- noindex --><html><body>bla-bla

## <span style="font-family: helvetica; font-size: 12pt;"> </span>

##   
<span style="font-family: helvetica; font-size: 12pt;">bla2</span>

The output is malformed:

Extra tags: <html><body>
Missing: <!--/ noindex -->

bigsweater commented 2 years ago

This seems to be related to how DOMDocument wraps the HTML string being passed in with html and body tags. DOMDocument actually puts the leading comment outside the html element, and HtmlConverter gets mixed up.

Psy Shell v0.10.9 (PHP 8.0.12 — cli) by Justin Hileman
>>> $d = new DOMDocument()
>>> $d->loadHTML('<!-- opening --><p>hi</p><!-- closing -->')
=> true
>>> $d->saveHTML()
=> """
   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n
   <!-- opening --><html><body><p>hi</p><!-- closing --></body></html>\n
   """

Including html and body tags in the string passed to loadHTML works, though.
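
To make the comparison concrete, here is a minimal sketch that round-trips the same fragment through DOMDocument twice, once bare and once wrapped in html and body tags. The helper name roundTripHtml is hypothetical and exists only for this illustration; it uses nothing beyond DOMDocument::loadHTML() and DOMDocument::saveHTML(). Based on the session above, the bare fragment should come back with the leading comment in front of the html element, while the wrapped one should keep it inside body.

```php
<?php
// Hypothetical helper for illustration: parse an HTML string and serialize it back.
function roundTripHtml(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them (demo only).
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return (string) $doc->saveHTML();
}

$fragment = '<!-- opening --><p>hi</p><!-- closing -->';

// Bare fragment: DOMDocument has to invent the html/body wrapper itself.
$bare = roundTripHtml($fragment);

// Wrapped fragment: the wrapper is supplied, so the comment stays inside body.
$wrapped = roundTripHtml('<html><body>' . $fragment . '</body></html>');

echo $bare, PHP_EOL, $wrapped, PHP_EOL;
```

Comparing the two printed documents shows where the leading comment ends up in each case.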

Is that a bug in DOMDocument, or has it always done this?

colinodell commented 2 years ago

Hi @bigsweater,

It looks like DOMDocument has always done that - see this example of your code running on multiple PHP versions: https://3v4l.org/7bC33

bigsweater commented 2 years ago

Nice, never used that service before. Thanks for looking.

Does it make sense, then, to have HtmlConverter deal with the unexpected structure or is there somewhere else this can be fixed?

Happy to take a swing at it myself!
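
One possible direction, sketched under assumptions rather than taken from the library: after loading, move any comments that DOMDocument left in front of the html element into the start of body, so later processing sees a single tree. The function moveLeadingCommentsIntoBody is hypothetical and is not part of league/html-to-markdown; it only uses standard DOM extension calls.

```php
<?php
// Hypothetical normalization step, not the library's actual fix.
function moveLeadingCommentsIntoBody(DOMDocument $doc): void
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return; // No body element, nothing to normalize.
    }

    // Collect comment nodes that sit before the root element.
    $leading = [];
    foreach ($doc->childNodes as $node) {
        if ($node instanceof DOMElement) {
            break; // Reached the html element.
        }
        if ($node instanceof DOMComment) {
            $leading[] = $node;
        }
    }

    // Re-insert them, in their original order, ahead of the body's existing content.
    $firstChild = $body->firstChild;
    foreach ($leading as $comment) {
        $body->insertBefore($comment, $firstChild);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<!-- opening --><p>hi</p><!-- closing -->');
moveLeadingCommentsIntoBody($doc);

$normalized = (string) $doc->saveHTML();
echo $normalized;
```

Whether a step like this belongs inside HtmlConverter or in the caller's own preprocessing is exactly the open question above; the sketch only shows that the tree can be repaired with plain DOM calls.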