thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License

#212: Find and move any misplaced comment nodes #214

Closed by bigsweater 2 years ago

bigsweater commented 2 years ago

This is a fix for https://github.com/thephpleague/html-to-markdown/issues/212: if you pass HTML that begins with a comment (like <!-- uh oh --><p>hi</p>) to HtmlConverter->convert(), the resulting Markdown is <!-- uh oh --><html><body>hi\n (the comment and the html and body tags leak into the output).

This happens because DOMDocument->loadHTML puts that leading comment at the root of the document, outside the html and body tags. The sanitize method only removes the html and body tags if they sit at position 0 of the Markdown string. With the comment at the root of the document, the tags always start at a position > 0, so they are never removed (and neither is the leading comment).
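To see the behavior described above in isolation, here is a minimal sketch that only uses PHP's built-in DOMDocument (not the library's own code). The variable names are illustrative, and the exact list of root nodes may vary with the libxml version, so no output is shown.

```php
<?php
// Suppress libxml warnings about the HTML fragment; not essential to the demo.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML('<!-- uh oh --><p>hi</p>');

// Collect the names of the nodes that sit directly under the document root.
$rootNodeNames = [];
foreach ($document->childNodes as $node) {
    $rootNodeNames[] = $node->nodeName;
}

// A '#comment' entry here means the comment was placed outside <html>/<body>.
print_r($rootNodeNames);
```

If '#comment' appears in that list, the comment is a sibling of the html element rather than a child of body, which matches the explanation above.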

So this adds a step to the createDOMDocument method: it finds any comments at the root of the DOMDocument and prepends them to the <body> element.
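The step described above could look roughly like the following. This is a sketch under assumptions, not the code from this PR: moveRootCommentsIntoBody is a hypothetical helper name, and it treats every root-level comment the same way, including any that might follow the html element.

```php
<?php
// Hypothetical helper (not the library's API): move comment nodes found at the
// document root so that they become the first children of <body>.
function moveRootCommentsIntoBody(DOMDocument $document): void
{
    $body = $document->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return; // Nothing to move the comments into.
    }

    // childNodes is a live list, so snapshot the comments before moving them.
    $rootComments = [];
    foreach ($document->childNodes as $node) {
        if ($node->nodeType === XML_COMMENT_NODE) {
            $rootComments[] = $node;
        }
    }

    // Prepend in reverse so the comments keep their original relative order.
    foreach (array_reverse($rootComments) as $comment) {
        $body->insertBefore($comment, $body->firstChild);
    }
}

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML('<!-- uh oh --><p>hi</p>');
moveRootCommentsIntoBody($document);

$body = $document->getElementsByTagName('body')->item(0);
echo $document->saveHTML($body), "\n";
```

Because insertBefore moves a node that already belongs to the document, the comment ends up inside body and no longer precedes the html element in the serialized output.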

Evidently DOMDocument has always behaved this way, so maybe this isn't the correct fix?