thephpleague / html-to-markdown

Convert HTML to Markdown with PHP
MIT License

Broken layout after conversion involving lists #195

Closed: polcats closed this issue 3 years ago

polcats commented 4 years ago

Input

text
<ul>
<li>a</li>
<li>b</li>
<li>c</li>
</ul>
text
<ol>
<li>2</li>
<li>3</li>
<li>4</li>
</ol>
text

Output

 text

- a
- b
- c

 text 1. 2
2. 3
3. 4

 text

The ordered list does not start on a new line: its first item is appended to the preceding text ("text 1. 2").

svenhaveman commented 3 years ago

This looks like internal behaviour of PHP's DOMDocument, not an issue with this package.

When I let the script output the DOMDocument that it generates from your HTML, I get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="UTF-8"><html><body><p>text
</p><ul><li>a</li>
<li>b</li>
<li>c</li>
</ul>
text
<ol><li>2</li>
<li>3</li>
<li>4</li>
</ol>
text
</body></html>

The paragraph tag is handled by ParagraphConverter.php (instead of TextConverter), which adds the extra line breaks.

This only happens for the first line. If you add further lists to the page, they are handled normally, as expected.
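To check what DOMDocument makes of the input, the dump can be reproduced with PHP's bundled DOM extension alone. This is a minimal sketch, not the library's own code: the `<?xml encoding="UTF-8">` prefix is copied from the dump above, the variable names are illustrative, and the exact tree may differ between libxml2 versions.

```php
<?php
// The HTML from the original report, with bare text before and between the lists.
$html = "text\n<ul>\n<li>a</li>\n<li>b</li>\n<li>c</li>\n</ul>\n"
      . "text\n<ol>\n<li>2</li>\n<li>3</li>\n<li>4</li>\n</ol>\ntext\n";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Prefix taken from the dump in this thread; it hints the encoding to the parser.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Serialise the tree DOMDocument actually built from the fragment.
$dump = $doc->saveHTML();
echo $dump;

// List the top-level children of <body> to see which nodes wrap the bare text.
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $node) {
        echo $node->nodeName, "\n";
    }
}
```

Comparing the node names printed at the end with the dump shows which pieces of bare text ended up inside an element and which stayed plain text nodes.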

I would suggest always using a valid HTML tag as the first element when using html-to-markdown. Perhaps we could add this to the manual?

polcats commented 3 years ago

Thank you for clarifying this behavior! 💯 I don't work with PHP much, so I was clueless. I will be closing this issue.

markopy commented 3 years ago

I think this is actually a bug. It doesn't really matter how DOMDocument outputs the HTML. Browsers (absent CSS) will always render the first list item on a new line, which is the sensible thing to do.

The reason is that all block-level elements (which ol and ul are) always start on a new line, per https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements:

"A block-level element always starts on a new line and takes up the full width available (stretches out to the left and right as far as it can)."

html-to-markdown should output a newline before and after every block-level element if there isn't one already.
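The rule proposed here can be sketched in a few lines. This is only an illustration under stated assumptions: `appendBlock()` is a hypothetical helper, not part of html-to-markdown, and it works on plain strings rather than on the library's converter classes.

```php
<?php
/**
 * Hypothetical helper: append the Markdown for a block-level element to the
 * output so far, adding a newline before and after it only where one is missing.
 */
function appendBlock(string $output, string $block): string
{
    // Start the block on a new line unless the output is empty
    // or already ends with a newline.
    if ($output !== '' && substr($output, -1) !== "\n") {
        $output .= "\n";
    }

    $output .= $block;

    // End the block with a newline unless it already has one.
    if (substr($output, -1) !== "\n") {
        $output .= "\n";
    }

    return $output;
}

// The failing case from the report: text directly followed by an ordered list.
$markdown = appendBlock(' text', "1. 2\n2. 3\n3. 4");
echo $markdown;
```

With this rule the first list item no longer shares a line with the preceding text, and output that already ends in a newline does not gain an extra blank line.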