tim-gromeyer / html2md

Transform your HTML into clean, easy-to-read markdown with html2md.
https://tim-gromeyer.github.io/html2md/
MIT License

Breaks not converted #94

Closed petko closed 3 months ago

petko commented 5 months ago

Describe the bug: I have a simple HTML document with <br> tags at the end of the lines, and they are not converted properly.

To Reproduce Run html2md.exe breaks.html -p with the following HTML document:

<!DOCTYPE html>
<html lang="en">
  <head><meta charset="utf-8"></head>
<body>
  line 1<br>
  line 2<br>
</body>
</html>

You will get:

line 1
<br>

line 2
<br>

Expected behavior Should convert <br> to a new line instead.

DWesl commented 3 months ago

Are you expecting the output to be

line 1
line 2

to be turned into html as

<body>
<p>line 1<br/>line 2</p>
</body>

or

line 1

line 2

to be converted to

<body>
<p>line 1</p>
<p>line 2</p>
</body>

perhaps with some CSS to increase spacing between paragraphs?

I think the first html would only show up if the markdown was

line 1<br/>
line 2

GitHub's Markdown parser tends to honor line breaks in the Markdown source, but given that the CommonMark parser has special syntax for specifying line breaks I don't think that's actually part of Markdown itself. That is, the only portable ways to force Markdown to put a newline between the "1" and the following "line" are to put them in separate paragraphs (with a blank line between them) or to use a <br/> HTML tag between them.

Given there are no <p> tags in your input, I think the output is a decent guess at the intended meaning of the provided HTML.

Out of curiosity, does the program behave more like what you expect if you replace the <br> tags with <p> tags (or a </p><p> sequence, if you don't mind also inserting a <p> after <body> and a </p> before </body>)?

petko commented 3 months ago

Are you expecting the output to be

line 1
line 2

Yes, that is what I expect with this HTML markup.

P.S.: My app does not generate such HTML; it is just something that I was testing.

rsyring commented 3 months ago

FWIW: I think the <br> should be a newline wherever it's encountered. The current implementation seems to ignore it in a paragraph:

html2 = """
<p>Contact: <br/> Isabella Bobillo <br/> Fish Consulting <br/> 954-893-9150 <br/>ibobillo@fish-consulting.com</p>
"""

print(pyhtml2md.convert(html2))

and the output is:

Contact:  Isabella Bobillo  Fish Consulting  954-893-9150 ibobillo@fish-consulting.com

But I'd expect it to be:

Contact:  
Isabella Bobillo  
Fish Consulting   
954-893-9150  
ibobillo@fish-consulting.com

Note that there are two spaces (per Markdown spec) at the end of each line of that output except the last.
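To make the expected mapping concrete, here is a minimal sketch of turning `<br>` variants into Markdown hard line breaks (two trailing spaces plus a newline). It is not how html2md does it internally; `br_to_hard_breaks` is a hypothetical helper, it works on raw text with a regex rather than a real HTML parser, and it assumes the `<br>` tags carry no attributes.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case, together with the
# spaces/tabs around the tag and at most one newline that follows it.
_BR = re.compile(r"[ \t]*<br\s*/?>[ \t]*\n?[ \t]*", re.IGNORECASE)


def br_to_hard_breaks(fragment: str) -> str:
    """Replace every <br> variant with a Markdown hard line break.

    A hard line break is written as two trailing spaces followed by a
    newline. Trailing whitespace is stripped at the end, so a <br> at
    the very end of the fragment does not leave a dangling break.
    """
    return _BR.sub("  \n", fragment).rstrip()


if __name__ == "__main__":
    text = (
        "Contact: <br/> Isabella Bobillo <br/> Fish Consulting <br/> "
        "954-893-9150 <br/>ibobillo@fish-consulting.com"
    )
    print(br_to_hard_breaks(text))
```

Applied to the paragraph content above, this yields one line per `<br/>`, each ending in two spaces except the last.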

Regarding the OP's:

  line 1<br>
  line 2<br>

I agree with the last comment about what is expected.

tim-gromeyer commented 3 months ago

Yes, you are right, html2md seems to have problems with line breaks written in self-closing form (<br/>). With only <br> it seems to work. Fixing it...
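Until a fixed release is available, one possible workaround is to rewrite the self-closing form to a plain `<br>` before converting. This is only a sketch built on the observation above that `<br>` seems to work: `normalize_br` is a hypothetical helper, not part of html2md, and it assumes the tags carry no attributes.

```python
import re

# Matches the self-closing forms <br/> and <br /> in any letter case.
_SELF_CLOSING_BR = re.compile(r"<br\s*/>", re.IGNORECASE)


def normalize_br(html: str) -> str:
    """Rewrite <br/> and <br /> to <br>; existing <br> tags are left untouched."""
    return _SELF_CLOSING_BR.sub("<br>", html)


if __name__ == "__main__":
    print(normalize_br("<p>Contact: <br/> Isabella Bobillo <br /> Fish Consulting</p>"))
    # The normalized string can then be passed to the converter, e.g.
    # pyhtml2md.convert(normalize_br(html)), if pyhtml2md is installed.
```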

tim-gromeyer commented 3 months ago

Should be fixed with the latest commit, will create a new release soon...