matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

Issue with line-breaks tags #58

Open isaring opened 2 years ago

isaring commented 2 years ago

Hi,

I'm facing an issue with line-break tags when they are written as <br/> instead of <br>.

Considering this simple example:

>>> import markdownify
>>> markdownify.markdownify("<p>11111<br>22222<br>33333<br/>44444<br><55555</p>", heading_style=markdownify.ATX)

Expected:

'11111  \n22222  \n33333  \n44444  \n55555 \n\n'

Actual:

'11111  \n22222  \n33333  \n\n\n'

My workaround is to call .replace('<br/>', '<br>') on the input first, but that's a bit of a pity...
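A slightly more general form of that workaround could look like the sketch below. It only uses the standard library; the helper name `normalize_br` is hypothetical and not part of markdownify, and the pattern assumes the tags carry no attributes.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
# Assumption: the tags have no attributes (e.g. <br class="x"/> is not handled).
_BR_PATTERN = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def normalize_br(html: str) -> str:
    """Rewrite every line-break tag to the plain <br> spelling."""
    return _BR_PATTERN.sub("<br>", html)


if __name__ == "__main__":
    print(normalize_br("<p>11111<br>22222<br/>33333<br />44444</p>"))
```

The normalized string can then be passed to markdownify.markdownify as before.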

Could you fix this in a future release?

Regards,

AlexVonB commented 2 years ago

Hi isaring,

Interestingly, Beautiful Soup, the HTML parser we use, parses your code as <p>1<br/>2<br/>3<br>4<br/>5</br></p>, enclosing 4 and 5 in a nonexistent br tag pair. I have no idea why this happens. This looks like a bug that should be reported on the Beautiful Soup Launchpad: https://bugs.launchpad.net/beautifulsoup/ . I'm afraid we cannot really do anything for you in this case :(

Keep us updated if you learn something new! Best

j77h commented 2 years ago

"<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

Did you notice? There's an extra "<" before the fives.

isaring commented 2 years ago

Oops, that's just a typo of my own! Unfortunately, removing it has no effect on the result.

LaundroMat commented 2 years ago

@isaring: you can parse the HTML yourself with the html5lib parser and then use markdownify.MarkdownConverter to convert the resulting soup (see https://replit.com/@mathieud/DependentThinDowngrade#main.py).

import bs4
from markdownify import markdownify, MarkdownConverter

assert bs4.__version__ == '4.9.0'  # using lowest possible version

html = "<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"

# Using html.parser
soup = bs4.BeautifulSoup(html, 'html.parser')

assert "<br>" in str(soup)
assert markdownify(html) == '11111  \n22222  \n33333  \n\n\n'

# Using html5lib parser
soup = bs4.BeautifulSoup(html, 'html5lib')

assert "<br>" not in str(soup)
assert MarkdownConverter().convert_soup(soup) == '11111  \n22222  \n33333  \n44444  \n<55555\n\n'

@AlexVonB: So it's not a bs4 issue, it's a parser problem. So should the parser at https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L96 be upgraded to html5lib for the next release of markdownify?

chrispy-snps commented 10 months ago

There indeed seems to be some kind of bug in the html.parser parser. I think there is a heuristic that tries to identify the <br>/<br/> convention of the content, because if only one style is used, then it seems to be parsed properly:

>>> print(bs4.BeautifulSoup('1<br>2<br>3', 'html.parser'))
1<br/>2<br/>3
>>> print(bs4.BeautifulSoup('1<br/>2<br/>3', 'html.parser'))
1<br/>2<br/>3

But if a mix is used, then html.parser seems to get confused:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html.parser'))
1<br/>2<br>3</br>

whereas the other parsers do not:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html5lib'))
<html><head></head><body>1<br/>2<br/>3</body></html>
                         ^^^^^^^^^^^^^

>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'lxml'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^
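One thing that may be relevant here (an assumption on my part, not something confirmed in this thread) is that Python's built-in html.parser.HTMLParser, which the 'html.parser' backend builds on, reports the two spellings through different callbacks. A minimal standard-library sketch, with the hypothetical names `TagEventRecorder` and `record_tag_events`:

```python
from html.parser import HTMLParser


class TagEventRecorder(HTMLParser):
    """Record which HTMLParser callback fires for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for a plain opening tag such as <br>.
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for an XHTML-style self-closing tag such as <br/>.
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record_tag_events(html):
    recorder = TagEventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    print(record_tag_events("1<br>2<br/>3"))
```

So a tree builder sitting on top of this parser sees <br> and <br/> as different kinds of events and has to reconcile them itself, which is at least a plausible place for the mixed-style confusion to come from.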

Beautiful Soup tries to choose the best available HTML parser by default:

>>> print(bs4.BeautifulSoup('1<br>2<br/>3'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^

It might be best for Markdownify to rely on Beautiful Soup's default parser selection, but also to offer an option that lets a particular parser be requested explicitly.
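To make that suggestion concrete, here is a rough sketch of the selection logic such an option could use. It is not Markdownify's or Beautiful Soup's actual code: `choose_parser` and `PARSER_MODULES` are hypothetical names, and the preference order (lxml, then html5lib, then html.parser) follows the ranking described in the Beautiful Soup documentation.

```python
from importlib.util import find_spec

# Parser name -> module that must be importable for it to work.
# Order reflects the assumed preference: lxml, then html5lib, then the built-in parser.
PARSER_MODULES = {
    "lxml": "lxml",
    "html5lib": "html5lib",
    "html.parser": "html.parser",
}


def choose_parser(requested=None):
    """Return the parser name to hand to Beautiful Soup.

    If `requested` is given, honor it or fail loudly; otherwise fall back
    to the first parser in the preference order that is installed.
    """
    if requested is not None:
        if requested not in PARSER_MODULES:
            raise ValueError(f"unknown parser: {requested!r}")
        if find_spec(PARSER_MODULES[requested]) is None:
            raise ValueError(f"parser {requested!r} is not installed")
        return requested
    for name, module in PARSER_MODULES.items():
        if find_spec(module) is not None:
            return name
    # Not reachable in practice: html.parser ships with Python.
    raise RuntimeError("no HTML parser available")


if __name__ == "__main__":
    print(choose_parser())               # best parser installed on this machine
    print(choose_parser("html.parser"))  # explicit request
```

The returned name would then be passed as the second argument to bs4.BeautifulSoup, so the default behavior stays unchanged while users hit by the mixed <br>/<br/> problem can opt into another parser.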