Open isaring opened 2 years ago
Hi isaring,
interestingly, Beautiful Soup, the HTML parser we use, parses your code as <p>1<br/>2<br/>3<br>4<br/>5</br></p>, enclosing 4 and 5 in a nonexistent br tag pair. I have no idea why this happens. This would be a bug to report on the Beautiful Soup Launchpad: https://bugs.launchpad.net/beautifulsoup/ I'm afraid that we cannot really do anything for you in this case :(
Keep us updated if you learn something new! Best
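One data point that can be checked with the standard library alone: Beautiful Soup's 'html.parser' builder sits on top of Python's html.parser.HTMLParser, and that class reports <br> and <br/> through different callbacks. The sketch below only logs those callbacks; the names TagEventLogger and tag_events are made up for illustration, and whether this difference is what leads to the stray </br> is an assumption, not something confirmed in this thread.

```python
from html.parser import HTMLParser


class TagEventLogger(HTMLParser):
    """Hypothetical helper: records which tag callback fires for each tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for tags written like <br>
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags written like <br/>
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        # Called for explicit closing tags such as </p>
        self.events.append(("end", tag))


def tag_events(html):
    """Return the list of (callback, tag) pairs seen while parsing html."""
    logger = TagEventLogger()
    logger.feed(html)
    logger.close()
    return logger.events


if __name__ == "__main__":
    print(tag_events("1<br>2<br/>3"))
```

Any code built on HTMLParser therefore sees the two spellings as distinct events and has to reconcile them itself.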
"<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"
Did you notice? There's an extra "<" before the fives.
Oops, that's just a mistyping of my own! Unfortunately, it has no effect on the result.
@isaring: you can convert your markdown yourself using the html5lib parser and use markdownify.MarkdownConverter to convert your html (see https://replit.com/@mathieud/DependentThinDowngrade#main.py).
import bs4
from markdownify import markdownify, MarkdownConverter
assert bs4.__version__ == '4.9.0' # using lowest possible version
html = "<p>11111<br>22222<br>33333<br/>44444<br><55555</p>"
# Using html.parser
soup = bs4.BeautifulSoup(html, 'html.parser')
assert "<br>" in str(soup)
assert markdownify(html) == '11111 \n22222 \n33333 \n\n\n'
# Using html5lib parser
soup = bs4.BeautifulSoup(html, 'html5lib')
assert "<br>" not in str(soup)
assert MarkdownConverter().convert_soup(soup) == '11111 \n22222 \n33333 \n44444 \n<55555\n\n'
@AlexVonB: So it's not a bs4 issue, it's a parser problem. Should the parser at https://github.com/matthewwithanm/python-markdownify/blob/develop/markdownify/__init__.py#L96 be changed to html5lib for the next release of markdownify?
There indeed seems to be some kind of bug in the html.parser parser. I think there is a heuristic that tries to identify the <br> / <br/> convention of the content, because if only one style is used, then it seems to be parsed properly:
>>> print(bs4.BeautifulSoup('1<br>2<br>3', 'html.parser'))
1<br/>2<br/>3
>>> print(bs4.BeautifulSoup('1<br/>2<br/>3', 'html.parser'))
1<br/>2<br/>3
But if a mix is used, then html.parser seems to get confused:
>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html.parser'))
1<br/>2<br>3</br>
whereas the other parsers do not:
>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'html5lib'))
<html><head></head><body>1<br/>2<br/>3</body></html>
                         ^^^^^^^^^^^^^
>>> print(bs4.BeautifulSoup('1<br>2<br/>3', 'lxml'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^
Beautiful Soup tries to choose the best available HTML parser by default:
>>> print(bs4.BeautifulSoup('1<br>2<br/>3'))
<html><body><p>1<br/>2<br/>3</p></body></html>
               ^^^^^^^^^^^^^
It might be best to rely on Beautiful Soup's default parser selection, but implement a markdownify option that allows a particular parser to be requested explicitly.
Hi,
I'm facing an issue with line-break tags when they are written as <br/> instead of <br>. Consider this simple example:
Expected:
Actual:
My workaround is to call .replace('<br/>','<br>') on the input, but it's a bit of a pity... Could you fix this in a future release?
Regards,
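For reference, the .replace('<br/>','<br>') workaround above can be made slightly more tolerant with a regular expression that also catches variants such as <br /> or <BR/>. This is a minimal sketch using only the standard library; the function name normalize_br is made up for illustration, and a regex applied to raw HTML will also rewrite matches inside comments or script content, so it is a stopgap rather than a proper fix.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants (no attributes).
_BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)


def normalize_br(html):
    """Hypothetical helper: rewrite every bare br variant to a plain <br>."""
    return _BR_PATTERN.sub("<br>", html)


if __name__ == "__main__":
    print(normalize_br("1<br>2<br/>3<BR />4"))
```

The normalized string can then be passed to markdownify as usual, so the parser only ever sees one spelling of the tag.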