Should not escape HTML entities

rennat / pynliner

Python CSS-to-inline-styles conversion tool for HTML using BeautifulSoup and cssutils


Should not escape HTML entities #43

Status: Closed (listen5k closed 7 years ago)

listen5k commented 8 years ago

First off, thanks for the great library. I recently upgraded the lib from 0.5.1 to 0.7.1, and it now breaks my emails.

How to reproduce

$ python -V
Python 2.7.6
$ pip freeze | grep pynliner
pynliner==0.7.1
$ python -c "import pynliner; print pynliner.fromString('<p>&nbsp;</p>')"

Expected

<p>&nbsp;</p>

Actual

<p> </p>

rennat commented 8 years ago

Thanks for using it and reporting bugs! We recently upgraded to BeautifulSoup4, and it changed some behavior here. But first, please confirm this issue, because I think you have a typo in your example code: &npsp; isn't a valid HTML entity. I'm assuming you meant &nbsp;, which does work as expected.

>>> import pynliner
>>> pynliner.fromString('<p>&nbsp;')
u'<p>\xa0</p>'
>>> print _
<p> </p>

listen5k commented 8 years ago

@rennat Correct. That was a typo. I updated the description.

I'm expecting that the lib would not convert HTML entities. Please elaborate if you disagree.

The reason I have to preserve &nbsp; in my HTML is that Outlook needs it inside an otherwise empty <td></td> to render a table correctly.

rennat commented 8 years ago

BeautifulSoup 4 doesn't offer the option to preserve HTML entities. It converts all of them to Unicode characters. See the pull request #45 for discussion.
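Since the parser hands back Unicode characters, one possible workaround is to post-process the output string and turn non-ASCII characters back into named entities. The following is only a minimal sketch under assumptions: it targets Python 3, uses the standard library alone, and is not something pynliner provides. `restore_named_entities` is a hypothetical helper name.

```python
from html.entities import codepoint2name


def restore_named_entities(markup: str) -> str:
    """Replace non-ASCII characters that have a named HTML entity.

    Hypothetical helper, not part of pynliner or BeautifulSoup.
    ASCII characters such as <, > and & are left alone so that
    existing tags and entities in the markup are not escaped again.
    """
    pieces = []
    for ch in markup:
        codepoint = ord(ch)
        if codepoint > 127 and codepoint in codepoint2name:
            pieces.append("&" + codepoint2name[codepoint] + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


# The inliner output from this issue contains U+00A0 (non-breaking space).
print(restore_named_entities("<p>\xa0</p>"))  # <p>&nbsp;</p>
```

Note that this rewrites every non-ASCII character that has a named entity (for example é becomes &eacute;), which may or may not be what an email template wants.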

ShadowKyogre commented 8 years ago

Hey everyone. Thought I'd pop into this discussion.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters

You can use a formatter on the soup to restore HTML entities after BS4 slurps up the input string. I tested it with the OP's example, but are there any edge cases that the formatter might not capture?

>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup("<p>&nbsp;</p>", "html.parser")
>>> s.prettify(formatter="html")
'<p>\n &nbsp;\n</p>'

EDIT: This wouldn't be wanted in cases where minified output is absolutely necessary, since prettify() adds newlines and indentation.
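For the minified case, here is a sketch of the same formatter applied without pretty-printing. It assumes Python 3 with bs4 installed and relies on `decode()` accepting the same `formatter` argument as `prettify()`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&nbsp;</p>", "html.parser")

# decode() serializes the tree without adding newlines or indentation.
# formatter="html" writes characters back as named entities where one exists.
compact = soup.decode(formatter="html")
print(compact)
```

Whether this covers every entity an email template needs has not been checked here; it only shows that the formatter is not tied to prettify().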

rennat commented 7 years ago

Fixed in 0.7.2.