mankyd / htmlmin

A configurable HTML Minifier with safety features
https://htmlmin.readthedocs.org/en/latest/

HTML Entities get decoded when minifying #29

Closed by Tenzer 8 years ago

Tenzer commented 8 years ago

Hi, I have a problem where htmlmin can really screw up a site: it decodes HTML entities while minifying. Consider this:

Python 3.5.0 (default, Sep 14 2015, 02:37:27)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from htmlmin import minify
>>> minify('<code>&lt;script&gt;</code>')
'<code><script></code>'

It's pretty dangerous for the escaped, safe tag to get unescaped, because the rest of the page is then swallowed into the <script> element unless there is an end tag further down the page.

I don't know whether this, like #17, is down to the Python HTML parser trying to be clever and decoding it automatically without htmlmin knowing about it, or whether it's something that can be solved.

Tenzer commented 8 years ago

A bit more digging shows it's related to Python 3.5. In that version, the default value of the convert_charrefs parameter of html.parser.HTMLParser() changed from False to True. I'll post a pull request in a second that sets it to False on Python 3.5 and newer.
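For anyone who wants to see the effect of convert_charrefs in isolation, here is a minimal sketch using only the standard library. It is not htmlmin's actual code: EchoParser and echo are hypothetical names made up for this illustration, and the parser just re-emits what it is fed (attributes are dropped for brevity).

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical parser that re-emits the markup it parses."""

    def __init__(self, convert_charrefs):
        # Pass the flag explicitly so the result does not depend on
        # the default of the running Python version.
        super().__init__(convert_charrefs=convert_charrefs)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s>" % tag)  # attributes ignored in this sketch

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # With convert_charrefs=True, entities arrive here already decoded.
        self.out.append(data)

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        # Only called when convert_charrefs is False.
        self.out.append("&#%s;" % name)


def echo(html, convert_charrefs):
    parser = EchoParser(convert_charrefs)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


source = "<code>&lt;script&gt;</code>"
print(echo(source, convert_charrefs=False))
print(echo(source, convert_charrefs=True))
```

With convert_charrefs=False the entity references go through handle_entityref and can be written back unchanged. With True they are decoded before handle_data sees them, so a parser that echoes its data verbatim ends up emitting a real <script> tag.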