mankyd / htmlmin

A configurable HTML Minifier with safety features
https://htmlmin.readthedocs.org/en/latest/
Other
129 stars 41 forks source link

pre_attr does not work as expected #17

Closed markfink closed 6 years ago

markfink commented 9 years ago

I am maintainer of buccaneer a static site generator. I have a "minify" plugin that uses htmlmin. During testing we discovered an issue.

Our expectation was that htmlmin would leave the href attribute intact (given appropriate config):

<a class="reference external" href="mailto:sam&#37;&#52;&#48;sample&#46;com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

...but it modifies the href attribute anyway:

<a class="reference external" href="mailto:sam%40sample.com">sam<span>&#64;</span>sample<span>&#46;</span>com</a>

here the code that calls into htmlmin:

            logger.debug('Minifying: %s' % filename)
            compressed = min(uncompressed, remove_comments=True,
                             remove_all_empty_space=True,
                             remove_empty_space=True,
                             keep_pre=True,
                             pre_attr='href')
            f.write(compressed)

please let me know what you think. Thank you.

oprypin commented 9 years ago

pre_attr works as described in documentation, not as you guessed it might work.

markfink commented 9 years ago

@BlaXpirit : not helpful! I guess you are talking about the API Reference. Here is what it says:

pre_attr – Specifies the attribute that, when found in an HTML tag, indicates that the content of the tag should not be minified. Defaults to pre.

mankyd commented 9 years ago

I will need to think on this.

I can see where the confusion comes from. The documentation can definitely be made clearer, perhaps with an example. The pre_attr is meant to relate the the contents of the tag (i.e. "sam" in your example) not the value of attributes. Maybe it makes sense to have some similar allowance for attributes.

I can think of a few ways to do this, none of which I love. One would be a list of attributes which should not be touched. Another would be a special attribute prefix which would get stripped off. So pre-href="..." could simply become href="...".

However, none of this actually addresses the source of the problem which lies outside of htmlmin. In fact, this problem comes from Python's HTMLParser library itself (https://docs.python.org/2/library/htmlparser.html). Notice what happens in this example:

>>> from HTMLParser import HTMLParser
>>> class P(HTMLParser):
...   def handle_starttag(self, tag, attrs):
...     print tag, attrs
... 
>>> p = P()
>>> p.feed('<a href="mailto:sam&#37;&#52;&#48;sample&#46;com">test</a>')
a [('href', u'mailto:sam%40sample.com')]

In this case, htmlmin is in fact completely unaware of the fact that hex-encoded character references were even used. HTMLParser is interpreting them beforehand. It has a callback to deal with character references, but it doesn't seem to be called for attribute values in my quick testing.

Like I said, I'll need to think about this problem. I am open to discussion and opinions.