mankyd / htmlmin

A configurable HTML Minifier with safety features
https://htmlmin.readthedocs.org/en/latest/

`&` not handled correctly inside urls #64

Open fekir opened 2 years ago

fekir commented 2 years ago

Consider the following index.html:

<!DOCTYPE html>
<html lang="en">
        <head>
                <meta charset="utf-8">
                <title>Hello World</title>
        </head>
        <body>
                <p>Some text with an <a href="https://example.com/something&amp;search=test">url</a></p>
                <p>Some text with another <a href="https://example.com/something&search=test">url</a></p>
                <p>And other text with another <a href="https://example.com/something%26search=test">url</a></p>
        </body>
</html>

Running `htmlmin index.html` generates:

<!DOCTYPE html><html lang=en> <head><meta charset=utf-8><title>Hello World</title></head> <body> <p>Some text with an <a href="https://example.com/something&search=test">url</a></p> <p>Some text with another <a href="https://example.com/something&search=test">url</a></p> <p>And other text with another <a href="https://example.com/something%26search=test">url</a></p> </body> </html>

Notice how the first url changed from https://example.com/something&amp;search=test to https://example.com/something&search=test.

This is technically incorrect: inside an attribute value, the ampersand character marks the beginning of an entity reference, so a literal ampersand in a URL should be written as &amp;. The W3C guidelines also recommend using &amp; instead of & in this context.

Validators such as tidy-html5 (see https://github.com/htacg/tidy-html5/issues/1017) correctly complain about the invalid entity reference.
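To illustrate the expected behaviour, here is a minimal sketch using only Python's standard library. It is not htmlmin's actual code. It only shows that `html.parser.HTMLParser` reports attribute values already unescaped, so writing them back verbatim loses the `&amp;`, while re-escaping on output restores it. `HrefCollector` and `serialize_attr` are hypothetical names made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with character references decoded.
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


def serialize_attr(value):
    # Hypothetical helper: re-escape "&" (and quotes) before writing
    # the value back into a quoted attribute.
    return escape(value, quote=True)


collector = HrefCollector()
collector.feed(
    '<a href="https://example.com/something&amp;search=test">url</a>'
    '<a href="https://example.com/something&search=test">url</a>'
)

decoded = collector.hrefs
reencoded = [serialize_attr(value) for value in decoded]

for before, after in zip(decoded, reencoded):
    print(before, "->", after)
```

With this approach both links would be emitted with `&amp;`, which is what a validator expects, and the already percent-encoded `%26` form would be left untouched because it contains no ampersand.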