Closed stevenvachon closed 8 years ago
Do you have a link for htmlparser2? Searched github without success.
Cool, I'll mark it as "possible feature" for now. Would be interesting to see if there's any deviations (or improvements) from our current parser and how faster it really is.
Relatedly, we'd be able to see performance difference easier after we're done with #143
:+1:
The capability to stream results is solid, but the output format precludes some of the stuff we're doing with customAttrSurround
. I'm not sure it would be possible to integrate this in a way that would be backwards-compatible for htmlparser2's existing consumers.
Perhaps we should consider renaming this to "switch to parse5": https://www.npmjs.com/package/parse5
I spent a little time writing up a proof-of-concept extensible html tokenizer. https://github.com/duncanbeevers/skunks
It's notable difference from projects like htmlparser2 and parse5 is that the tokenizer is built from tokenization rules at run-time rather than relying on a hard-coded tokenizer.
This extensibility means that adding new rules for custom dialects is supported directly and can be accomplished through simple extension of the base tokenizers.
As an example, I've included an Angular parser that recognizes elements like <ng:include>
, while keeping support for these freaky nodes out of the core html tokenizer.
This is more important than you may think. HTML emails are often delivered by designers with invalid/incorrect source code and would be auto-corrected (as a browser would) with parse5. I realize that switching to this will likely cause issues for minification of handlebars/mustache and downlevel-revealed IE comments, but that shouldn't be a priority in my opinion because:
You could use my handlebars-html-parser in the near future (as I'll be finishing it soon) if such is important to you.
Using parse5 would also denecessitate the removeRedundantAttributes
option.
html-uglify could catch up and you're toast.
I glanced at html-uglify and it seems to have a long way to go until it can reach our functionality, unless I'm missing something. Is there anything it does better?
How is parse5 better than htmlparser2? Both seem to be pretty well maintained, used, and compliant.
html-uglify is using cheerio, so they can add features very quickly and with less effort, despite less performance due to excessive searching. It's also quite actively maintained. See htmlnano.
parse5 parses HTML like a browser (100% compliance), so it corrects many errors which can cause errors elsewhere (such as CSS selectors).
htmlparser2 is supposedly a "forgiving" parser as well. I'm all for switching to a better engine but I don't see a clear advantage of parse5 as of now.
As I said regarding parse5, its compliance corrects invalid HTML, which is very useful to emails which are so funky to put together.
Given my recent experience hunting down bugs, the built-in htmlparser.js
actually helps a lot, as not only is it forgiving to non-compliant inputs, minify()
has various "hacks" that rely on the exact behaviour of the parser.
Why not use both, then? Use parse5 to correct the structure and rely on its location info to compare with htmlparser. This will allow for parsing things like downlevel-revealed IE comments [despite using parse5].
Downlevel-revealed comments have been parsed and optimised even before #524 - and we now gained minification of downlevel-hidden ones as well.
What performance gain do we get if we use both HTML parsers?
Not a performance gain, correct HTML parsing.
<table><a>asdf</a><tr><td></td></tr></table>
asdf<p>asdfadsf<p><div>asdf</div>
<body class="class1">
<body class="class2">
What do those code snippets do in real life scenarios, and how do they go wrong after minify()
which affects the rendering in a web browser?
As a quick test, I've thrown the 3 examples into http://kangax.github.io/html-minifier/ and they all seem to give fairly sensible results.
in the first example, the <a>
should've moved to outside the <table>
.
the second example should've closed the last <p>
before the <div>
, instead of encapsulating it.
the third example should've merged the two <body>
elements.
I think that you're not up to speed on spec-compliant HTML parsing.
But will the output cause a web browser to render differently from the input?
We are hacking the HTML specs after all...
No, parse5 parses HTML exactly as the browser does. Any hacked specs should be done around a spec compliant parser to ensure that everything else is corrected.
If the output of minify()
is going to be parsed and rendered identically in a web browser as the original input, why does it matter? This is just one of the many ways to perform HTML minifications - others for instance just use a bunch of regex's to get the job done.
After all we aim to reduce the size of an input without (in majority of cases) visual difference when displayed on the web.
Take #157 for instance - if we were to parse the HTML exactly, then the <html>
within <head>
would get merged with the root <html>
, nullifying the use of this IE hack. Whereas leaving it as is whilst aiming for other areas to minify would give the desirable result.
It would not get merged because they are within downlevel-hidden conditional IE comments.
Oh, yes, the bubbling. A special conditional comment parser would need to be written to handle many of these cases. I had mentioned something to this effect earlier by comparing with parse5's location info.
The main question is — what are the actual cases where parse5 acts as a browser and html-minifier's now-heavily-tweaked-and-improved parser doesn't?
I've listed 3 examples above.
Also listed above, parse5 automatically performs removeRedundantAttributes
, just as a browser does. There're more cases, but I don't have them all memorized.
I think the only example that would matter is if the input and output from minify()
gives different results on a web browser.
I have already mentioned a case whereby parse5
or some other "standard compliant" approach would break that rule. Can you provide an example where parse5
would produce a functional output whilst minify()
wouldn't?
Because if not, then what we will end up doing is fixing up all the corner cases where parse5
breaks, when we have a perfectly functional solution right now. Right now by being tolerant we are actually allowing weird and wonderful hacks of HTML as others try to workaround various web browser implementation differences. If we approach this the other way round as you suggested, we will be forever patching the parser as new hacks are discovered.
My second example (<p>
one) gives different rendered results due to the use of a non-spec-compliant parser.
I doubt that we'll have new hacks. We have a standard that we follow and IE conditional comments are not supported in IE10+. Anything added to the spec will be added to parse5.
So just to be on the same page, I assume you mean this example:
asdf<p>asdfadsf<p><div>asdf</div>
I just tried it here and it gives the exact same output. So where does that give different rendered results?
That's because you have removeOptionalTags=true
, which is not a default setting in the provided package.
Well I'm sure Boostrap would be delighted to know this page is about to get obsolete?
You're referring to CSS/JS hacks. What does that have to do with an HTML parser?
Here's the example before minify()
:
Turning all the options off and minify()
:
I can't say I can see any difference in rendered content...
It affects css selectors. body > div
would no longer work because the structure has been changed to body > p > div
. It's incorrect.
We don't browse the web daily by reading up and down DOM trees, do we? :sweat_smile:
It's up to you if you want your library to be accurate and complete.
I know that page is for CSS-related browser bugs, but the point is that software will always have bugs, and there are multiple web browsers that people use.
Historically, web developers employ clever tricks to work around platform-specific behaviours. I can't see why this situation is going to change any time soon.
This is a bug. This minifier will produce output that will render differently than its input because it affects CSS selectors. It will also affect document.querySelector()
/.querySelectorAll()
.
You don't sound very interested in fixing it, so I'll just close it and move on to htmlnano which, through posthtml, has plans to switch to parse5.
The solution is not very obvious here. Ideally, of course, we'd want markup to stay the same. But minification has consequences. Take, for example, removal of attributes — certain CSS will no longer function the same, certain querying by selectors will no longer function the same, and so on.
If there's a way for us to adhere to the same markup without too much hassle, we should try do it. /cc @alexlamsl
@kangax removal of attributes is optional.. changing rendered html structure currently is not, but it is doing it.
Good point, even more reason to fix it.
On Fri, Mar 4, 2016 at 7:06 PM, Steven Vachon notifications@github.com wrote:
@kangax https://github.com/kangax removal of attributes is optional.. changing rendered html structure currently is not, but it is doing it.
— Reply to this email directly or view it on GitHub https://github.com/kangax/html-minifier/issues/155#issuecomment-192531622 .
@kangax sure, file an issue which describe this class of parser bugs and I'll have a look at them. :wink:
I am not against fixing bugs in htmlparser
, obviously. However I am not convinced of solutions which involves breaking existing use cases.
I never said that anything had to break permanently. Sure, things may break before a release, but how is that different than any other refactor?
It's faster and is what is used by cheerio. Maybe even use cheerio as well for some features.