Switch to newer, faster HTML parser

stevenvachon commented 10 years ago

It's faster and is what is used by cheerio. Maybe even use cheerio as well for some features.

kangax commented 10 years ago

Do you have a link for htmlparser2? Searched github without success.

stevenvachon commented 10 years ago

https://github.com/fb55/htmlparser2/

kangax commented 10 years ago

Cool, I'll mark it as "possible feature" for now. Would be interesting to see if there's any deviations (or improvements) from our current parser and how faster it really is.

Relatedly, we'd be able to see performance difference easier after we're done with #143

sindresorhus commented 10 years ago

:+1:

duncanbeevers commented 9 years ago

The capability to stream results is solid, but the output format precludes some of the stuff we're doing with customAttrSurround. I'm not sure it would be possible to integrate this in a way that would be backwards-compatible for htmlparser2's existing consumers.

stevenvachon commented 9 years ago

Perhaps we should consider renaming this to "switch to parse5": https://www.npmjs.com/package/parse5

duncanbeevers commented 9 years ago

I spent a little time writing up a proof-of-concept extensible html tokenizer. https://github.com/duncanbeevers/skunks

It's notable difference from projects like htmlparser2 and parse5 is that the tokenizer is built from tokenization rules at run-time rather than relying on a hard-coded tokenizer.

This extensibility means that adding new rules for custom dialects is supported directly and can be accomplished through simple extension of the base tokenizers.

As an example, I've included an Angular parser that recognizes elements like <ng:include>, while keeping support for these freaky nodes out of the core html tokenizer.

stevenvachon commented 8 years ago

This is more important than you may think. HTML emails are often delivered by designers with invalid/incorrect source code and would be auto-corrected (as a browser would) with parse5. I realize that switching to this will likely cause issues for minification of handlebars/mustache and downlevel-revealed IE comments, but that shouldn't be a priority in my opinion because:

IE9 is not the future
It's no longer a good practice to include handlebars/mustache strings in production assets and we don't need to minify our dev templates

You could use my handlebars-html-parser in the near future (as I'll be finishing it soon) if such is important to you.

Using parse5 would also denecessitate the removeRedundantAttributes option.

html-uglify could catch up and you're toast.

kangax commented 8 years ago

I glanced at html-uglify and it seems to have a long way to go until it can reach our functionality, unless I'm missing something. Is there anything it does better?

kangax commented 8 years ago

How is parse5 better than htmlparser2? Both seem to be pretty well maintained, used, and compliant.

stevenvachon commented 8 years ago

~~html-uglify is using cheerio, so they can add features very quickly and with less effort, despite less performance due to excessive searching. It's also quite actively maintained.~~ See htmlnano.

parse5 parses HTML like a browser (100% compliance), so it corrects many errors which can cause errors elsewhere (such as CSS selectors).

kangax commented 8 years ago

htmlparser2 is supposedly a "forgiving" parser as well. I'm all for switching to a better engine but I don't see a clear advantage of parse5 as of now.

stevenvachon commented 8 years ago

It's forgiving, but it's not compliant. high5 is htmlparser2's complaint upgrade, but it will remove much if not all of the forgiving features, from what was mentioned here.

stevenvachon commented 8 years ago

As I said regarding parse5, its compliance corrects invalid HTML, which is very useful to emails which are so funky to put together.

alexlamsl commented 8 years ago

Given my recent experience hunting down bugs, the built-in htmlparser.js actually helps a lot, as not only is it forgiving to non-compliant inputs, minify() has various "hacks" that rely on the exact behaviour of the parser.

stevenvachon commented 8 years ago

Why not use both, then? Use parse5 to correct the structure and rely on its location info to compare with htmlparser. This will allow for parsing things like downlevel-revealed IE comments [despite using parse5].

alexlamsl commented 8 years ago

Downlevel-revealed comments have been parsed and optimised even before #524 - and we now gained minification of downlevel-hidden ones as well.

What performance gain do we get if we use both HTML parsers?

stevenvachon commented 8 years ago

Not a performance gain, correct HTML parsing.

<table><a>asdf</a><tr><td></td></tr></table>

asdf<p>asdfadsf<p><div>asdf</div>

<body class="class1">
<body class="class2">

alexlamsl commented 8 years ago

What do those code snippets do in real life scenarios, and how do they go wrong after minify() which affects the rendering in a web browser?

As a quick test, I've thrown the 3 examples into http://kangax.github.io/html-minifier/ and they all seem to give fairly sensible results.

stevenvachon commented 8 years ago

in the first example, the <a> should've moved to outside the <table>.

the second example should've closed the last <p> before the <div>, instead of encapsulating it.

the third example should've merged the two <body> elements.

I think that you're not up to speed on spec-compliant HTML parsing.

alexlamsl commented 8 years ago

But will the output cause a web browser to render differently from the input?

We are hacking the HTML specs after all...

stevenvachon commented 8 years ago

No, parse5 parses HTML exactly as the browser does. Any hacked specs should be done around a spec compliant parser to ensure that everything else is corrected.

alexlamsl commented 8 years ago

If the output of minify() is going to be parsed and rendered identically in a web browser as the original input, why does it matter? This is just one of the many ways to perform HTML minifications - others for instance just use a bunch of regex's to get the job done.

After all we aim to reduce the size of an input without (in majority of cases) visual difference when displayed on the web.

alexlamsl commented 8 years ago

Take #157 for instance - if we were to parse the HTML exactly, then the <html> within <head> would get merged with the root <html>, nullifying the use of this IE hack. Whereas leaving it as is whilst aiming for other areas to minify would give the desirable result.

stevenvachon commented 8 years ago

~~It would not get merged because they are within downlevel-hidden conditional IE comments.~~

Oh, yes, the bubbling. A special conditional comment parser would need to be written to handle many of these cases. I had mentioned something to this effect earlier by comparing with parse5's location info.

kangax commented 8 years ago

The main question is — what are the actual cases where parse5 acts as a browser and html-minifier's now-heavily-tweaked-and-improved parser doesn't?

stevenvachon commented 8 years ago

I've listed 3 examples above.

Also listed above, parse5 automatically performs removeRedundantAttributes, just as a browser does. There're more cases, but I don't have them all memorized.

alexlamsl commented 8 years ago

I think the only example that would matter is if the input and output from minify() gives different results on a web browser.

I have already mentioned a case whereby parse5 or some other "standard compliant" approach would break that rule. Can you provide an example where parse5 would produce a functional output whilst minify() wouldn't?

Because if not, then what we will end up doing is fixing up all the corner cases where parse5 breaks, when we have a perfectly functional solution right now. Right now by being tolerant we are actually allowing weird and wonderful hacks of HTML as others try to workaround various web browser implementation differences. If we approach this the other way round as you suggested, we will be forever patching the parser as new hacks are discovered.

stevenvachon commented 8 years ago

My second example (<p> one) gives different rendered results due to the use of a non-spec-compliant parser.

stevenvachon commented 8 years ago

I doubt that we'll have new hacks. We have a standard that we follow and IE conditional comments are not supported in IE10+. Anything added to the spec will be added to parse5.

alexlamsl commented 8 years ago

So just to be on the same page, I assume you mean this example:

asdf<p>asdfadsf<p><div>asdf</div>

I just tried it here and it gives the exact same output. So where does that give different rendered results?

stevenvachon commented 8 years ago

That's because you have removeOptionalTags=true, which is not a default setting in the provided package.

alexlamsl commented 8 years ago

Well I'm sure Boostrap would be delighted to know this page is about to get obsolete?

stevenvachon commented 8 years ago

You're referring to CSS/JS hacks. What does that have to do with an HTML parser?

alexlamsl commented 8 years ago

Here's the example before minify(): before

Turning all the options off and minify(): after

I can't say I can see any difference in rendered content...

stevenvachon commented 8 years ago

It affects css selectors. body > div would no longer work because the structure has been changed to body > p > div. It's incorrect.

alexlamsl commented 8 years ago

We don't browse the web daily by reading up and down DOM trees, do we? :sweat_smile:

stevenvachon commented 8 years ago

It's up to you if you want your library to be accurate and complete.

alexlamsl commented 8 years ago

I know that page is for CSS-related browser bugs, but the point is that software will always have bugs, and there are multiple web browsers that people use.

Historically, web developers employ clever tricks to work around platform-specific behaviours. I can't see why this situation is going to change any time soon.

stevenvachon commented 8 years ago

This is a bug. This minifier will produce output that will render differently than its input because it affects CSS selectors. It will also affect document.querySelector()/.querySelectorAll().

You don't sound very interested in fixing it, so I'll just close it and move on to htmlnano which, through posthtml, has plans to switch to parse5.

kangax commented 8 years ago

The solution is not very obvious here. Ideally, of course, we'd want markup to stay the same. But minification has consequences. Take, for example, removal of attributes — certain CSS will no longer function the same, certain querying by selectors will no longer function the same, and so on.

If there's a way for us to adhere to the same markup without too much hassle, we should try do it. /cc @alexlamsl

stevenvachon commented 8 years ago

@kangax removal of attributes is optional.. changing rendered html structure currently is not, but it is doing it.

kangax commented 8 years ago

Good point, even more reason to fix it.

On Fri, Mar 4, 2016 at 7:06 PM, Steven Vachon notifications@github.com wrote:

@kangax https://github.com/kangax removal of attributes is optional.. changing rendered html structure currently is not, but it is doing it.

— Reply to this email directly or view it on GitHub https://github.com/kangax/html-minifier/issues/155#issuecomment-192531622 .

alexlamsl commented 8 years ago

@kangax sure, file an issue which describe this class of parser bugs and I'll have a look at them. :wink:

I am not against fixing bugs in htmlparser, obviously. However I am not convinced of solutions which involves breaking existing use cases.

stevenvachon commented 8 years ago

I never said that anything had to break permanently. Sure, things may break before a release, but how is that different than any other refactor?

kangax / html-minifier

Switch to newer, faster HTML parser #155