htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.72k stars 419 forks source link

drop-empty-elements does not drop elements left empty as a result of hide-comments #828

Open leppert opened 5 years ago

leppert commented 5 years ago

I would expect the following two inputs to produce equivalent output. However, the second input, in which a comment is hidden, returns an empty <p> tag.

$> echo '<p></p>' | tidy --drop-empty-elements true --hide-comments true

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Apple macOS version 5.7.24">
<title></title>
</head>
<body>
</body>
</html>

$> echo '<p><!-- foo --></p>' | tidy --drop-empty-elements true --hide-comments true

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Apple macOS version 5.7.24">
<title></title>
</head>
<body>
<p></p>
</body>
</html>
geoffmcl commented 5 years ago

@leppert thank you for the issue...

Yes, it is sort of strange...

Can only guess the drop-empty-element occurs, in code, before the hide-comments... In other words, at the time empty is checked, the comment is still there... and tidy does not re-process the document, after the hide... or something...

Will try to look further into it...

Meantime, look forward to further feedback... including patches, PR, ... thanks...

geoffmcl commented 5 years ago

@leppert with more testing and debugging, it is more or less as I thought...

As you may know tidy has sort of three phases of operation - parsing, cleaning, output

The drop-empty-element is in the parsing stage...

The hide-comments is in the output stage...

And there is no re-parsing of the DOM like tree of nodes, at this stage, looking for now potentially empty elements, to drop them... so they stay...

While this is possible, it requires someone to add the code to do this... see DropEmptyElementa and then DropComments, for example...

Is there sufficient use case the attack this? Feels quite minor...

Look forward to further feedback... including patches, PR, ... thanks...

leppert commented 5 years ago

@geoffmcl thanks for the update and for looking further into the issue. I agree -- it's hard to imagine this being a large scale problem for users unless there are other, perhaps more common circumstances where nodes are removed in the output stage. If I find a moment, I'll certainly attempt to submit a PR. Regardless, perhaps this ticket might prove useful should the pipeline and its order of phases ever get reworked for another reason; a bug of this small size is probably best fixed "for free" in the process of serving a larger goal.