htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.71k stars 417 forks source link

enclose-block-text adds p inside noscript inside p #905

Open jtojnar opened 3 years ago

jtojnar commented 3 years ago

<noscript> has transparent content model so adding paragraph inside is just like adding paragraph directly inside the outer paragraph. <p> elements cannot contain block elements like other paragraphs so tidy produces an invalid HTML:

$ echo '<p><noscript>test</noscript></p>' | result/bin/tidy --enclose-block-text yes
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: inserting implicit <body>
line 1 column 1 - Warning: inserting missing 'title' element
Info: Document content looks like HTML5
Tidy found 3 warnings and 0 errors!

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.7.35">
<title></title>
</head>
<body>
<p><noscript>
<p>test</p>
</noscript></p>
</body>
</html>
geoffmcl commented 3 years ago

@jtojnar thank you for your issue... but I need more information...

I am not familiar with the enclose-block-text option, its use case... what is it supposed to do? I can read the quickref docs, and it seems tidy is doing what it is supposed to do, with this option set on...

But I also certainly agree putting a <p> inside another <p> is invalid...

But what do you expect the tidy output to be in this case? Please show, explain...

Look forward to further feedback... maybe more sample code... etc... thanks...

jtojnar commented 3 years ago

I would expect the option to wrap text nodes in articles, divs and other block elements inside paragraphs. For example, <div>text</div> would be converted into <div><p>text</p></div>. I would even expect <div><noscript>text</noscript></div> to be converted into <div><noscript><p>text</p></noscript></div> or <div><p><noscript>text</noscript></p></div>. I would expect <p><noscript>test</noscript></p> to be left as is.

But reading the comment accompanying the implementation:

https://github.com/htacg/tidy-html5/blob/86b52dc1081ca4b0582c7bad279bf254bad268e1/src/parser.c#L4505-L4507

it looks like the purpose of the option is to avoid having #PCDATA in noscript elements, not introduce paragraphs everywhere. In that case I would suggest using span or other inline element when p or other closing element is one of noscript’s parents. But looking at the spec, HTML 4 actually only allows block elements as children of noscript so the element cannot really be inside p. Actually, it cannot be inside p because it is considered a block element in HTML 4 and p only supports inline elements.

So I guess the only solution is not to use EncloseBlockText since it does not do what we would want and does what we do not want.

gwern commented 1 year ago

I don't know if this is connected, but I ran into an issue with tidy & noscript elements: <style> tags are by default hoisted out of <noscript> tags and into the <head>. This makes the style contents apply unconditionally to the web page, and not solely when JS is disabled. This is almost always wrong and will break things when JS is enabled (otherwise, the style tag would not have been in the noscript to begin with). It is unclear how to solve this other than to add --fix-style-tags no.


In my case, I have been adding warnings to JS-less readers that they are experiencing a highly-degraded version of my website. The existing noscript warning, which is a simple little div at the top & bottom of the page with text describing the missing features, turns out to be inadequate because it is not obtrusive enough and a reader will often jump to a specific section or anchor link on one of my pages. To avoid both, I decided to float the warning, and make it a sticky using some CSS. The CSS in question could be added to a .css file or inlined on the warning, and it works fine, because the JS reader never sees it; unfortunately, it looks ugly inasmuch as it will overlap the header at the top of the page.

So, I added some padding to the top header, also wrapped in the noscript tags. So it's like <noscript> ...warning... <style>header { padding-top: 50px; }</style> </noscript>. Simple, logical, seemed to work exactly as desired when I tried editing live HTML to check.

But breaks with Tidy, because tidy apparently sees the <style> tags sitting there in the body, and relocates it to the head - but out of context, without the noscript tag. The noscript tag is left in place, partially emptied. This means that JS readers now see a 50px gap at the top of the page for no reason. This was a surprising interaction to me; I was unaware Tidy did such a thing because it hadn't come up before, and even if I had memorized the Tidy rewrites, I don't think it would've occurred to me that any such moving would either happen inside a noscript or leave behind the noscript and cause this failure.


There is no obvious-to-me way to fix it, short of disabling the feature entirely with --fix-style-tags no. (The noscriptness of style tags can't be screwed up if Tidy is ordered to leave style tags alone completely, presumably.) I cannot put the inline style on the header or in my current CSS files, because then there's no way to conditionally enable/disable it. Reading the spec for <noscript> it seems it's now valid in HTML5 to put a noscript inside the head? So presumably I'd be able to avoid the tidy rewrite by splitting the noscript and moving the style part to the head. More complicated and indirect and I'm not sure it'd work. I might also work around it by creating an entire separate CSS file just for styles only applying in no-JS appearances and noscripting a <link> of that? Not really a fan of that either (it'd be literally two short declarations).

Maybe tidy should either skip things inside noscript tags, or if they are in a noscript, make sure to put them in a noscript wherever they are copied?