htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

escape-cdata doesn't work on <script> tags #659

Open dechamps opened 6 years ago

dechamps commented 6 years ago

Documentation states:

--escape-cdata Boolean (no if unset) This option specifies if Tidy should convert <![CDATA[]]> sections to normal text.

On current HEAD (f0438bd):

$ tidy --escape-cdata yes <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta charset="utf-8" /></head>
<body>
<script><![CDATA[
    foo
]]></script>
</body></html>
EOF

Returns:

Info: Document content looks like XHTML5
No warnings or errors were found.

<?xml version="1.0"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.7.0" />
<title></title>
<meta charset="utf-8" />
</head>
<body>
<script>
<![CDATA[
        foo
]]>
</script>
</body>
</html>

The CDATA section is still there.

geoffmcl commented 6 years ago

@dechamps thank you for the issue, but I am unsure exactly what you are pointing out here. What change are you suggesting?

I am not sure I fully understand CDATA sections, but from searching around, they seem mainly related to XML. The simplest definition I found was "CDATA marked sections are the preferred method for entering verbatim text in an SGML document."

Using the option --show-body-only yes on the following input -

<body>
<![CDATA[
    <a href="#foo">bar</a>
]]>

will give a virtually unchanged output of -

Info: Document content looks like HTML5
No warnings or errors were found.

<![CDATA[
        <a href="#foo">bar</a>
]]>

Now adding the option --escape-cdata yes will change that output to -

   &lt;a href="#foo"&gt;bar&lt;/a&gt;

That is more or less what the docs state: the <![CDATA[ ... ]]> wrapper has been removed and the markup inside it escaped, which is the idea of escaped CDATA.
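To make the transformation concrete, here is a minimal Python sketch of the behaviour described above. It is not tidy's implementation; the function name `escape_cdata` and the regular expression are assumptions made for illustration only.

```python
import html
import re

# Matches one CDATA section; DOTALL lets the body span several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def escape_cdata(markup: str) -> str:
    """Hypothetical helper: replace each CDATA section with its content as escaped text."""
    # quote=False escapes only &, < and >, leaving double quotes untouched.
    return CDATA_RE.sub(lambda m: html.escape(m.group(1), quote=False), markup)


sample = '<![CDATA[<a href="#foo">bar</a>]]>'
print(escape_cdata(sample))  # &lt;a href="#foo"&gt;bar&lt;/a&gt;
```

The wrapper disappears and only the markup characters are escaped, which matches the shape of the output shown above.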

Of course, I agree the current description in the docs leaves something to be desired and should be expanded to explain this more clearly. Ideas, patches, or a PR to do that are welcome, thanks.

Now if, as you have done, you put the <![CDATA[ ... ]]> block inside a <script> ... </script> block, then tidy does not see it as a CDATA section. It sees a script tag whose content is more or less preformatted text.

Script, and style tags, and maybe some others, already tell tidy to not mess too much with the text content of these blocks... and the option --escape-cdata just does not apply to these blocks...
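The distinction can be sketched in Python as follows. This is only an illustration of the behaviour described in this comment, not tidy's parser: the element list (`script`, `style`), the regular expressions, and the function name are all assumptions, and a regex-based split like this would be fooled by a `<script>` tag appearing inside a CDATA section.

```python
import html
import re

# Assumed set of "leave the text alone" elements, for illustration only.
RAW_TEXT_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def _escape_sections(text: str) -> str:
    # Replace each CDATA section with its content as escaped text.
    return CDATA_RE.sub(lambda m: html.escape(m.group(1), quote=False), text)


def escape_cdata_outside_raw_text(markup: str) -> str:
    """Hypothetical helper: escape CDATA sections, but copy script/style elements verbatim."""
    out = []
    pos = 0
    for m in RAW_TEXT_RE.finditer(markup):
        out.append(_escape_sections(markup[pos:m.start()]))  # ordinary content
        out.append(m.group(0))                               # raw-text element, untouched
        pos = m.end()
    out.append(_escape_sections(markup[pos:]))
    return "".join(out)


doc = "<body><![CDATA[<b>]]><script><![CDATA[ foo ]]></script></body>"
print(escape_cdata_outside_raw_text(doc))
# <body>&lt;b&gt;<script><![CDATA[ foo ]]></script></body>
```

The CDATA section in the body is converted, while the one inside `<script>` is passed through unchanged, which is the result reported in this issue.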

Then, in your sample, you add the complication of an XML 1.0 declaration, followed by an HTML5 doctype, followed by an html tag with an XHTML xmlns attribute. But maybe this is covered in your previous issues #657 and #658, so I will try to address that there.

And then I see your next issue, #660, also deals with tidy adding CDATA around <script> text when tidy is outputting XHTML5. As you point out, this can be overcome with -ashtml, but again I will try to deal with that there.

Anyway, does this answer why "escape-cdata doesn't seem to work"? It does change CDATA sections, but not the text of script, style, and similar tags. Or have I missed the point somewhere?

Please explain more if so, thanks.

dechamps commented 6 years ago

I think we're on the same page.

"Script, and style tags, and maybe some others, already tell tidy to not mess too much with the text content of these blocks... and the option --escape-cdata just does not apply to these blocks..."

Ah, interesting. I only tested CDATA sections in <script> tags. I wasn't aware that --escape-cdata does work in some places, just not in that one. So I guess a more accurate bug summary is "escape-cdata doesn't seem to work on <script> tags", which is unfortunate, since it is precisely in that sort of tag that CDATA tends to get used, and where the --escape-cdata option would be most useful!

It's a bit unfortunate that tidy tries not to mess with the content of <script> tags in some respects (like these CDATA sections) but then actively messes with it in others (see #660). It's weirdly inconsistent, and in the present case it means that --escape-cdata cannot be used to "fix" the problems introduced by #660.