htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.7k stars 415 forks source link

Tidy should not output unguarded CDATA sections #660

Open dechamps opened 6 years ago

dechamps commented 6 years ago

On current HEAD (f0438bd):

$ tidy <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta charset="utf-8" /></head><body>
<script>foo</script>
</body></html>
EOF

Returns:

Info: Document content looks like XHTML5
No warnings or errors were found.

<?xml version="1.0"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.7.0" />
<title></title>
<meta charset="utf-8" />
</head>
<body>
<script>
<![CDATA[
foo
]]>
</script>
</body>
</html>

Tidy decided to wrap the contents of the <script> element inside a CDATA section. This is fine from a pure XML perspective, but in practice, browsers tend to not like that at all since they will typically parse the document as HTML, not XHTML (that's true for at least modern Firefox). They will not recognize the <![CDATA[ sequence and will try to parse it as JavaScript, leading to parse errors and a broken script.

I think the most straightforward way to solve this problem is to forget about CDATA sections and simply escape the <script> contents as entity-escaped XML (related: #659).

Alternatively, the standard trick is to "guard" the CDATA opening and closing tags by commenting them out in JavaScript, like so:

<script>
// <![CDATA[
foo
// ]]>
</script>

For now, the only workaround is to output HTML (-ashtml), which doesn't have this problem.

utrenkner commented 6 years ago

I would like to confirm: This is a problem also for us. Last week, we switched an old website to the new(er) way of including Google Analytics, which is via Google's tag manager. This is part of the code to include in each web page:

<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-12345678-1');
</script>

And all of the sudden, Google Analytics worked only intermittently. I then noticed that the code between <script>.. </script> had been escaped with a CDATA section. And both, Chrome and Firefox, reported this is a JavaScript error. It took me quite some time to figure out which tool included this CDATA code: tidy-html5.

I followed @dechamps suggestion to add -ashtml which effectively solved the problem for us. However, it would be great if tidy could get an option to not add this CDATA code in the first place.

jcubic commented 5 years ago

@utrenkner With JavaScript it's no problem you just have to use type="text/javascript".

EDIT:

The problem I'm having is JSON-LD

  <script type="application/ld+json">
  {
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "15 Pytań na rozmowę rekrutacyjną z CSS",
  "image": {
    "@type": "ImageObject",
    "url": "/img/css-interview.jpg",
    "width": 800,
    "height": 464
  }
  }
  </script>

and I want to have xhtml self closing tags on output html, so -ashtml option don't work.

markafoltz commented 5 years ago

tidy-html5 also breaks inline script used to customize Respec (W3C spec tooling). -ashtml still inserts the tags on version 'next'.

Ref: https://github.com/w3c/presentation-api/blob/gh-pages/Makefile

geoffmcl commented 5 years ago

@dechamps sorry for the rather belated response... somehow I overlooked this... perhaps related to #659...

This certainly seems an issue related to DOCTYPE... diff HTML5 and previous versions, and HTML versus XHTML... AND diff PCDATA, Parsed Character Data and CDATA, Unparsed Character Data...

Tidy sees the input sample as XML, as XHTML5 is its declaration...

The -ashtml option requests HTML output, so the CDATA guard will not be added... this seems correct...

Then this request to stop tidy inserting CDATA in XHTML seems a problem! Why do that? Spec/Ref links needed...

And maybe even more of a problem, further inserting // javascript comments, as well... again Spec/Ref links needed...

Saw a suggestion somewhere where this too could be in a comment, like <!--//-->, to maintain XHTML... wow... do not think so...

In HTML5, you can put just about anything you like in <script>, like -

<script>
var v = 5;
if ( v < 10 && ((v & 0x1) == 1) ) {
        // do something
}
</script>

But NOT if you want XHTML conformance... without escaping, at least the < and & chars...

Unless I am mis-reading this request, or have missed some W3C REF, the best solution, as @mfoltzgoogle points to, is to accept HTML output...

Again sorry for the delay... @mfoltzgoogle comment jogged the memory...

Look forward to further feedback, comments, patches, PRs, etc to add to this... thanks...

markafoltz commented 5 years ago

Unless I'm missing something, the document I'm tidying [1] is not XHTML:

<!DOCTYPE html>

So I don't expect tidy to process it as XML and insert CDATA markup for inline script.

[1] https://w3c.github.io/presentation-api/

Magiweb commented 5 years ago

Confirm I'm getting the same thing happening. It actually breaks Facebook's standard tracking pixel script. Commenting the added CDATA tags just causes a JSON parse error from FB, so that doesn't work, and as @geoffmcl pointed out, is not a good solution.

Agree with @mfoltzgoogle - my doctype is set to html, so surely it shouldn't be processing it as XHTML?

geoffmcl commented 5 years ago

@mfoltzgoogle well I too seem missing something... ;=))

When I wget the link you gave, and process it with tidy, it correctly treats it as HTML5 ...

Some warning that do not appear exactly right... link-for... but that can be a separate issue...

And does not add any <![CDATA[ ... ]]> to the script, or style, tags... so can not repeat the problem, with this sample... need a small, consistent, sample that shows this...

However, Tidy will add the <![CDATA[ ... ]]> blocks, if requested with an option - -asxml, -asxhtml, output-xhtml: yes ie convert HTML to well formed XHTML - then you get the XHTML you asked for... what can be wrong with this? What am I missing???

Just reading the parser.c code, the criteria for setting output to XHTML, is also finding like <html xmlns="http://www.w3.org/1999/xhtml">... as in @dechamps first sample... and not just the DOCTYPE... which it seems can just about be anything...

@Magiweb, agree that just the DOCTYPE should not cause a document to be output as xhtml... can you supply a small sample where it seems to?

So still in the dark as to exactly what is the problem, the request, here... need more info, explanation... samples, with config used, etc, etc...

As stated, look forward to further feedback, comments, patches, PRs, etc to add to this... thanks...

petko commented 4 years ago

It can be a problem is situations where for example you need both the XML parsing of an XHTML file (so we use the output-xhtml: yes option), but you also need the same file to be read properly by some older off-screen readers, which do not understand the CDATA sections properly.

So maybe something like an option guard-cdata can be useful here?

ghost commented 1 year ago

Additionally, there are cases where, as a developer, you want to process an HTML document using XSLT, which requires a well-formed XML document. However, you don't want to include the <![CDATA section because it breaks browsers. This results in a situation that is seemingly impossible to resolve: you can't output HTML because the HTML is not well-formed (e.g., the <br> and <meta> tags are not self-closing) and you can't output as XML because of the <![CDATA section.

It would be nice to have a bit more control over whether <![CDATA is written when using XML (or XHTML) as the output format.

Aside, in our situation, we have full control over the <style> section of the HTML page, so we know the <![CDATA instruction is superfluous. Meaning, although using // <![CDATA would solve the problem, we'd prefer to suppress it altogether.

linusjf commented 5 months ago

This is additionally an issue when using type="importmap". That really screws up my React components.