Open dechamps opened 6 years ago
I would like to confirm: This is a problem also for us. Last week, we switched an old website to the new(er) way of including Google Analytics, which is via Google's tag manager. This is part of the code to include in each web page:
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-12345678-1');
</script>
And all of the sudden, Google Analytics worked only intermittently. I then noticed that the code between <script>
.. </script>
had been escaped with a CDATA section. And both, Chrome and Firefox, reported this is a JavaScript error. It took me quite some time to figure out which tool included this CDATA code: tidy-html5.
I followed @dechamps suggestion to add -ashtml
which effectively solved the problem for us. However, it would be great if tidy could get an option to not add this CDATA code in the first place.
@utrenkner With JavaScript it's no problem you just have to use type="text/javascript"
.
EDIT:
The problem I'm having is JSON-LD
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "15 Pytań na rozmowę rekrutacyjną z CSS",
"image": {
"@type": "ImageObject",
"url": "/img/css-interview.jpg",
"width": 800,
"height": 464
}
}
</script>
and I want to have xhtml self closing tags on output html, so -ashtml
option don't work.
tidy-html5 also breaks inline script used to customize Respec (W3C spec tooling). -ashtml still inserts the tags on version 'next'.
Ref: https://github.com/w3c/presentation-api/blob/gh-pages/Makefile
@dechamps sorry for the rather belated response... somehow I overlooked this... perhaps related to #659...
This certainly seems an issue related to DOCTYPE... diff HTML5 and previous versions, and HTML
versus XHTML
... AND diff PCDATA, Parsed Character Data
and CDATA, Unparsed Character Data
...
Tidy sees the input sample as XML
, as XHTML5
is its declaration...
The -ashtml
option requests HTML
output, so the CDATA guard will not be added... this seems correct...
Then this request to stop tidy
inserting CDATA in XHTML
seems a problem! Why do that? Spec/Ref links needed...
And maybe even more of a problem, further inserting //
javascript comments, as well... again Spec/Ref links needed...
Saw a suggestion somewhere where this too could be in a comment
, like <!--//-->
, to maintain XHTML... wow... do not think so...
In HTML5, you can put just about anything you like in <script>
, like -
<script>
var v = 5;
if ( v < 10 && ((v & 0x1) == 1) ) {
// do something
}
</script>
But NOT if you want XHTML
conformance... without escaping, at least the <
and &
chars...
Unless I am mis-reading this request, or have missed some W3C REF, the best solution, as @mfoltzgoogle points to, is to accept HTML
output...
Again sorry for the delay... @mfoltzgoogle comment jogged the memory...
Look forward to further feedback, comments, patches, PRs, etc to add to this... thanks...
Unless I'm missing something, the document I'm tidying [1] is not XHTML:
<!DOCTYPE html>
So I don't expect tidy to process it as XML and insert CDATA markup for inline script.
Confirm I'm getting the same thing happening. It actually breaks Facebook's standard tracking pixel script. Commenting the added CDATA tags just causes a JSON parse error from FB, so that doesn't work, and as @geoffmcl pointed out, is not a good solution.
Agree with @mfoltzgoogle - my doctype is set to html, so surely it shouldn't be processing it as XHTML?
@mfoltzgoogle well I too seem missing something... ;=))
When I wget
the link you gave, and process it with tidy
, it correctly treats it as HTML5
...
Some warning that do not appear exactly right... link-for
... but that can be a separate issue...
And does not add any <![CDATA[ ... ]]>
to the script, or style, tags... so can not repeat the problem, with this sample... need a small, consistent, sample that shows this...
However, Tidy
will add the <![CDATA[ ... ]]>
blocks, if requested with an option - -asxml
, -asxhtml
, output-xhtml: yes
ie convert HTML to well formed XHTML - then you get the XHTML
you asked for... what can be wrong with this? What am I missing???
Just reading the parser.c code, the criteria for setting output to XHTML
, is also finding like <html xmlns="http://www.w3.org/1999/xhtml">
... as in @dechamps first sample... and not just the DOCTYPE
... which it seems can just about be anything...
@Magiweb, agree that just the DOCTYPE
should not cause a document to be output as xhtml
... can you supply a small sample where it seems to?
So still in the dark as to exactly what is the problem, the request, here... need more info, explanation... samples, with config used, etc, etc...
As stated, look forward to further feedback, comments, patches, PRs, etc to add to this... thanks...
It can be a problem is situations where for example you need both the XML parsing of an XHTML file (so we use the output-xhtml: yes
option), but you also need the same file to be read properly by some older off-screen readers, which do not understand the CDATA
sections properly.
So maybe something like an option guard-cdata
can be useful here?
Additionally, there are cases where, as a developer, you want to process an HTML document using XSLT, which requires a well-formed XML document. However, you don't want to include the <![CDATA
section because it breaks browsers. This results in a situation that is seemingly impossible to resolve: you can't output HTML because the HTML is not well-formed (e.g., the <br>
and <meta>
tags are not self-closing) and you can't output as XML because of the <![CDATA
section.
It would be nice to have a bit more control over whether <![CDATA
is written when using XML (or XHTML) as the output format.
Aside, in our situation, we have full control over the <style>
section of the HTML page, so we know the <![CDATA
instruction is superfluous. Meaning, although using // <![CDATA
would solve the problem, we'd prefer to suppress it altogether.
This is additionally an issue when using type="importmap". That really screws up my React components.
On current HEAD (f0438bd):
Returns:
Tidy decided to wrap the contents of the
<script>
element inside a CDATA section. This is fine from a pure XML perspective, but in practice, browsers tend to not like that at all since they will typically parse the document as HTML, not XHTML (that's true for at least modern Firefox). They will not recognize the<![CDATA[
sequence and will try to parse it as JavaScript, leading to parse errors and a broken script.I think the most straightforward way to solve this problem is to forget about CDATA sections and simply escape the
<script>
contents as entity-escaped XML (related: #659).Alternatively, the standard trick is to "guard" the CDATA opening and closing tags by commenting them out in JavaScript, like so:
For now, the only workaround is to output HTML (
-ashtml
), which doesn't have this problem.