htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

Tidy wrongly outputs an XML declaration when producing HTML #658

Open dechamps opened 6 years ago

dechamps commented 6 years ago

On current HEAD (f0438bd):

$ tidy --output-html yes <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title></title>
<meta charset="utf-8" />
</head><body>
<img src="foo.jpg" alt="Foo" />
<br />
</body></html>
EOF

Returns:

Info: Document content looks like HTML5
No warnings or errors were found.

<?xml version="1.0"?>
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.7.0">
<title></title>
<meta charset="utf-8">
</head>
<body>
<img src="foo.jpg" alt="Foo"><br>
</body>
</html>

First of all, the "Info: Document content looks like HTML5" message is a bit confusing, because it's the output that's HTML5, not the input (which is XHTML5), but that's neither here nor there.

What's more problematic is that Tidy outputs an XML declaration as if it were producing an XML file, even though the output is not well-formed XML at all (which makes sense, since I asked for HTML). The resulting document is self-contradictory: it opens with an XML declaration but its content is definitely not XML.

Tidy should never generate an XML declaration when the output is HTML (as opposed to XHTML).

I tried to use --add-xml-decl no, but that doesn't have any effect, as explained in the documentation:

Note that if the input already includes an <?xml ... ?> declaration then this option will be ignored.
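
Until this is fixed, one possible workaround is to strip the declaration from the output after the fact. This is only a minimal sketch, assuming a POSIX shell with sed and that the declaration sits alone on the first line of the output (as in the example above); `strip_xml_decl` and `input.xhtml` are hypothetical names, not part of Tidy:

```shell
# Hypothetical helper (not part of Tidy): delete the first line of stdin
# if, and only if, it consists solely of an XML declaration.
strip_xml_decl() {
  sed '1{/^<?xml[^>]*>[[:space:]]*$/d;}'
}

# Usage, assuming tidy is on PATH and input.xhtml is a placeholder file name:
# tidy --output-html yes input.xhtml | strip_xml_decl
```

Only the first line is examined, so anything resembling a declaration later in the document is left alone.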

geoffmcl commented 6 years ago

@dechamps thanks for the issue...

In simple terms, if tidy thinks it should output HTML5, or is told to, then maybe it should not add the XML preamble... it seems to have no place in a pure HTML5 document...

Perhaps it should issue a warning, something like - "outputting HTML5 so deleting the xml preamble" - or something like that... if one is found in the input...

And I'm still thinking about how --add-xml-decl yes|no (default no), TidyXmlDecl, should or should not influence this...

Also the -asxml, -asxhtml and -ashtml options interact... need to explore those...

Look forward to further feedback, patches, or a PR to achieve this... thanks...

bespired commented 2 weeks ago

This issue still seems to be open. I'm setting --add-xml-decl: no, and the output file still gets an added XML declaration:

<?xml version="1.0" encoding="utf-8">
<!DOCTYPE html>
<html prefix="og: https://ogp.me/ns#" lang="nl">
<head>
...

Is this not easy to fix? I'm using HTML Tidy for Apple macOS version 5.8.0